Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Tom Lane
Tomas Vondra  writes:
> On 14.5.2014 22:29, Andres Freund wrote:
>> I'm afraid it's more in the year range from what i've seen. I.e. not
>> practical.

> Yeah, that wouldn't be very practical. I'll try to run it though and if
> it'd run more than a few days, I'll switch it to CLOBBER_CACHE_ALWAYS.

I think it might be interesting to try running CLOBBER_CACHE_RECURSIVELY
on some subset of the regression tests.  But I'm not sure which subset,
nor whether the buildfarm scripts could readily be coaxed to do that.
Let's see how far you get with a full run before you get bored ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Tomas Vondra
On 14.5.2014 22:29, Andres Freund wrote:
> Hi,
> 
> On 2014-05-14 21:04:41 +0200, Tomas Vondra wrote:
>> On 14.5.2014 17:52, Andres Freund wrote:
>>> On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
 On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
> Apparently there's something wrong with 'test-decoding-check':

 Man. I shouldn't have asked... My code. There's some output in there
 that's probably triggered by the extraordinarily long runtimes, but
 there's definitely something else wrong.
 My gut feeling says it's in RelationGetIndexList().
>>>
>>> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
>>>
>>> Tomas, thanks for that. I've never (and probably will never) run
>>> CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
>>> regularly is really helpful. How long does a single testrun take? It
>>> takes hundreds of seconds here to do a single UPDATE?
>>
>> Don't know yet, as it fails at the beginning.
> 
> test decoding is at the beginning? That's somewhat odd?

Judging from the timestamps of log files, it seems to be running after
pg_upgrade checks. Or maybe I'm reading that wrong.

>> But I suppose it will be
>> tens or possibly hundreds of hours. For example these are the logs from
>> regular build (no clobber etc.)
> 
>> May 14 19:00 SCM-checkout.log
>> May 14 19:00 githead.log
>> May 14 19:00 configure.log
>> May 14 19:00 config.log
>> May 14 19:05 make.log
>> May 14 19:05 check.log
>> May 14 19:06 make-contrib.log
>> May 14 19:06 make-install.log
>> May 14 19:06 install-contrib.log
>> May 14 19:07 check-pg_upgrade.log
>> May 14 19:08 test-decoding-check.log
>>
>> while these are the logs from recursive clobber:
>>
>> May 14 00:19 SCM-checkout.log
>> May 14 00:20 configure.log
>> May 14 00:20 config.log
>> May 14 00:26 make.log
>> May 14 03:12 check.log
>> May 14 03:13 make-contrib.log
>> May 14 03:13 make-install.log
>> May 14 03:13 install-contrib.log
>> May 14 08:25 check-pg_upgrade.log
>> May 14 09:07 test-decoding-check.log
>> May 14 09:07 web-txn.data
>>
>>
>> So with the regular build, it took <1 minute to do 'make check' and ~1
>> minute to test pg_upgrade, with recursive clobber it takes ~3 hours and
>> ~5 hours. That's a factor of ~300, although it's a very rough
>> estimate.
> 
> I seriously doubt that's recursive clobber. That should take *way* much
> longer. And indeed you have:
> 
>> -DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY -DMEMORY_CONTEXT_CHECKING
>> -DRANDOMIZE_ALLOCATED_MEMORY -DCLOBBER_CACHE_RECURSIVELY
>>
>> it does not happen with
>>
>> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY
>> -DMEMORY_CONTEXT_CHECKING -DRANDOMIZE_ALLOCATED_MEMORY',
> 
> #if defined(CLOBBER_CACHE_ALWAYS)
>   {
>   static bool in_recursion = false;
> 
>   if (!in_recursion)
>   {
>   in_recursion = true;
>   InvalidateSystemCaches();
>   in_recursion = false;
>   }
>   }
> #elif defined(CLOBBER_CACHE_RECURSIVELY)
>   InvalidateSystemCaches();
> #endif
> 
> i.e. you can't specifiy -DCLOBBER_CACHE_ALWAYS and
> -DCLOBBER_CACHE_RECURSIVELY together. The former will take precedence.

Oh, I see :-/

>> Without clobber the whole run (for a "C" locale) takes ~10 minutes, so
>> my estimate is ~50 hours for the recursive one. But I wouldn't be
>> surprised by 100 hours.
> 
> I'm afraid it's more in the year range from what i've seen. I.e. not
> practical.

Yeah, that wouldn't be very practical. I'll try to run it though and if
it'd run more than a few days, I'll switch it to CLOBBER_CACHE_ALWAYS.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Andres Freund
Hi,

On 2014-05-14 21:04:41 +0200, Tomas Vondra wrote:
> On 14.5.2014 17:52, Andres Freund wrote:
> > On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
> >> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
> >>> Apparently there's something wrong with 'test-decoding-check':
> >>
> >> Man. I shouldn't have asked... My code. There's some output in there
> >> that's probably triggered by the extraordinarily long runtimes, but
> >> there's definitely something else wrong.
> >> My gut feeling says it's in RelationGetIndexList().
> > 
> > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
> > 
> > Tomas, thanks for that. I've never (and probably will never) run
> > CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
> > regularly is really helpful. How long does a single testrun take? It
> > takes hundreds of seconds here to do a single UPDATE?
> 
> Don't know yet, as it fails at the beginning.

test decoding is at the beginning? That's somewhat odd?

> But I suppose it will be
> tens or possibly hundreds of hours. For example these are the logs from
> regular build (no clobber etc.)

> May 14 19:00 SCM-checkout.log
> May 14 19:00 githead.log
> May 14 19:00 configure.log
> May 14 19:00 config.log
> May 14 19:05 make.log
> May 14 19:05 check.log
> May 14 19:06 make-contrib.log
> May 14 19:06 make-install.log
> May 14 19:06 install-contrib.log
> May 14 19:07 check-pg_upgrade.log
> May 14 19:08 test-decoding-check.log
> 
> while these are the logs from recursive clobber:
> 
> May 14 00:19 SCM-checkout.log
> May 14 00:20 configure.log
> May 14 00:20 config.log
> May 14 00:26 make.log
> May 14 03:12 check.log
> May 14 03:13 make-contrib.log
> May 14 03:13 make-install.log
> May 14 03:13 install-contrib.log
> May 14 08:25 check-pg_upgrade.log
> May 14 09:07 test-decoding-check.log
> May 14 09:07 web-txn.data
> 
> 
> So with the regular build, it took <1 minute to do 'make check' and ~1
> minute to test pg_upgrade, with recursive clobber it takes ~3 hours and
> ~5 hours. That's a factor of ~300, although it's a very rough
> estimate.

I seriously doubt that's recursive clobber. That should take *way* much
longer. And indeed you have:

> -DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY -DMEMORY_CONTEXT_CHECKING
> -DRANDOMIZE_ALLOCATED_MEMORY -DCLOBBER_CACHE_RECURSIVELY
> 
> it does not happen with
> 
> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY
> -DMEMORY_CONTEXT_CHECKING -DRANDOMIZE_ALLOCATED_MEMORY',

#if defined(CLOBBER_CACHE_ALWAYS)
{
static bool in_recursion = false;

if (!in_recursion)
{
in_recursion = true;
InvalidateSystemCaches();
in_recursion = false;
}
}
#elif defined(CLOBBER_CACHE_RECURSIVELY)
InvalidateSystemCaches();
#endif

i.e. you can't specifiy -DCLOBBER_CACHE_ALWAYS and
-DCLOBBER_CACHE_RECURSIVELY together. The former will take precedence.

> Without clobber the whole run (for a "C" locale) takes ~10 minutes, so
> my estimate is ~50 hours for the recursive one. But I wouldn't be
> surprised by 100 hours.

I'm afraid it's more in the year range from what i've seen. I.e. not
practical.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Tomas Vondra
On 14.5.2014 17:52, Andres Freund wrote:
> On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
>> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
>>> Apparently there's something wrong with 'test-decoding-check':
>>
>> Man. I shouldn't have asked... My code. There's some output in there
>> that's probably triggered by the extraordinarily long runtimes, but
>> there's definitely something else wrong.
>> My gut feeling says it's in RelationGetIndexList().
> 
> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
> 
> Tomas, thanks for that. I've never (and probably will never) run
> CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
> regularly is really helpful. How long does a single testrun take? It
> takes hundreds of seconds here to do a single UPDATE?

Don't know yet, as it fails at the beginning. But I suppose it will be
tens or possibly hundreds of hours. For example these are the logs from
regular build (no clobber etc.)

May 14 19:00 SCM-checkout.log
May 14 19:00 githead.log
May 14 19:00 configure.log
May 14 19:00 config.log
May 14 19:05 make.log
May 14 19:05 check.log
May 14 19:06 make-contrib.log
May 14 19:06 make-install.log
May 14 19:06 install-contrib.log
May 14 19:07 check-pg_upgrade.log
May 14 19:08 test-decoding-check.log

while these are the logs from recursive clobber:

May 14 00:19 SCM-checkout.log
May 14 00:20 configure.log
May 14 00:20 config.log
May 14 00:26 make.log
May 14 03:12 check.log
May 14 03:13 make-contrib.log
May 14 03:13 make-install.log
May 14 03:13 install-contrib.log
May 14 08:25 check-pg_upgrade.log
May 14 09:07 test-decoding-check.log
May 14 09:07 web-txn.data


So with the regular build, it took <1 minute to do 'make check' and ~1
minute to test pg_upgrade, with recursive clobber it takes ~3 hours and
~5 hours. That's a factor of ~300, although it's a very rough estimate.

Without clobber the whole run (for a "C" locale) takes ~10 minutes, so
my estimate is ~50 hours for the recursive one. But I wouldn't be
surprised by 100 hours.

> 
> There were some more differences but those are all harmless and caused
> by the extraordinarily long runtime (autovacuums). I think we need to
> add a feature to test_decoding to suppress displaying transactions
> without changes. Ick.
> 

I expect to hit more timing-related issues with the recursive clobber
tests - not necessarily in the code/tests itself, but I guess the
buildfarm tooling doesn't really expect runs that long.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Tom Lane
Andres Freund  writes:
> On 2014-05-14 12:15:27 -0400, Tom Lane wrote:
>> And why does the header
>> comment for RelationGetIndexList make no mention of this new side-effect?
>> Somebody did a seriously poor job of adding this functionality to
>> relcache.

> It's not like it's not documented: There's a comment about it in struct
> RelationData. I must have missed that rd_oidindex has a comment abou
> it's lifetime over RelationGetIndexList().

If rd_replidindex is being managed like rd_oidindex, then it should be
managed just like rd_oidindex, including getting reset in all the places
rd_oidindex is.  This might be just a matter of cleanliness but I think
it's important for readability and debuggability.

I notice also that rd_keyattr and rd_idattr have been implemented with
bad copies of the logic for rd_indexattr.  This is at least leading
to a permanent memory leak in CacheMemoryContext during every relcache
flush, and maybe worse things.  The bugs for rd_keyattr appear to predate
your patch though.

Working on a patch for this now.  One thing I'm wondering about is
RelationSetIndexList.  It's probably okay for it not to touch rd_keyattr
and rd_idattr, but I'm not too clear on what the use cases for those
attnum sets are.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Andres Freund
On 2014-05-14 13:32:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2014-05-14 12:15:27 -0400, Tom Lane wrote:
> >> And why does the header
> >> comment for RelationGetIndexList make no mention of this new side-effect?
> >> Somebody did a seriously poor job of adding this functionality to
> >> relcache.
> 
> > It's not like it's not documented: There's a comment about it in struct
> > RelationData. I must have missed that rd_oidindex has a comment abou
> > it's lifetime over RelationGetIndexList().
> 
> If rd_replidindex is being managed like rd_oidindex, then it should be
> managed just like rd_oidindex, including getting reset in all the places
> rd_oidindex is.  This might be just a matter of cleanliness but I think
> it's important for readability and debuggability.

Agreed. I am not against resetting it. I think I might not have been
aware of rd_oidindex when writing that code...

> I notice also that rd_keyattr and rd_idattr have been implemented with
> bad copies of the logic for rd_indexattr.  This is at least leading
> to a permanent memory leak in CacheMemoryContext during every relcache
> flush, and maybe worse things.  The bugs for rd_keyattr appear to predate
> your patch though.

Hm. Yes, the bitmapsets should be freed. I guess I copied the logic for
keyattr and didn't find any relevant places that touch it. rd_keyattr
should go back to 9.3.

> Working on a patch for this now.  One thing I'm wondering about is
> RelationSetIndexList.  It's probably okay for it not to touch rd_keyattr
> and rd_idattr, but I'm not too clear on what the use cases for those
> attnum sets are.

rd_keyattr is used to determine whether a heap_update() changed any keys
that could be referenced by a foreign key. That's then used to determine
which locklevel an update requires.
rd_idattr does something similar. It decides whether the configured
REPLICA IDENTITY key has changed so whether to log the old primary key
for logical decoding or not.

I can't see why either would need to care about forced index lists right
now, but will do a scan of the sources to see if I am wrong.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Andres Freund
On 2014-05-14 12:15:27 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
> >> My gut feeling says it's in RelationGetIndexList().
> 
> > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
> 
> TBH, I don't believe this patch at all.  Where exactly is rd_replidindex
> reset?  If it's supposed to have similar lifespan to, say, rd_oidindex,
> why isn't it being handled like rd_oidindex? 

I don't see why it'd have a different lifespan than rd_oidindex at all?
If the latter were used inside the loop it'd be a bug as well.

> And why does the header
> comment for RelationGetIndexList make no mention of this new side-effect?
> Somebody did a seriously poor job of adding this functionality to
> relcache.

It's not like it's not documented: There's a comment about it in struct
RelationData. I must have missed that rd_oidindex has a comment abou
it's lifetime over RelationGetIndexList().
I personally actually prefer the struct as the location for the
lifetime. I can send a patch to commonalize the location in either place
with the other location pointing to it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Tom Lane
Andres Freund  writes:
> On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
>> My gut feeling says it's in RelationGetIndexList().

> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.

TBH, I don't believe this patch at all.  Where exactly is rd_replidindex
reset?  If it's supposed to have similar lifespan to, say, rd_oidindex,
why isn't it being handled like rd_oidindex?  And why does the header
comment for RelationGetIndexList make no mention of this new side-effect?
Somebody did a seriously poor job of adding this functionality to
relcache.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cache invalidation bug in RelationGetIndexAttrBitmap()

2014-05-14 Thread Andres Freund
On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
> > Apparently there's something wrong with 'test-decoding-check':
> 
> Man. I shouldn't have asked... My code. There's some output in there
> that's probably triggered by the extraordinarily long runtimes, but
> there's definitely something else wrong.
> My gut feeling says it's in RelationGetIndexList().

Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.

Tomas, thanks for that. I've never (and probably will never) run
CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
regularly is really helpful. How long does a single testrun take? It
takes hundreds of seconds here to do a single UPDATE?

There were some more differences but those are all harmless and caused
by the extraordinarily long runtime (autovacuums). I think we need to
add a feature to test_decoding to suppress displaying transactions
without changes. Ick.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
>From 9dfe879d2b6940b6072e277b0104e9cbe4af691e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 14 May 2014 17:12:57 +0200
Subject: [PATCH] Fix cache invalidation hazard in
 RelationGetIndexAttrBitmap().

When a cache invalidation arrived inside RelationGetIndexAttrBitmap()
inbetween the call to RelationGetIndexList() and one of the
index_open() calls relation->rd_replidindex would be reset leading to
a corrupted INDEX_ATTR_BITMAP_IDENTITY_KEY return value. That in turn
could lead to the old REPLICA IDENTITY not being logged if set to
DEFAULT or INDEX.
---
 src/backend/utils/cache/relcache.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 5ff0d9e..4fc18b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3976,6 +3976,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	List	   *indexoidlist;
 	ListCell   *l;
 	MemoryContext oldcxt;
+	Oid			relreplindex;
 
 	/* Quick exit if we already computed the result. */
 	if (relation->rd_indexattr != NULL)
@@ -4005,6 +4006,12 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 		return NULL;
 
 	/*
+	 * Store after computing the index list above, to be safe against cache
+	 * flushes inside index_open() below.
+	 */
+	relreplindex = relation->rd_replidindex;
+
+	/*
 	 * For each index, add referenced attributes to indexattrs.
 	 *
 	 * Note: we consider all indexes returned by RelationGetIndexList, even if
@@ -4038,7 +4045,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 			indexInfo->ii_Predicate == NIL;
 
 		/* Is this index the configured (or default) replica identity? */
-		isIDKey = indexOid == relation->rd_replidindex;
+		isIDKey = indexOid == relreplindex;
 
 		/* Collect simple attribute references */
 		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
-- 
2.0.0.rc2.4.g1dc51c6.dirty


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers