Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-06-10 Thread Jeff Frost
On May 26, 2012, at 9:17 AM, Tom Lane wrote:

 Would you guys please try this in the problem databases:
 
 select a.ctid, c.relname
 from pg_attribute a join pg_class c on a.attrelid=c.oid
 where c.relnamespace=11 and c.relkind in ('r','i')
 order by 1 desc;
 
 If you see any block numbers above about 20 then maybe the triggering
 condition is a row relocation after all.


Sorry for such a long delay on the reply.  Took a while to get the data 
directory moved elsewhere:

select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;

  ctid   |         relname          
---------+--------------------------
 (18,31) | pg_extension_name_index
 (18,30) | pg_extension_oid_index
 (18,29) | pg_seclabel_object_index
 (18,28) | pg_seclabel_object_index
 (18,27) | pg_seclabel_object_index
 (18,26) | pg_seclabel_object_index

 As the next step, I'd suggest verifying that the stall is reproducible
 if you remove pg_internal.init (and that it's not there otherwise), and
 then strace'ing the single incoming connection to see what it's doing.

It does take a little while, but not nearly as long as the stalls we were 
seeing before.  The pg_internal.init is 108k in case that's an interesting data 
point.

Any other tests you'd like me to run on that bad data dir?

Also, thus far, the newly initdb'd system continues to hum along just fine.  
It's also been upgraded to 9.1.4, so if it was the rebuilding of 
pg_internal.init, then your fix should keep it happy.

Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-28 Thread Greg Sabino Mullane
On Sun, May 27, 2012 at 05:44:15PM -0700, Jeff Frost wrote:
 On May 27, 2012, at 12:53 PM, Tom Lane wrote:
  occurring, they'd take long enough to expose the process to sinval
  overrun even with not-very-high DDL rates.
 As it turns out, there are quite a few temporary tables created.

For the record, same here. We do *lots* of DDL (hence the cronjobs 
to vac/reindex system catalogs).

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-28 Thread Tom Lane
Greg Sabino Mullane g...@endpoint.com writes:
 On Sun, May 27, 2012 at 05:44:15PM -0700, Jeff Frost wrote:
 On May 27, 2012, at 12:53 PM, Tom Lane wrote:
 occurring, they'd take long enough to expose the process to sinval
 overrun even with not-very-high DDL rates.

 As it turns out, there are quite a few temporary tables created.

 For the record, same here. We do *lots* of DDL (hence the cronjobs 
 to vac/reindex system catalogs).

I wonder if it could've been something like a transient problem with
a cronjob leading to massive bloat of pg_attribute, eventually
triggering the syncscan issue, then fixed by a successful VAC FULL
before we thought to look closely at the table size.  The syncscan
issue definitely was there in 8.3.5; it's only the question of
pg_attribute size that made me doubt it was happening for you.

regards, tom lane



Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-27 Thread Tom Lane
I've been continuing to poke at this business of relcache-related
startup stalls, and have come to some new conclusions.  One is that
it no longer seems at all likely that the pg_attribute rows for system
catalogs aren't at the front of pg_attribute, because the commands that
might be used to update them are disallowed even to superusers:

foo=# alter table pg_index alter column indnatts set statistics -1;
ERROR:  permission denied: pg_index is a system catalog
foo=# alter table pg_index alter column indnatts set (n_distinct = -1);
ERROR:  permission denied: pg_index is a system catalog

Now a sufficiently stubborn DBA could override that protection (via
direct manual UPDATE on pg_attribute, or by doing ALTER TABLE in a
standalone backend with allowSystemTableMods set).  But it doesn't seem
like something anybody would try in a production database.  So while
I'd still like to see the results of that ctid query in Jeff's broken
database, it seems pretty unlikely that we will see high block numbers.

But, if all those rows are at the front of pg_attribute, how do we
explain the high block numbers seen in Jeff's stack traces?  He posted
three traces that show RelationBuildTupleDesc doing heapscans:
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00186.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00196.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00197.php
and in each one of these, the block number being passed to
ReadBufferExtended is well past where those rows ought to be.
What's more, the three traces are performing the pg_attribute heapscan
for three different catalogs; so to blame this observation on some
rows having gotten relocated to high block numbers, you'd have to
believe that had happened for all three of these catalogs (one of
which is an index, so there'd be no reason to fool with its stats
attributes anyway).

AFAICS, if there are no rows at high block numbers, the only other
possibility is that syncscan mode was active.  (A third explanation is
that some rows are just plain missing, but then the system would have
been erroring out altogether, which it was not.)  That theory requires
us to invent an explanation for the fact that Jeff now observes
pg_attribute to be slightly smaller than 1/4th shared_buffers, but
perhaps autovacuum managed to shrink it at some point after the trouble
happened.

Another thing that can be deduced from those stack traces is that sinval
resets were happening.  For example, in the third message linked above,
the heapscan is being done to load up a relcache entry for relation 2601
(pg_am).  This would be unsurprising, except that stack frames 17 and 18
show this is happening during the fifth call of load_critical_index, and
we should have had both pg_am's reldesc and the syscache entry that's
being fetched at relcache.c:1010 loaded up in the first call.  So those
cache entries have to have gotten blown away by sinval reset.  This is
not terribly surprising if there were a steady flux of DDL happening in
the database, for instance temp table creations/drops.  (Jeff, can you
comment on that?)  If heapscans over the whole of pg_attribute were
occurring, they'd take long enough to expose the process to sinval
overrun even with not-very-high DDL rates.

Now the interesting thing about that is that if an sinval reset occurs
at any time while relcache.c is trying to load up initial relcache
entries, it will not try to write a new relcache init file.  This means
that a steady but not very large flow of DDL activity from existing
sessions in the database would be sufficient to prevent incoming
sessions from ever writing pg_internal.init, and thus the stall behavior
could persist for a long time, which is something I hadn't been able to
explain before.  (If this were not happening, the stalls would disappear
as soon as any incoming session successfully got through its stall.)

So it appears to me that Jeff's problem is adequately explained if we
assume (1) pg_attribute a bit larger than 1/4th shared_buffers, and
(2) steady flow of DDL activity from existing sessions; and that if you
want to dispute (1) you have to invent some other explanation for the
observed attempts to read high block numbers in pg_attribute.  Also,
if there were lots of temp table creation/dropping going on, this could
result in lots of dead rows at the end of pg_attribute, which might
explain how autovacuum could've shortened the relation later.
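
For reference, one way to see how much dead space pg_attribute is carrying
at a given moment is to ask the stats views (a sketch for illustration; the
n_live_tup/n_dead_tup columns exist from 8.3 on):

select relname, n_live_tup, n_dead_tup,
       pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute')) as heap_size
from pg_stat_all_tables
where schemaname = 'pg_catalog' and relname = 'pg_attribute';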

Now this theory does not explain Greg's problem, because his
pg_attribute was apparently many times too small to make syncscans
credible, or to make them take very long even if they did happen.
But in view of the fact that that was 8.3.5 :-(, and that the problem
was observed in conjunction with corruption of system indexes, I don't
feel a compulsion to assume that Greg's problem was the same as Jeff's.
Trawling through the commit logs I find several post-8.3.5 fixes for
race conditions and other bugs in relcache initialization, 

Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-27 Thread Jeff Frost
On May 27, 2012, at 12:53 PM, Tom Lane wrote:

 Another thing that can be deduced from those stack traces is that sinval
 resets were happening.  For example, in the third message linked above,
 the heapscan is being done to load up a relcache entry for relation 2601
 (pg_am).  This would be unsurprising, except that stack frames 17 and 18
 show this is happening during the fifth call of load_critical_index, and
 we should have had both pg_am's reldesc and the syscache entry that's
 being fetched at relcache.c:1010 loaded up in the first call.  So those
 cache entries have to have gotten blown away by sinval reset.  This is
 not terribly surprising if there were a steady flux of DDL happening in
 the database, for instance temp table creations/drops.  (Jeff, can you
 comment on that?)  If heapscans over the whole of pg_attribute were
 occurring, they'd take long enough to expose the process to sinval
 overrun even with not-very-high DDL rates.

As it turns out, there are quite a few temporary tables created.

During the busiest periods of the day, this system averages 1.75 temp tables 
per second.
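
For illustration, a point-in-time count of the temp tables currently defined
across all sessions can be taken with a query along these lines (a sketch;
not how the per-second figure above was derived):

select count(*) as temp_tables
from pg_class c
     join pg_namespace n on c.relnamespace = n.oid
where n.nspname like 'pg_temp%'
  and c.relkind = 'r';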

Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-26 Thread Greg Sabino Mullane
On Fri, May 25, 2012 at 07:02:42PM -0400, Tom Lane wrote:
 However, the remaining processes trying to
 compute new init files would still have to complete the process, so I'd
 expect there to be a diminishing effect --- the ones that were stalling
 shouldn't all release exactly together.  Unless there is some additional
 effect that's syncing them all.  (I wonder for instance if the syncscan
 logic is kicking in here.)

How fast would you expect that to happen? As far as I could tell, they all 
released at once, or at least within probably 15 seconds of each other; 
I wasn't running ps constantly. I could check the logs and get a better 
figure if you think it's an important data point.

 One interesting question is why there's a thundering herd of new
 arrivals in the first place.  IIRC you said you were using a connection
 pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
 drop and reopen all its connections simultaneously.

No, we are not. Or rather, there is some pooling, but there is also a 
fairly large influx of new connections. As far as I could tell, the 
few existing connections were not affected.

 1. Somebody decides to update one of those rows, and it gets dropped in
 some remote region of the table.  The only really plausible reason for
 this is deciding to fool with the column-specific stats target
 (attstattarget) of a system catalog.  Does that sound like something
 either of you might have done?

No, zero chance of this, barring some rogue intruder on the network 
with a strange sense of humor.

 pg_attribute just enough smaller to avoid the scenario.  Not sure about
 Greg's case, but he should be able to tell us the size of pg_attribute
 and his shared_buffers setting ...

pg_attribute around 5 MB (+6MB indexes), shared_buffers 4GB. However, 
there is a *lot* of churn in pg_attribute and pg_class, mostly due 
to lots of temporary tables.

P.S. Hmmm that's weird, I just double-checked the above and pg_attribute 
is now 52MB/70MB (the original figures were from yesterday). At any rate, 
nowhere near 1/4 shared buffers.

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-26 Thread Tom Lane
Greg Sabino Mullane g...@endpoint.com writes:
 On Fri, May 25, 2012 at 07:02:42PM -0400, Tom Lane wrote:
 pg_attribute just enough smaller to avoid the scenario.  Not sure about
 Greg's case, but he should be able to tell us the size of pg_attribute
 and his shared_buffers setting ...

 pg_attribute around 5 MB (+6MB indexes), shared_buffers 4GB. However, 
 there is a *lot* of churn in pg_attribute and pg_class, mostly due 
 to lots of temporary tables.

 P.S. Hmmm that's weird, I just double-checked the above and pg_attribute 
 is now 52MB/70MB (the original figures were from yesterday). At any rate, 
 nowhere near 1/4 shared buffers.

Hmph.  And Jeff swears his isn't large enough to trigger syncscans
either.  This could all be due to the thundering herd effect of a lot
of processes all doing the same only-moderately-expensive-in-itself
thing; except it's hard to see how the problem gets rolling unless the
single-process cache reload time is already a lot more than the
inter-arrival time.

Would you guys please try this in the problem databases:

select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;

If you see any block numbers above about 20 then maybe the triggering
condition is a row relocation after all.

regards, tom lane



Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-26 Thread Greg Sabino Mullane
On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
 If you see any block numbers above about 20 then maybe the triggering
 condition is a row relocation after all.

Highest was 13.

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-26 Thread Tom Lane
Greg Sabino Mullane g...@endpoint.com writes:
 On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
 If you see any block numbers above about 20 then maybe the triggering
 condition is a row relocation after all.

 Highest was 13.

Hm ... but wait, you said you'd done a VACUUM FULL on the catalogs.
So it's not clear whether this is reflective of the state at the time
the problem was happening.

regards, tom lane



Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-26 Thread Greg Sabino Mullane
On Sat, May 26, 2012 at 01:25:29PM -0400, Tom Lane wrote:
 Greg Sabino Mullane g...@endpoint.com writes:
  On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
  If you see any block numbers above about 20 then maybe the triggering
  condition is a row relocation after all.
 
  Highest was 13.
 
 Hm ... but wait, you said you'd done a VACUUM FULL on the catalogs.
 So it's not clear whether this is reflective of the state at the time
 the problem was happening.

True. I'll try to get a high water mark when (er...if!) it happens again.

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-25 Thread Greg Sabino Mullane
 Yeah, this is proof that what it was doing is the same as what we saw in
 Jeff's backtrace, ie loading up the system catalog relcache entries the
 hard way via seqscans on the core catalogs.  So the question to be
 answered is why that's suddenly a big performance bottleneck.  It's not
 a cheap operation of course (that's why we cache the results ;-)) but
 it shouldn't take minutes either.  And, because they are seqscans, it
 doesn't seem like messed-up indexes should matter.

FWIW, this appeared to be an all-or-nothing event: either every new backend 
was suffering through this, or none were. They all seemed to clear up 
at the same time as well.

 The theory I have in mind about Jeff's case is that it was basically an
 I/O storm, but it's not clear whether the same explanation works for
 your case.  There may be some other contributing factor that we haven't
 identified yet.

Let me know if you think of anything particular I can test while it is 
happening again. I'll try to arrange a (netapp) snapshot the next time 
it happens as well (this system is too busy and too large to do anything 
else).

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-25 Thread Tom Lane
Greg Sabino Mullane g...@endpoint.com writes:
 Yeah, this is proof that what it was doing is the same as what we saw in
 Jeff's backtrace, ie loading up the system catalog relcache entries the
 hard way via seqscans on the core catalogs.  So the question to be
 answered is why that's suddenly a big performance bottleneck.  It's not
 a cheap operation of course (that's why we cache the results ;-)) but
 it shouldn't take minutes either.  And, because they are seqscans, it
 doesn't seem like messed-up indexes should matter.

 FWIW, this appeared to be an all-or-nothing event: either every new backend 
 was suffering through this, or none were. They all seemed to clear up 
 at the same time as well.

Mostly not surprising.  They'd definitely all hit the missing init file
at the same time, so the stalling would start consistently for all.  And
once any one process successfully created a new file, subsequent incoming
sessions wouldn't stall.  However, the remaining processes trying to
compute new init files would still have to complete the process, so I'd
expect there to be a diminishing effect --- the ones that were stalling
shouldn't all release exactly together.  Unless there is some additional
effect that's syncing them all.  (I wonder for instance if the syncscan
logic is kicking in here.)

One interesting question is why there's a thundering herd of new
arrivals in the first place.  IIRC you said you were using a connection
pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
drop and reopen all its connections simultaneously.

 Let me know if you think of anything particular I can test while it is 
 happening again.

I just noticed something that might explain the persistency of the
effect as observed by Jeff.  The code that seqscans pg_attribute (in
RelationBuildTupleDesc) knows how many rows it's expecting to find for
a given catalog, and it falls out of the seqscan loop as soon as it's
gotten them all.  Now, the rows belonging to core system catalogs are
generally right near the front of pg_attribute, being the oldest rows in
that catalog, which means that generally this results in not having to
seqscan very far even if pg_attribute is large.  It strikes me though
that there are at least two ways that nice property could get broken,
resulting in much seqscan work if pg_attribute is large:

1. Somebody decides to update one of those rows, and it gets dropped in
some remote region of the table.  The only really plausible reason for
this is deciding to fool with the column-specific stats target
(attstattarget) of a system catalog.  Does that sound like something
either of you might have done?  You could check it by looking at the
ctid columns of the pg_attribute rows for system catalogs, and seeing
if any have large block numbers.

2. If the syncscan logic were to kick in and cause some backend to pick
up its seqscan of pg_attribute from a point beyond where some of the
target rows are, that backend would have to scan all of pg_attribute,
circling back around to the start, before it would find all the rows it
seeks.  And most likely this effect would lather-rinse-repeat for each
catalog it's seeking the pg_attribute entries for.  Not only does this
mean a much-worse-than-normal startup time for that backend, but any
other ones that arrive while the synchronized scan is in progress would
be caught in the undertow and likewise spend a long time to get their
results.

So point 2 is suddenly looking like a really attractive theory for
explaining what happened.  As we were just remarking in an adjacent
thread, the threshold for this to start happening would be for
pg_attribute to get larger than one-fourth of shared_buffers; the
syncscan logic doesn't kick in for relations smaller than that.  IIRC
this is close enough to the numbers Jeff mentioned to make it plausible
that it happened to him, and plausible that his new installation has
pg_attribute just enough smaller to avoid the scenario.  Not sure about
Greg's case, but he should be able to tell us the size of pg_attribute
and his shared_buffers setting ...
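
(For anyone wanting to check their own numbers, the two quantities being
compared can be pulled with a query like this; a sketch -- the syncscan
threshold is roughly shared_buffers/4:)

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute')) as pg_attribute_heap,
       current_setting('shared_buffers') as shared_buffers;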

If this is the explanation, then it's easy enough to deal with point 2
--- just disable syncscan for these searches.  I don't see an easy
answer for problems of ilk #1, other than don't do that in a database
that's been around for awhile.
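
(A stats-target change of the kind described in point 1 is also easy to check
for directly; a sketch of such a query, which should return zero rows on an
untouched database:)

select c.relname, a.attname, a.attstattarget
from pg_attribute a join pg_class c on a.attrelid = c.oid
where c.relnamespace = 11 and c.relkind = 'r'
  and a.attnum > 0 and a.attstattarget <> -1;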

Another idea we might consider is to prevent the thundering herd effect
by not having all the incoming backends try to update pg_internal.init
independently.  Instead let the first one take a lock while it's doing
that, and the rest just queue up on that lock.  They'd be stalled
anyway, and they are not helping matters by duplicating the first one's
effort.  This is a rather more invasive change though.

regards, tom lane



Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-25 Thread Jeff Frost

On May 25, 2012, at 4:02 PM, Tom Lane wrote:

 Greg Sabino Mullane g...@endpoint.com writes:
 Yeah, this is proof that what it was doing is the same as what we saw in
 Jeff's backtrace, ie loading up the system catalog relcache entries the
 hard way via seqscans on the core catalogs.  So the question to be
 answered is why that's suddenly a big performance bottleneck.  It's not
 a cheap operation of course (that's why we cache the results ;-)) but
 it shouldn't take minutes either.  And, because they are seqscans, it
 doesn't seem like messed-up indexes should matter.
 
 FWIW, this appeared to be an all-or-nothing event: either every new backend 
 was suffering through this, or none were. They all seemed to clear up 
 at the same time as well.
 
 Mostly not surprising.  They'd definitely all hit the missing init file
 at the same time, so the stalling would start consistently for all.  And
 once any one process successfully created a new file, subsequent incoming
 sessions wouldn't stall.  However, the remaining processes trying to
 compute new init files would still have to complete the process, so I'd
 expect there to be a diminishing effect --- the ones that were stalling
 shouldn't all release exactly together.  Unless there is some additional
 effect that's syncing them all.  (I wonder for instance if the syncscan
 logic is kicking in here.)

In our customer's case, the size of pg_attribute was a little less than 1/4 of 
shared_buffers, so it might not be the syncscan?

BTW, in our case, I thought to take the system down to single user mode and 
reindex these.  When the indexes were disabled, I immediately experienced the 
slow startup, so it certainly seems like an issue with seq scanning these.

I'll see if I can reproduce that behavior by starting up with system indexes 
disabled.  This probably won't happen until Tuesday when we get that data 
directory moved to a test server.

In our customer's case, it would happen for a while, then stop happening for 
some time... presumably this was after the caching; then it would start up 
again... presumably after something invalidated the cache.

Switching from the master to the streaming replica made the situation better, 
but not go away.

Then a full initdb solved the problem. I bet a vacuum full of pg_attribute 
would've done the trick though.


 

 
 1. Somebody decides to update one of those rows, and it gets dropped in
 some remote region of the table.  The only really plausible reason for
 this is deciding to fool with the column-specific stats target
 (attstattarget) of a system catalog.  Does that sound like something
 either of you might have done?  You could check it by looking at the
 ctid columns of the pg_attribute rows for system catalogs, and seeing
 if any have large block numbers.
 


Definitely wasn't done by me and I'm pretty sure the customer wouldn't have 
done that either.




---
Jeff Frost j...@pgexperts.com
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 


Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-25 Thread Tom Lane
Jeff Frost j...@pgexperts.com writes:
 In our customer's case, the size of pg_attribute was a little less than 1/4 
 of shared_buffers, so might not be the syncscan?

Could you go back and double check that?  If the shared_buffers setting
were 7GB not 8GB, that would fall right between the pg_attribute sizes
you posted before.  I'm getting somewhat convinced that this is the
right answer, because I've been able to reproduce an unexpectedly long
stall with multiple clients connecting simultaneously to an
init-file-less database whose pg_attribute is large enough to trigger
syncscans.  The particular case I'm testing has pg_attribute of about
1GB (in a machine with just 4GB RAM, so I'm not going to push it up
much further).  If I just remove the init file and connect with psql,
there's about a 1-second startup delay, which is bad enough; but if
I throw 50 concurrent connections at it with a hacked-up version of
pgbench, it takes about 45 seconds for all of them to get through
startup.  (You need to hack pgbench to suppress the single
initialization connection it normally likes to make, else the test
degenerates to the one-incoming-connection case.)
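
(For anyone wanting to set up something similar: a crude way to grow
pg_attribute in a scratch database is just to create a pile of many-column
tables.  This is a rough sketch, not the exact method used above; scale the
loop count until pg_attribute exceeds shared_buffers/4.  It needs 9.0+ for DO;
on older releases generate the statements externally.)

do $$
begin
  for i in 1..20000 loop
    -- each table adds roughly 16 pg_attribute rows (10 columns + system columns)
    execute 'create table bloat_' || i ||
            ' (c1 int, c2 int, c3 int, c4 int, c5 int,
               c6 text, c7 text, c8 text, c9 text, c10 text)';
  end loop;
end
$$;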

I think that a big chunk of this is coming from the syncscan logic
defeating the early-exit optimization in RelationBuildTupleDesc;
but it still seems like there's another inefficiency somewhere.
Maybe our syncscan logic just really sucks given enough backends
trying to piggyback on the same syncscan.  Now that I can reproduce it,
I'll take a closer look at that.

In the meantime, it looks like a trivial workaround is to disable
synchronize_seqscans via postgresql.conf.
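
(Concretely, that workaround is just a postgresql.conf change plus a reload;
a sketch -- note that a per-session SET can't help here, since the stall
happens before the session can issue any commands:)

-- in postgresql.conf:
--     synchronize_seqscans = off
-- then, from any superuser session (or use pg_ctl reload):
select pg_reload_conf();
-- verify in a new session:
show synchronize_seqscans;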

regards, tom lane



Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-25 Thread Jeff Frost

On May 25, 2012, at 7:12 PM, Tom Lane wrote:

 Jeff Frost j...@pgexperts.com writes:
 In our customer's case, the size of pg_attribute was a little less than 1/4 
 of shared_buffers, so might not be the syncscan?
 
 Could you go back and double check that?  If the shared_buffers setting
 were 7GB not 8GB,

It definitely started happening with 8GB of shared_buffers.

We actually tried reducing shared_buffers to 2GB to see if we were writing out 
too many dirty buffers at checkpoint time or something, but that had no effect, 
so we put it back to 8GB.


[HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-24 Thread Greg Sabino Mullane
Yesterday I had a client that experienced a sudden high load on 
one of their servers (8.3.5 - yes, I know. Those of you with 
clients will understand). When I checked, almost all connections 
were in a startup state, very similar to this thread:

http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

Running a strace showed a lot of semop activity, and the logs showed a
successful connection, then a 5 minute plus wait before a query was issued.
So obviously, blocking on something. Unlike the thread above, I *did* find
problems in the system catalogs. For example, both pg_class and pg_index
gave warnings like this for every index during a VACUUM FULL
VERBOSE tablename:

WARNING: index pg_class_relname_nsp_index contains 7712 row versions,
  but table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.

A REINDEX did not solve the problem (initially), as a REINDEX followed
by a VAC showed the same warning and hint.
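
(For concreteness, the catalog maintenance being described is along these
lines; a sketch -- the exact commands run on this 8.3.5 system weren't posted:)

vacuum full verbose pg_catalog.pg_class;
reindex table pg_catalog.pg_class;
reindex table pg_catalog.pg_index;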

The next step was dropping to standalone mode, but before that could
be done, the REINDEXes fixed the problem (no warnings, no stalled 
connections). So my questions are:

* Why would a REINDEX not fix the problem as the hint suggested?
Why would it then start working?

* What exactly is the relationship between bad indexes and shared
memory locks?

* Is there some other emergency fix when it happens, such as killing
all backends and hoping you kill the one that is actually holding
the lock (if any was)?

* Did anything in the 8.3 series fix this? I saw nothing relevant in
the release notes for everything up to 8.3.18 (which it will be on soon).

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-24 Thread Tom Lane
Greg Sabino Mullane g...@endpoint.com writes:
 Yesterday I had a client that experienced a sudden high load on 
 one of their servers (8.3.5 - yes, I know. Those of you with 
 clients will understand). When I checked, almost all connections 
 were in a startup state, very similar to this thread:

 http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

 Running a strace showed a lot of semop activity, and the logs showed a
 successful connection, then a 5 minute plus wait before a query was issued.
 So obviously, blocking on something.

Did you check I/O activity?  I looked again at Jeff Frost's report and
now think that what he saw was probably a lot of seqscans on bloated
system catalogs, cf
http://archives.postgresql.org/message-id/28484.1337887...@sss.pgh.pa.us

 Unlike the thread above, I *did* find
 problems in the system catalogs. For example, both pg_class and pg_index
 gave warnings like this for every index during a VACUUM FULL
 VERBOSE tablename:

 WARNING: index pg_class_relname_nsp_index contains 7712 row versions,
   but table contains 9471 row versions
 HINT:  Rebuild the index with REINDEX.

That's fairly interesting, but if it was a bloat situation then it
would've been the VAC FULL that fixed it rather than the REINDEX.
Did you happen to save the VERBOSE output?  It'd be really useful to
know whether there was any major shrinkage of the core catalogs
(esp. pg_class, pg_attribute).

 * Did anything in the 8.3 series fix this?

I think there are probably two independent issues here.  The missing
index entries are clearly bad but it's not clear that they had anything
to do with the startup stall.  There are a couple of fixes in recent
8.3.x releases that might possibly explain the index corruption,
especially if you're in the habit of reindexing the system catalogs
frequently.

regards, tom lane



Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-24 Thread Greg Sabino Mullane
On Thu, May 24, 2012 at 03:54:54PM -0400, Tom Lane wrote:
 Did you check I/O activity?  I looked again at Jeff Frost's report and
 now think that what he saw was probably a lot of seqscans on bloated
 system catalogs, cf
 http://archives.postgresql.org/message-id/28484.1337887...@sss.pgh.pa.us

Thank you for the response. Yes, we did look at I/O, but nothing unusual 
was seen that would explain the load. If it happens again I'll see if the 
system catalogs are getting a lot of seqscans.

 That's fairly interesting, but if it was a bloat situation then it
 would've been the VAC FULL that fixed it rather than the REINDEX.
 Did you happen to save the VERBOSE output?  It'd be really useful to
 know whether there was any major shrinkage of the core catalogs
 (esp. pg_class, pg_attribute).

I did have them in screen, but the home screen box just crashed a 
few hours ago (after weeks of uptime: Murphy's law). It certainly could 
have been VAC FULL as we have processes that do both VAC FULL and REINDEX 
periodically; I simply assumed based on the HINT it was the REINDEX that 
cleared it up. The only snippet I have is:

# vacuum full verbose pg_class;
INFO:  vacuuming pg_catalog.pg_class
INFO:  pg_class: found 43 removable, 10376 nonremovable row versions in 518 
pages
DETAIL:  6078 dead row versions cannot be removed yet.
Nonremovable row versions range from 160 to 628 bytes long.
There were 7367 unused item pointers.
Total free space (including removable row versions) is 157000 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index pg_class_oid_index now contains 7712 row versions in 32 pages
DETAIL:  6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_class_oid_index contains 7712 row versions, but table 
contains 9471 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  index pg_class_relname_nsp_index now contains 7712 row versions in 113 
pages
DETAIL:  6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index pg_class_relname_nsp_index contains 7712 row versions, but 
table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.
VACUUM
Time: 65.635 ms

Actually, since we have full logging, those warnings should still be 
there... (checks) ... yep. Hints elided, the last one seen was:

WARNING: index pg_class_oid_index contains 14877 row versions,
  but table contains 66648 row versions
WARNING: index pg_class_relname_nsp_index contains 14877 row versions,
  but table contains 66648 row versions
WARNING: index pg_class_oid_index contains 60943 row versions,
  but table contains 112714 row versions
WARNING:  index pg_class_relname_nsp_index contains 60943 row versions,
  but table contains 112714 row versions

Looks like there are some more going back a ways: I'll gather together 
and send those offlist after this.

 I think there are probably two independent issues here.  The missing
 index entries are clearly bad but it's not clear that they had anything
 to do with the startup stall.  There are a couple of fixes in recent
 8.3.x releases that might possibly explain the index corruption,
 especially if you're in the habit of reindexing the system catalogs
 frequently.

Yes, we are in that habit. I hope they are not independent: I'd rather 
have one problem to worry about than two. :) I'll certainly report if I 
see either problem pop up again.

Oh, almost forgot: reading your reply to the old thread reminded me of 
something I saw in one of the straces right as it woke up and left 
the startup state to do some work. Here's a summary:

12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
(x a gazillion)
...
12:18:40 brk(0x1c0af000)= 0x1c0af000
...(some more semops)...
12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 
-1, 0) = 0x2ac062c98000
...(handful of semops)...
12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No such 
file or directory)
12:18:40 open("base/1554846571/pg_internal.init.11803", 
O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0) = 0x2ac062cd9000
12:18:40 write(13, ...
...(normalish looking strace output after this)...

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-24 Thread Greg Sabino Mullane
 I think there are probably two independent issues here.  The missing
 index entries are clearly bad but it's not clear that they had anything
 to do with the startup stall.

On further log digging, I think you are correct, as those index 
warnings go back many days before the startup problems appeared.
Let me know if you'd like any of those warnings from the logs.

-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8




Re: [HACKERS] Backends stalled in 'startup' state: index corruption

2012-05-24 Thread Tom Lane
Greg Sabino Mullane g...@endpoint.com writes:
 Oh, almost forgot: reading your reply to the old thread reminded me of 
 something I saw in one of the straces right as it woke up and left 
 the startup state to do some work. Here's a summary:

 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
 (x a gazillion)
 ...
 12:18:40 brk(0x1c0af000)= 0x1c0af000
 ...(some more semops)...
 12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 
 -1, 0) = 0x2ac062c98000
 ...(handful of semops)...
 12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No 
 such file or directory)
 12:18:40 open("base/1554846571/pg_internal.init.11803", 
 O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
 12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
 12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 
 -1, 0) = 0x2ac062cd9000
 12:18:40 write(13, ...
 ...(normalish looking strace output after this)...

Yeah, this is proof that what it was doing is the same as what we saw in
Jeff's backtrace, ie loading up the system catalog relcache entries the
hard way via seqscans on the core catalogs.  So the question to be
answered is why that's suddenly a big performance bottleneck.  It's not
a cheap operation of course (that's why we cache the results ;-)) but
it shouldn't take minutes either.  And, because they are seqscans, it
doesn't seem like messed-up indexes should matter.

The theory I have in mind about Jeff's case is that it was basically an
I/O storm, but it's not clear whether the same explanation works for
your case.  There may be some other contributing factor that we haven't
identified yet.

regards, tom lane
