[GENERAL] pgstattuple triggered checkpoint failure and database outage?

2009-03-30 Thread Stuart Bishop
I just had a brief outage on a production server running 8.3.6, which
I suspect was triggered by me running a table bloat report making lots
of pgstattuple calls.

The first I got of it was the script I'd just kicked off died:

could not open segment 1 of relation 1663/16409/11088101 (target block
131292): No such file or directory
CONTEXT:  writing block 131292 of relation 1663/16409/11088101

More alerts came in - looks like everything was failing with similar errors.

Checking the logs the first indication of the problem is:

@:6160 2009-03-30 06:49:27 BST LOG:  checkpoint starting: time
[...]
@:6160 2009-03-30 06:49:58 BST ERROR:  could not open segment 1 of
relation 1663/16409/11088101 (target block 131072): No such file or
directory
@:6160 2009-03-30 06:49:58 BST CONTEXT:  writing block 131072 of
relation 1663/16409/11088101
@:6160 2009-03-30 06:49:59 BST LOG:  checkpoint starting: time
@:6160 2009-03-30 06:49:59 BST ERROR:  could not open segment 1 of
relation 1663/16409/11088101 (target block 134984): No such file or
directory
@:6160 2009-03-30 06:49:59 BST CONTEXT:  writing block 134984 of
relation 1663/16409/11088101
@:6160 2009-03-30 06:50:00 BST LOG:  checkpoint starting: time
@:6160 2009-03-30 06:50:01 BST ERROR:  could not open segment 1 of
relation 1663/16409/11088101 (target block 135061): No such file or
directory
@:6160 2009-03-30 06:50:01 BST CONTEXT:  writing block 135061 of
relation 1663/16409/11088101


Doing an immediate shutdown and restart seems to have brought
everything back online. I don't think there is any corruption (not
that I can tell easily...), and I'm not worried if I lost a
transaction or three.

Can anyone think what happened here? I suspect pgstattuple as it was
the only unusual activity happening at that time and as far as I'm
aware we have no hardware alerts and the box has been running smoothly
for quite some time.

-- 
Stuart Bishop stu...@stuartbishop.net
http://www.stuartbishop.net/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

2009-03-30 Thread Tom Lane
Stuart Bishop stu...@stuartbishop.net writes:
 I just had a brief outage on a production server running 8.3.6, which
 I suspect was triggered by me running a table bloat report making lots
 of pgstattuple calls.

 The first I got of it was the script I'd just kicked off died:

 could not open segment 1 of relation 1663/16409/11088101 (target block
 131292): No such file or directory
 CONTEXT:  writing block 131292 of relation 1663/16409/11088101
 ...
 Doing an immediate shutdown and restart seems to have brought
 everything back online.

What's the actual size of that relation now?  Is it growing rapidly?
(I'm trying to figure out whether those writes *should* have succeeded,
or whether the block numbers were corrupt in memory.)

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

2009-03-30 Thread Stuart Bishop

On Tue, Mar 31, 2009 at 8:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Stuart Bishop stu...@stuartbishop.net writes:

I just had a brief outage on a production server running 8.3.6, which
I suspect was triggered by me running a table bloat report making lots
of pgstattuple calls.



The first I got of it was the script I'd just kicked off died:



could not open segment 1 of relation 1663/16409/11088101 (target block
131292): No such file or directory
CONTEXT:  writing block 131292 of relation 1663/16409/11088101
...
Doing an immediate shutdown and restart seems to have brought
everything back online.


What's the actual size of that relation now?  Is it growing rapidly?
(I'm trying to figure out whether those writes *should* have succeeded,
or whether the block numbers were corrupt in memory.)


I can't seem to find a file on disk named 11088101 or an entry in pg_class 
where relfilenode = 11088101.

Are the allocated table oids always increasing? If so, I can pretty much 
guarantee that the missing relation was a temporary table or the index on the 
temporary table. It had a single integer column and maybe 50million rows.

--
Stuart Bishop stu...@stuartbishop.net
http://www.stuartbishop.net/



signature.asc
Description: OpenPGP digital signature


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

2009-03-30 Thread Tom Lane
Stuart Bishop stu...@stuartbishop.net writes:
 On Tue, Mar 31, 2009 at 8:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What's the actual size of that relation now?  Is it growing rapidly?
 (I'm trying to figure out whether those writes *should* have succeeded,
 or whether the block numbers were corrupt in memory.)

 I can't seem to find a file on disk named 11088101 or an entry in pg_class 
 where relfilenode = 11088101.

 Are the allocated table oids always increasing? If so, I can pretty much 
 guarantee that the missing relation was a temporary table or the index on the 
 temporary table. It had a single integer column and maybe 50million rows.

The OIDs increase till they wrap around, so what this sounds like is a
problem with somebody fetching temporary-table blocks into shared memory
(where they should never be), and then things going wrong after the
owning backend drops the temp table (without having cleared out shared
buffers, which it won't do because it doesn't think it needs to).  Can
you say what was the exact command(s) you were using with pgstattuple?

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

2009-03-30 Thread Stuart Bishop
On Tue, Mar 31, 2009 at 11:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Stuart Bishop stu...@stuartbishop.net writes:
 On Tue, Mar 31, 2009 at 8:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What's the actual size of that relation now?  Is it growing rapidly?
 (I'm trying to figure out whether those writes *should* have succeeded,
 or whether the block numbers were corrupt in memory.)

 I can't seem to find a file on disk named 11088101 or an entry in pg_class 
 where relfilenode = 11088101.

 Are the allocated table oids always increasing? If so, I can pretty much 
 guarantee that the missing relation was a temporary table or the index on 
 the temporary table. It had a single integer column and maybe 50million rows.

 The OIDs increase till they wrap around, so what this sounds like is a
 problem with somebody fetching temporary-table blocks into shared memory
 (where they should never be), and then things going wrong after the
 owning backend drops the temp table (without having cleared out shared
 buffers, which it won't do because it doesn't think it needs to).  Can
 you say what was the exact command(s) you were using with pgstattuple?

Get a list of everything I'm interested in:

SELECT pg_class.oid, nspname, relname
FROM pg_class,pg_namespace
WHERE relnamespace = pg_namespace.oid
AND relkind IN ('r', 't', 'i') -- normal table, toast, index
AND nspname IN ('public', '_sl')
ORDER BY nspname, relname

For every result, get the pgstattuple information I'm interested in
for my reports:

SELECT
table_len,
pg_size_pretty(table_len),
dead_tuple_len,
pg_size_pretty(dead_tuple_len),
free_space,
pg_size_pretty(free_space)
FROM pgstattuple(%(oid)s)

I might be passing objects into pgstattuple that it can't handle - the
system exploded before I could tune the table selection criteria. I
notice that my  query will have asked pgstattuple for information
about the temporary table. Which appears to work when testing locally,
but I'm not engineering any race conditions.

-- 
Stuart Bishop stu...@stuartbishop.net
http://www.stuartbishop.net/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general