Re: [GENERAL] Where art thou pg_clog?

2007-02-22 Thread Greg Sabino Mullane



> The real bottom line here, and one I'll reiterate every chance I get,
> is that we don't make updates to back branches because we're too bored
> to have anything else to do.  If you're on 8.1.5, and the current
> release in that branch is 8.1.8, then you're missing some bug fixes
> that are probably significant.

Just as a data point, I came across this very same problem (corrupted
tuple header, invalid xlog file) on an 8.1.3 system that was NOT
running autovacuum (and never had). Amazingly, this occurred on the
very same day as the original poster's failure. The server was upgraded
to 8.2.3, after some creating-bogus-xlog-file pain to extract all the
data, and all is well again.

-- 
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200702221021
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] Where art thou pg_clog?

2007-02-16 Thread Alvaro Herrera
Casey Duncan wrote:
> 
> On Feb 15, 2007, at 5:50 PM, Alvaro Herrera wrote:

> >Hum, yeah, I forgot to mention that you need to create the 098E pg_clog
> >segment for that to work at all :-)  Fill it with byte 0x55 till the
> >needed position, which is the bit pattern for "all transactions
> >committed".  I'd make sure to remove it manually after the freeze is
> >done, just in case!  (I think the system would remove it at next
> >checkpoint, but anyway.)
> 
> That seems a bit scary to do on a running production server. Could I  
> get away with dropping the template0 database and loading one from  
> another identical pg instance (or a new one) or will that freak  
> things out?

If you haven't modified template1 since initdb, you can recreate
template0 by just dropping it and copying from template1; then do a
VACUUM FREEZE and set datallowconn to false.  That's what initdb does.
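
The sequence described above can be sketched roughly as follows. This is only a sketch under stated assumptions: the function name is made up for illustration, and the two `datistemplate` updates are an assumption that does not come from this thread (DROP DATABASE may refuse to drop a database that is still flagged as a template). The function only prints the statements so they can be reviewed first; no other sessions should be connected to template1 while it is being copied.

```shell
# Hypothetical helper: prints (does not run) the statements for rebuilding
# template0 from an unmodified template1.
recreate_template0_sql() {
    cat <<'EOF'
-- Assumption: clear the template flag first so the drop is not refused.
UPDATE pg_database SET datistemplate = false WHERE datname = 'template0';
DROP DATABASE template0;
CREATE DATABASE template0 TEMPLATE template1;
\connect template0
VACUUM FREEZE;
\connect template1
-- Restore the flags that template0 normally carries.
UPDATE pg_database SET datistemplate = true, datallowconn = false
 WHERE datname = 'template0';
EOF
}

recreate_template0_sql
# After reviewing the output, something like:
#   recreate_template0_sql | psql -U postgres template1
```

The freeze is done before `datallowconn` is set back to false, because no connection to template0 is possible afterwards.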

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [GENERAL] Where art thou pg_clog?

2007-02-16 Thread Casey Duncan


On Feb 15, 2007, at 5:50 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
> > On Feb 15, 2007, at 5:21 PM, Alvaro Herrera wrote:
> > > Casey Duncan wrote:
> > >
> > > To fix the problem, set pg_database.datallowconn=true for template0,
> > > then connect to it and do a VACUUM FREEZE.  Then set datallowconn=false
> > > again.
> >
> > Do you mean to do this after upgrading to 8.1.8? If I try that in
> > 8.1.5, I get (unsurprisingly):
> >
> > % psql -U postgres template0 -c "vacuum freeze"
> > ERROR:  could not access status of transaction 2565134864
> > DETAIL:  could not open file "pg_clog/098E": No such file or directory
>
> Hum, yeah, I forgot to mention that you need to create the 098E pg_clog
> segment for that to work at all :-)  Fill it with byte 0x55 till the
> needed position, which is the bit pattern for "all transactions
> committed".  I'd make sure to remove it manually after the freeze is
> done, just in case!  (I think the system would remove it at next
> checkpoint, but anyway.)

That seems a bit scary to do on a running production server. Could I
get away with dropping the template0 database and loading one from
another identical pg instance (or a new one) or will that freak
things out?


-Casey



Re: [GENERAL] Where art thou pg_clog?

2007-02-16 Thread Tom Lane
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Casey Duncan wrote:
>> I'm curious how template0 got stomped on.

> Heh :-)  Sorry, they are all my bugs.  I guess you should be throwing
> stones at me or something.

The pre-8.1 theory was that template0 is (supposed to be) cleanly frozen
and hence never needs vacuumed at all.  The post-8.1 theory is that
template0 gets autovacuumed when necessary to prevent wraparound, just
like every other database.  8.1 unfortunately is somewhere in the
middle, because under circumstances-I-don't-remember-at-the-moment,
autovacuum might decide to process template0 and then leave non-frozen
XIDs therein.  Which is a problem because the clog-truncation logic
didn't think it needed to consider template0 when deciding if old clog
segments could be thrown away.  We live and learn.

The real bottom line here, and one I'll reiterate every chance I get,
is that we don't make updates to back branches because we're too bored
to have anything else to do.  If you're on 8.1.5, and the current
release in that branch is 8.1.8, then you're missing some bug fixes
that are probably significant.

regards, tom lane



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Alvaro Herrera
Casey Duncan wrote:
> 
> On Feb 15, 2007, at 5:21 PM, Alvaro Herrera wrote:
> 
> >Casey Duncan wrote:

> >To fix the problem, set pg_database.datallowconn=true for template0,
> >then connect to it and do a VACUUM FREEZE.  Then set datallowconn=false
> >again.
> 
> Do you mean to do this after upgrading to 8.1.8? If I try that in
> 8.1.5, I get (unsurprisingly):
> 
> % psql -U postgres template0 -c "vacuum freeze"
> ERROR:  could not access status of transaction 2565134864
> DETAIL:  could not open file "pg_clog/098E": No such file or directory

Hum, yeah, I forgot to mention that you need to create the 098E pg_clog
segment for that to work at all :-)  Fill it with byte 0x55 till the
needed position, which is the bit pattern for "all transactions
committed".  I'd make sure to remove it manually after the freeze is
done, just in case!  (I think the system would remove it at next
checkpoint, but anyway.)
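
For anyone wanting to check the arithmetic behind the 098E name and the "needed position", here is a minimal sketch. It assumes the usual clog layout of 2 status bits per transaction, 8192-byte pages and 32 pages per segment file; those constants, the function names, and the choice to round the fill up to a whole page are assumptions, not something stated in this thread. Byte 0x55 is binary 01010101, i.e. four transactions each marked 01 (committed); it is also the ASCII letter "U" (octal 125), which is what `tr` writes below.

```shell
# Assumed layout: 4 transactions per byte, 8192-byte pages, 32 pages/segment.
XACTS_PER_BYTE=4
PAGE_BYTES=8192
PAGES_PER_SEGMENT=32
XACTS_PER_SEGMENT=$(( XACTS_PER_BYTE * PAGE_BYTES * PAGES_PER_SEGMENT ))

# Name of the pg_clog segment file that holds the status of a given XID.
clog_segment_name() { printf '%04X\n' $(( $1 / XACTS_PER_SEGMENT )); }

# Byte offset of that XID's status bits inside the segment file.
clog_byte_offset() { echo $(( ($1 % XACTS_PER_SEGMENT) / XACTS_PER_BYTE )); }

# Hypothetical helper: write 0x55 bytes to PATH up to the end of the page
# containing XID.  Usage: make_committed_segment PATH XID
make_committed_segment() {
    offset=$(clog_byte_offset "$2")
    nbytes=$(( (offset / PAGE_BYTES + 1) * PAGE_BYTES ))
    head -c "$nbytes" /dev/zero | tr '\000' '\125' > "$1"
}

clog_segment_name 2565134864   # prints 098E
clog_byte_offset 2565134864    # prints 79492
```

The file-writing helper is deliberately not called here; it would be pointed at `pg_clog/098E` in the data directory only after double-checking the numbers, and the file removed again after the freeze as suggested above.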

You can do it either after or before upgrading; it's the same.  The only
thing that changes in 8.1.7 is that an upcoming vacuum would not forget
the FREEZE.


> >>I'm curious how template0 got stomped on. Certainly nothing's been
> >>changing it. Of course it might just be some random bug so the fact
> >>it landed on a file for template0 could be completely arbitrary.
> >
> >The problem is that all databases are vacuumed every so many
> >transactions, to avoid Xid wraparound problems; even non connectable
> >databases.  The problem is that a bug in autovacuum caused that vacuum
> >operation to neglect using the FREEZE flag; this negligence makes it
> >leave non-permanent Xids in the tables, leading to the problem you're
> >seeing.
> 
> Ironically we were earlier bitten by the bug that autovacuum didn't
> do the cluster-wide vacuum until too late. Now we got bitten by the
> fact that it did do the cluster-wide vacuum. Talk about
> damned-if-you-do-and-damned-if-you-don't! 8^)

Heh :-)  Sorry, they are all my bugs.  I guess you should be throwing
stones at me or something.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Casey Duncan


On Feb 15, 2007, at 5:21 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
> > > > Interestingly I can manually vacuum that table in all of the
> > > > databases on this machine without provoking the error.
> > >
> > > Except template0 I presume?  Is this autovacuum running in template0
> > > perchance?  I note that 800 million transactions have passed since the
> > > Xid in the error message was current.
> >
> > Wouldn't you know it! A little farther back up in the log file:
> >
> > 2007-02-15 14:20:48.480 PST LOG:  autovacuum: processing database
> > "template0"
> > 2007-02-15 14:20:48.480 PST DEBUG:  StartTransaction
> > 2007-02-15 14:20:48.480 PST DEBUG:  name: unnamed; blockState:
> > DEFAULT; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
> > children: <>
> > 2007-02-15 14:20:48.481 PST DEBUG:  autovacuum: VACUUM FREEZE whole
> > database
> > 2007-02-15 14:20:48.481 PST DEBUG:  CommitTransaction
> > 2007-02-15 14:20:48.481 PST DEBUG:  name: unnamed; blockState:
> > STARTED; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
> > children: <>
>
> This is a bug we fixed in 8.1.7.  I suggest you update to the latest of
> the 8.1 series, to get that fix among others.

ok, great.

> To fix the problem, set pg_database.datallowconn=true for template0,
> then connect to it and do a VACUUM FREEZE.  Then set datallowconn=false
> again.

Do you mean to do this after upgrading to 8.1.8? If I try that in
8.1.5, I get (unsurprisingly):

% psql -U postgres template0 -c "vacuum freeze"
ERROR:  could not access status of transaction 2565134864
DETAIL:  could not open file "pg_clog/098E": No such file or directory

> > I'm curious how template0 got stomped on. Certainly nothing's been
> > changing it. Of course it might just be some random bug so the fact
> > it landed on a file for template0 could be completely arbitrary.
>
> The problem is that all databases are vacuumed every so many
> transactions, to avoid Xid wraparound problems; even non connectable
> databases.  The problem is that a bug in autovacuum caused that vacuum
> operation to neglect using the FREEZE flag; this negligence makes it
> leave non-permanent Xids in the tables, leading to the problem you're
> seeing.

Ironically we were earlier bitten by the bug that autovacuum didn't
do the cluster-wide vacuum until too late. Now we got bitten by the
fact that it did do the cluster-wide vacuum. Talk about
damned-if-you-do-and-damned-if-you-don't! 8^)

ok, this is a much better sounding explanation than "random data
corruption" ;^)


Thanks!

-Casey



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Alvaro Herrera
Casey Duncan wrote:

> >>Interestingly I can manually vacuum that table in all of the
> >>databases on this machine without provoking the error.
> >
> >Except template0 I presume?  Is this autovacuum running in template0
> >perchance?  I note that 800 million transactions have passed since the
> >Xid in the error message was current.
> 
> Wouldn't you know it! A little farther back up in the log file:
> 
> 2007-02-15 14:20:48.480 PST LOG:  autovacuum: processing database  
> "template0"
> 2007-02-15 14:20:48.480 PST DEBUG:  StartTransaction
> 2007-02-15 14:20:48.480 PST DEBUG:  name: unnamed; blockState:
> DEFAULT; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,  
> children: <>
> 2007-02-15 14:20:48.481 PST DEBUG:  autovacuum: VACUUM FREEZE whole  
> database
> 2007-02-15 14:20:48.481 PST DEBUG:  CommitTransaction
> 2007-02-15 14:20:48.481 PST DEBUG:  name: unnamed; blockState:
> STARTED; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,  
> children: <>

This is a bug we fixed in 8.1.7.  I suggest you update to the latest of
the 8.1 series, to get that fix among others.

To fix the problem, set pg_database.datallowconn=true for template0,
then connect to it and do a VACUUM FREEZE.  Then set datallowconn=false
again.
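
Spelled out as commands, the three steps might look like the sketch below. The helper name is hypothetical, the `postgres` superuser is assumed from the psql invocation used elsewhere in this thread, and running the UPDATEs from template1 is an arbitrary choice (any connectable database should do). The helper only prints the commands. If the VACUUM FREEZE itself fails on a missing pg_clog segment, that has to be dealt with first.

```shell
# Hypothetical helper: prints the three psql invocations, one per line.
# Usage: freeze_template0_cmds [superuser]
freeze_template0_cmds() {
    pguser=${1:-postgres}
    cat <<EOF
psql -U $pguser template1 -c "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'"
psql -U $pguser template0 -c "VACUUM FREEZE"
psql -U $pguser template1 -c "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'"
EOF
}

freeze_template0_cmds postgres
```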


> I'm curious how template0 got stomped on. Certainly nothing's been  
> changing it. Of course it might just be some random bug so the fact  
> it landed on a file for template0 could be completely arbitrary.  

The problem is that all databases are vacuumed every so many
transactions, to avoid Xid wraparound problems; even non connectable
databases.  The problem is that a bug in autovacuum caused that vacuum
operation to neglect using the FREEZE flag; this negligence makes it
leave non-permanent Xids in the tables, leading to the problem you're
seeing.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Casey Duncan


On Feb 15, 2007, at 2:44 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
> > On Feb 15, 2007, at 1:46 PM, Alvaro Herrera wrote:
> > [..]
> > > Can you relate it to autovacuum?
> >
> > Maybe. Here's what I get when I crank up the logging to debug4:
> >
> > 2007-02-15 14:20:48.771 PST DEBUG:  StartTransaction
> > 2007-02-15 14:20:48.771 PST DEBUG:  name: unnamed; blockState:
> > DEFAULT; state: INPROGR, xid/subid/cid: 3429052708/1/0, nestlvl: 1,
> > children: <>
> > 2007-02-15 14:20:48.771 PST DEBUG:  vacuuming "pg_catalog.pg_statistic"
> > 2007-02-15 14:20:48.771 PST ERROR:  could not access status of
> > transaction 2565134864
> > 2007-02-15 14:20:48.772 PST DETAIL:  could not open file
> > "pg_clog/098E": No such file or directory
> > 2007-02-15 14:20:48.772 PST DEBUG:  proc_exit(0)
> > 2007-02-15 14:20:48.772 PST DEBUG:  shmem_exit(0)
> > 2007-02-15 14:20:48.773 PST DEBUG:  exit(0)
> > 2007-02-15 14:20:48.775 PST DEBUG:  reaping dead processes
> >
> > does that imply that it is the pg_statistic table that is hosed?
> >
> > Interestingly I can manually vacuum that table in all of the
> > databases on this machine without provoking the error.
>
> Except template0 I presume?  Is this autovacuum running in template0
> perchance?  I note that 800 million transactions have passed since the
> Xid in the error message was current.

Wouldn't you know it! A little farther back up in the log file:

2007-02-15 14:20:48.480 PST LOG:  autovacuum: processing database
"template0"
2007-02-15 14:20:48.480 PST DEBUG:  StartTransaction
2007-02-15 14:20:48.480 PST DEBUG:  name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
children: <>
2007-02-15 14:20:48.481 PST DEBUG:  autovacuum: VACUUM FREEZE whole
database
2007-02-15 14:20:48.481 PST DEBUG:  CommitTransaction
2007-02-15 14:20:48.481 PST DEBUG:  name: unnamed; blockState:
STARTED; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
children: <>


fwiw, I did a cluster-wide vacuum on 1/20/2007. Not sure if that has  
any impact on anything, just thought I'd throw it out there.


I'm curious how template0 got stomped on. Certainly nothing's been  
changing it. Of course it might just be some random bug so the fact  
it landed on a file for template0 could be completely arbitrary.  
Anyhow it does seem curious to me.


-Casey



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Alvaro Herrera
Casey Duncan wrote:
> 
> On Feb 15, 2007, at 1:46 PM, Alvaro Herrera wrote:
> 
> >Casey Duncan wrote:
> >>We have a production system with multiple identical database
> >>instances on the same hardware, with the same configuration, running
> >>databases with the exact same schema. They each have different data,
> >>but the database sizes and load patterns are almost exactly the same.
> >>
> >>We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh
> >>well ;^) and since then we have noticed the following error on two of
> >>the servers:
> >>
> >>2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> >>transaction 2565134864
> >>2007-02-15 00:35:03.325 PST DETAIL:  could not open file
> >>"pg_clog/098E": No such file or directory
> >
> >Can you relate it to autovacuum?
> 
> Maybe. Here's what I get when I crank up the logging to debug4:
> 
> 2007-02-15 14:20:48.771 PST DEBUG:  StartTransaction
> 2007-02-15 14:20:48.771 PST DEBUG:  name: unnamed; blockState:
> DEFAULT; state: INPROGR, xid/subid/cid: 3429052708/1/0, nestlvl: 1,  
> children: <>
> 2007-02-15 14:20:48.771 PST DEBUG:  vacuuming "pg_catalog.pg_statistic"
> 2007-02-15 14:20:48.771 PST ERROR:  could not access status of
> transaction 2565134864
> 2007-02-15 14:20:48.772 PST DETAIL:  could not open file
> "pg_clog/098E": No such file or directory
> 2007-02-15 14:20:48.772 PST DEBUG:  proc_exit(0)
> 2007-02-15 14:20:48.772 PST DEBUG:  shmem_exit(0)
> 2007-02-15 14:20:48.773 PST DEBUG:  exit(0)
> 2007-02-15 14:20:48.775 PST DEBUG:  reaping dead processes
> 
> does that imply that it is the pg_statistic table that is hosed?
> 
> Interestingly I can manually vacuum that table in all of the  
> databases on this machine without provoking the error.

Except template0 I presume?  Is this autovacuum running in template0
perchance?  I note that 800 million transactions have passed since the
Xid in the error message was current.
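
The "800 million" figure can be reproduced from the two Xids in the quoted log: the one in the error message and the one the vacuuming transaction itself was running under. The sketch below takes the difference modulo 2^32, since Xids are 32-bit counters that wrap around; it ignores the handful of reserved Xid values, and the function name is made up for illustration.

```shell
# Number of transactions between an older XID and the current one,
# computed modulo 2^32 so it stays correct across a wraparound.
# Usage: xid_distance CURRENT_XID OLD_XID
xid_distance() { echo $(( ($1 - $2 + 4294967296) % 4294967296 )); }

xid_distance 3429052708 2565134864   # prints 863917844
```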

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Casey Duncan


On Feb 15, 2007, at 1:46 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
> > We have a production system with multiple identical database
> > instances on the same hardware, with the same configuration, running
> > databases with the exact same schema. They each have different data,
> > but the database sizes and load patterns are almost exactly the same.
> >
> > We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh
> > well ;^) and since then we have noticed the following error on two of
> > the servers:
> >
> > 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> > transaction 2565134864
> > 2007-02-15 00:35:03.325 PST DETAIL:  could not open file
> > "pg_clog/098E": No such file or directory
>
> Can you relate it to autovacuum?

Maybe. Here's what I get when I crank up the logging to debug4:

2007-02-15 14:20:48.771 PST DEBUG:  StartTransaction
2007-02-15 14:20:48.771 PST DEBUG:  name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 3429052708/1/0, nestlvl: 1,
children: <>
2007-02-15 14:20:48.771 PST DEBUG:  vacuuming "pg_catalog.pg_statistic"
2007-02-15 14:20:48.771 PST ERROR:  could not access status of
transaction 2565134864
2007-02-15 14:20:48.772 PST DETAIL:  could not open file
"pg_clog/098E": No such file or directory
2007-02-15 14:20:48.772 PST DEBUG:  proc_exit(0)
2007-02-15 14:20:48.772 PST DEBUG:  shmem_exit(0)
2007-02-15 14:20:48.773 PST DEBUG:  exit(0)
2007-02-15 14:20:48.775 PST DEBUG:  reaping dead processes

does that imply that it is the pg_statistic table that is hosed?

Interestingly I can manually vacuum that table in all of the
databases on this machine without provoking the error.


-Casey




Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Casey Duncan

On Feb 15, 2007, at 1:50 PM, Peter Eisentraut wrote:

> Casey Duncan wrote:
> > 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> > transaction 2565134864
> > 2007-02-15 00:35:03.325 PST DETAIL:  could not open file
> > "pg_clog/098E": No such file or directory
> >
> > The first time this happened, I chalked it up to some kind of disk
> > corruption based on the mailing list archives. So I dumped the
> > databases, did a fresh initdb, forced an fsck (these run with a jfs
> > data partition and an ext2 wal partition) which found no problems and
> > then reloaded the databases.
> >
> > Now about a week later
>
> Unless you actually executed 2565134864 transactions in that one week,
> this is still data corruption.
>
> Check for faulty memory.

I'd be more inclined to agree with you if it happened on only one
server machine. But this has now happened on two different machines
in the space of a week.

My understanding is that the transaction id logged is garbage because
the bookkeeping fields have been clobbered for some tuple(s). The one
last week was really low (like < 1000).


-Casey



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Peter Eisentraut
Casey Duncan wrote:
> 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> transaction 2565134864
> 2007-02-15 00:35:03.325 PST DETAIL:  could not open file
> "pg_clog/098E": No such file or directory
>
> The first time this happened, I chalked it up to some kind of disk
> corruption based on the mailing list archives. So I dumped the
> databases, did a fresh initdb, forced an fsck (these run with a jfs
> data partition and an ext2 wal partition) which found no problems and
> then reloaded the databases.
>
> Now about a week later

Unless you actually executed 2565134864 transactions in that one week, 
this is still data corruption.

Check for faulty memory.
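
To put a number on that: a quick back-of-the-envelope calculation of the sustained rate needed to reach that Xid in one week. It assumes the counter started near zero after the fresh initdb and counts exactly seven days; the function name is made up for illustration.

```shell
# Whole transactions per second needed to consume XID transaction IDs
# in exactly one week (integer division, so the result is rounded down).
required_tps_for_week() { echo $(( $1 / (7 * 24 * 60 * 60) )); }

required_tps_for_week 2565134864   # prints 4241
```

That is, a bit over 4200 transactions per second, around the clock, for the whole week.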

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [GENERAL] Where art thou pg_clog?

2007-02-15 Thread Alvaro Herrera
Casey Duncan wrote:
> We have a production system with multiple identical database  
> instances on the same hardware, with the same configuration, running  
> databases with the exact same schema. They each have different data,  
> but the database sizes and load patterns are almost exactly the same.
> 
> We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh  
> well ;^) and since then we have noticed the following error on two of  
> the servers:
> 
> 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> transaction 2565134864
> 2007-02-15 00:35:03.325 PST DETAIL:  could not open file
> "pg_clog/098E": No such file or directory

Can you relate it to autovacuum?

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
