Re: [HACKERS] panic on 7.3

2006-01-20 Thread Tom Lane
Rick Gigger [EMAIL PROTECTED] writes:
 I got this message:
 2006-01-20 11:50:51 PANIC:  creation of file /var/lib/pgsql/data/pg_clog/0292 failed: File exists

 In 7.3.  It caused the server to restart.

 Can anyone tell me what it means?

7.3.what?

What file names exist in the pg_clog directory?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] panic on 7.3

2006-01-20 Thread Rick Gigger

Postgres version 7.3.4

... a whole bunch of other files
-rw-------  1 postgres postgres 262144 Jan 18 22:42 027D
-rw-------  1 postgres postgres 262144 Jan 19 07:38 027E
-rw-------  1 postgres postgres 262144 Jan 19 08:25 027F
-rw-------  1 postgres postgres 262144 Jan 19 09:07 0280
-rw-------  1 postgres postgres 262144 Jan 19 09:59 0281
-rw-------  1 postgres postgres 262144 Jan 19 11:07 0282
-rw-------  1 postgres postgres 262144 Jan 19 12:22 0283
-rw-------  1 postgres postgres 262144 Jan 19 13:29 0284
-rw-------  1 postgres postgres 262144 Jan 19 14:26 0285
-rw-------  1 postgres postgres 262144 Jan 19 15:58 0286
-rw-------  1 postgres postgres 262144 Jan 19 19:55 0287
-rw-------  1 postgres postgres 262144 Jan 19 23:47 0288
-rw-------  1 postgres postgres 262144 Jan 20 02:35 0289
-rw-------  1 postgres postgres 262144 Jan 20 04:21 028A
-rw-------  1 postgres postgres 262144 Jan 20 06:16 028B
-rw-------  1 postgres postgres 262144 Jan 20 07:20 028C
-rw-------  1 postgres postgres 262144 Jan 20 08:22 028D
-rw-------  1 postgres postgres 262144 Jan 20 09:24 028E
-rw-------  1 postgres postgres 262144 Jan 20 10:24 028F
-rw-------  1 postgres postgres 262144 Jan 20 11:04 0290
-rw-------  1 postgres postgres 262144 Jan 20 11:50 0291
-rw-------  1 postgres postgres 262144 Jan 20 13:27 0292
-rw-------  1 postgres postgres 262144 Jan 20 14:24 0293
-rw-------  1 postgres postgres 262144 Jan 20 14:53 0294
-rw-------  1 postgres postgres 262144 Jan 20 17:10 0295

That is right now.  Right after it started up it went up to 0292.  
There are a lot of files before the ones listed here right now  
though.  Do you need to see their names?



On Jan 20, 2006, at 3:58 PM, Tom Lane wrote:

 7.3.what?

 What file names exist in the pg_clog directory?


Re: [HACKERS] panic on 7.3

2006-01-20 Thread Bruce Momjian
Rick Gigger wrote:
 Postgres version 7.3.4
 
 ... a whole bunch of other files
 -rw-------  1 postgres postgres 262144 Jan 20 17:10 0295
 
 That is right now.  Right after it started up it went up to 0292.  
 There are a lot of files before the ones listed here right now  
 though.  Do you need to see their names?

I assume you are missing one of these fixes in 7.3.X current which were
done _after_ 7.3.4 was released:

 * Fix race condition in transaction log management
   There was a narrow window in which an I/O operation could be
   initiated for the wrong page, leading to an Assert
   failure or data corruption.

 * Repair incorrect order of operations in GetNewTransactionId()
   This bug could result in failure under out-of-disk-space
   conditions, including inability to restart even after
   disk space is freed.


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] panic on 7.3

2006-01-20 Thread Rick Gigger
It is the version that shipped with fedora core 1.  The version  
string from psql is (PostgreSQL) 7.3.4-RH.  I assume that it must  
have been the first bug since I had plenty of disk space.



On Jan 20, 2006, at 5:35 PM, Bruce Momjian wrote:

 I assume you are missing one of these fixes in 7.3.X current which were
 done _after_ 7.3.4 was released ...


Re: [HACKERS] panic on 7.3

2006-01-20 Thread Andrew Dunstan


Updates for FC1 are available here:

http://download.fedoralegacy.org/fedora/1/updates/i386/

they have 7.3.9 dated in March last year.

Or grab the source for 7.3.13 and build it yourself.

cheers

andrew

Rick Gigger wrote:

It is the version that shipped with fedora core 1.  The version  
string from psql is (PostgreSQL) 7.3.4-RH.  I assume that it must  
have been the first bug since I had plenty of disk space.





Re: [HACKERS] panic on 7.3

2006-01-20 Thread Tom Lane
Rick Gigger [EMAIL PROTECTED] writes:
 Postgres version 7.3.4

You realize of course that that's pretty old ...

 That is right now.  Right after it started up it went up to 0292.  

So it was the latest file eh?  I thought maybe you had some problem
with a corrupted XID leading to trying to touch a clog file
out-of-order, but that seems ruled out.

 2006-01-20 11:50:51 PANIC:  creation of file /var/lib/pgsql/data/pg_clog/0292 failed: File exists

Digging in the 7.3 sources, it seems that error message could only have
come from here:

fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
if (fd < 0)
{
    if (errno != ENOENT)
        elog(PANIC, "open of %s failed: %m", path);
    fd = BasicOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
                       S_IRUSR | S_IWUSR);
    if (fd < 0)
        elog(PANIC, "creation of file %s failed: %m", path);
}

AFAICS, it is simply not possible for the second open() to fail with
that errno, *unless* someone else created the same file in the
microseconds between the two open calls.

The code doing this has a lock on the particular clog buffer it's trying
to write out, so no-one else could be trying to write the same buffer;
but now that I look at it, it's entirely legal for someone else to be
trying to write a different clog buffer.  This leads to the following
theory:

1. The clog page that would be first in the 0292 segment got created in
clog buffers, but there was no reason to write it out for awhile.  (In
normal operation, only a checkpoint would be cause to write out the
frontmost page of clog.)

2. More than 2K transactions elapsed, so the page that would be second
in the 0292 segment also got set up in clog buffers.  (Rick, is the load
on your machine such that several thousand transactions might have
elapsed between checkpoints?)  Perhaps there were even enough
transactions so that more than two pages were dirty and pending write
in the clog buffers, yet the file hadn't been created yet.

3. Two different backends decided to try to write different clog pages
concurrently.  Probably one was writing the frontmost page because it
was doing a checkpoint, and another needed to read in an older clog page
so it had to swap out one of the other dirty buffers.

4. Kaboom.

If this theory is correct, the bug has been there since the clog code
was first written.  But the conditions for having it happen are narrow
enough that it's not too surprising we haven't seen it before.

I think that a sufficient fix might just be to remove the O_EXCL flag
from the second open() call --- ie, if someone else creates the file
in this narrow window, it should be considered OK.  Comments?

regards, tom lane



Re: [HACKERS] panic on 7.3

2006-01-20 Thread Tom Lane
I wrote:
 If this theory is correct, the bug has been there since the clog code
 was first written.  But the conditions for having it happen are narrow
 enough that it's not too surprising we haven't seen it before.

Actually, there seem to be a couple of unresolved bug reports that look
like the same thing ...

http://archives.postgresql.org/pgsql-hackers/2006-01/msg00216.php
http://archives.postgresql.org/pgsql-bugs/2005-12/msg00130.php
http://archives.postgresql.org/pgsql-general/2005-10/msg00975.php

The spelling of the message has changed since 7.3, but the code behavior
is still fundamentally the same.

regards, tom lane



Re: [HACKERS] panic on 7.3

2006-01-20 Thread Rick Gigger

Thanks.  I'm not quite sure what version I am going to upgrade to yet.

Rick

On Jan 20, 2006, at 5:59 PM, Andrew Dunstan wrote:

 Updates for FC1 are available here:
 http://download.fedoralegacy.org/fedora/1/updates/i386/

 they have 7.3.9 dated in March last year.

 Or grab the source for 7.3.13 and build it yourself.


Re: [HACKERS] panic on 7.3

2006-01-20 Thread Rick Gigger

Rick Gigger [EMAIL PROTECTED] writes:

Postgres version 7.3.4


You realize of course that that's pretty old ...



Yes.  I will be upgrading immediately.


That is right now.  Right after it started up it went up to 0292.


So it was the latest file eh?  I thought maybe you had some problem
with a corrupted XID leading to trying to touch a clog file
out-of-order, but that seems ruled out.


Well that sounds like a good thing.




2006-01-20 11:50:51 PANIC:  creation of file /var/lib/pgsql/data/
pg_clog/0292 failed: File exists


Digging in the 7.3 sources, it seems that error message could only have
come from here:

fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
if (fd < 0)
{
    if (errno != ENOENT)
        elog(PANIC, "open of %s failed: %m", path);
    fd = BasicOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
                       S_IRUSR | S_IWUSR);
    if (fd < 0)
        elog(PANIC, "creation of file %s failed: %m", path);
}


Yes, I found that too (by accident, with Google) but didn't really have
the slightest clue what exactly would have caused it.




AFAICS, it is simply not possible for the second open() to fail with
that errno, *unless* someone else created the same file in the
microseconds between the two open calls.

The code doing this has a lock on the particular clog buffer it's trying
to write out, so no-one else could be trying to write the same buffer;
but now that I look at it, it's entirely legal for someone else to be
trying to write a different clog buffer.  This leads to the following
theory:

1. The clog page that would be first in the 0292 segment got created in
clog buffers, but there was no reason to write it out for awhile.  (In
normal operation, only a checkpoint would be cause to write out the
frontmost page of clog.)

2. More than 2K transactions elapsed, so the page that would be second
in the 0292 segment also got set up in clog buffers.  (Rick, is the load
on your machine such that several thousand transactions might have
elapsed between checkpoints?)  Perhaps there were even enough
transactions so that more than two pages were dirty and pending write
in the clog buffers, yet the file hadn't been created yet.


I don't know if 2K transactions could have passed since the last
checkpoint.  Part of the reason I haven't upgraded in so long is that it
has been running like a champ for about 3 years.  Recently, though, the
load on the site shot through the roof.  Not only am I going to upgrade
the version of postgres, but I need to do some major tuning; I am still
using a lot of defaults, including the default checkpoint settings, and
I am not sure how often checkpoints are happening.  Actually, now that I
think about it, I was getting about 400 page requests per minute, and
each of those would have been doing at least 2 transactions, so yes, I
guess that is very likely.



3. Two different backends decided to try to write different clog pages
concurrently.  Probably one was writing the frontmost page because it
was doing a checkpoint, and another needed to read in an older clog page
so it had to swap out one of the other dirty buffers.

4. Kaboom.


Yeah Kaboom.  It was really bad timing too.  :)


If this theory is correct, the bug has been there since the clog code
was first written.  But the conditions for having it happen are narrow
enough that it's not too surprising we haven't seen it before.


Wow it's great to be the first.


I think that a sufficient fix might just be to remove the O_EXCL flag
from the second open() call --- ie, if someone else creates the file
in this narrow window, it should be considered OK.  Comments?


Well, just a little FYI; I don't know if any of this will help, but I
was suffering from severe table bloat.  The data in my session table is
unfortunately quite large and being updated constantly, so the session
table, its two indexes, and especially its toast table were impossible
to vacuum.  Also, the vacuum and fsm settings were the defaults, making
the problem worse.




regards, tom lane



Thanks so much for the help.

Rick



Re: [HACKERS] panic on 7.3

2006-01-20 Thread Rick Gigger


On Jan 20, 2006, at 6:02 PM, Tom Lane wrote:


 2. More than 2K transactions elapsed, so the page that would be second
 in the 0292 segment also got set up in clog buffers.  (Rick, is the load
 on your machine such that several thousand transactions might have
 elapsed between checkpoints?)  Perhaps there were even enough
 transactions so that more than two pages were dirty and pending write
 in the clog buffers, yet the file hadn't been created yet.

What I think I'm getting killed on right now is the disk writes, so I
was thinking of moving away from the default checkpoint settings.  My
load is going to continue to go up, so 2000+ transactions between
checkpoints are going to start happening pretty fast.  I have lots of
disk space, so I was going to increase the time between checkpoints.
Will that increase the chances of this happening again, or was this such
a freak coincidence that it can't really happen again?  Also, I've
decided to upgrade all the way to the latest 8.1 code.









Re: [HACKERS] panic on 7.3

2006-01-20 Thread Tom Lane
Rick Gigger [EMAIL PROTECTED] writes:
 I don't know if 2K could have passed since the last checkpoint.
 ...
 now that I think about it I was getting about 400 page requests /
 minute and each of those would have been doing at least 2
 transactions so yes, I guess that is very likely.

Good, 'cause if you didn't have a couple thousand transactions between
checkpoints then we need another theory ;-)

 You realize of course that that's pretty old ...

 Yes.  I will be upgrading immediately.

You'll want to include this patch:
http://archives.postgresql.org/pgsql-committers/2006-01/msg00289.php
(or see adjacent messages if you plan to move to something newer than
7.3.*).  We probably will not put out another set of releases until
next month, unless something really big comes along.  This one doesn't
qualify as really big IMHO, because it's not a PANIC-grade failure in
the later branches.  But having been burnt once, I'm sure you'll want
a patched copy ...

regards, tom lane



Re: [HACKERS] panic on 7.3

2006-01-20 Thread Rick Gigger

Thanks very much!

I've decided to go straight to 8.1, though.  There are just too many
performance improvements at this point that I might regret not having,
and I don't want to do a dump/reload again.  I am about to compile it
now.  If it isn't a panic-grade failure in the latest 8.1 code, then I'd
just as soon take the stock release source code.  I don't care at all if
this kills one connection at the ultra-low frequency with which it
occurs, but what I can't have is the whole server restarting itself in
the middle of processing hundreds of transactions.  Once that happens,
all of the web clients hang onto their bad connections and then
eventually die.  Considering that I'm moving to 8.1 and am not too
familiar with applying patches, am I crazy for just going with the stock
8.1 code?


On Jan 20, 2006, at 10:36 PM, Tom Lane wrote:

 You'll want to include this patch:
 http://archives.postgresql.org/pgsql-committers/2006-01/msg00289.php