Re: [HACKERS] Disaster!

2004-01-31 Thread Tom Lane
Randolf Richardson [EMAIL PROTECTED] writes: [EMAIL PROTECTED] (Greg Stark) stated in comp.databases.postgresql.hackers: The traditional Unix filesystems certainly don't return errors at close. Why shouldn't the close() function return an error? If an invalid file handle was passed

Re: [HACKERS] Disaster!

2004-01-31 Thread Greg Stark
Manfred Spraul [EMAIL PROTECTED] writes: The checkpoint code uses sync() right now. Actually sync();sleep(2);sync(). Win32 has no sync() call, therefore it will use fsyncs. Perhaps platforms with deferred errors on close must use fsync, too. Hopefully parallel fsyncs - sequential fsyncs

Re: [HACKERS] Disaster!

2004-01-30 Thread Greg Stark
Manfred Spraul [EMAIL PROTECTED] writes: That means open(); write(); sync(); could succeed, but the data is not stored on disk, correct? That would be true on any filesystem. Unless you throw an fsync() call in. With sync replaced by fsync then any filesystem ought to
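
A minimal sketch of the point made in this message, assuming a POSIX system: with sync() the caller gets no per-file result, whereas write(), fsync() and close() each return a status that can be checked. The helper name write_durably is hypothetical and is not PostgreSQL code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not from PostgreSQL): write len bytes to path and
 * report a failure from any step. Returns 0 on success, -1 with errno set. */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0 && errno == EINTR)
            continue;                       /* interrupted, retry */
        if (n <= 0) {
            /* A zero-byte result is treated as an I/O error (assumption). */
            int saved = (n < 0) ? errno : EIO;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }

    /* fsync() reports whether this file's data reached stable storage. */
    if (fsync(fd) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Some filesystems report deferred errors only at close(). */
    if (close(fd) != 0)
        return -1;

    return 0;
}
```

A caller would treat a -1 return as "the data may not be on disk" and inspect errno (for example ENOSPC) instead of assuming success.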

Re: [HACKERS] Disaster!

2004-01-30 Thread Randolf Richardson
[EMAIL PROTECTED] (Greg Stark) stated in comp.databases.postgresql.hackers: Tom Lane [EMAIL PROTECTED] writes: Christopher Kings-Lynne [EMAIL PROTECTED] writes: FreeBSD 4.7/4.9 and the UFS filesystem Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC at close(). We

Re: [HACKERS] Disaster!

2004-01-29 Thread Christoph Haller
Tom Lane wrote: I said: If there wasn't disk space enough to hold the clog page, the checkpoint attempt should have failed. So it may be that allowing a short read in slru.c would be patching the symptom of a bug that is really elsewhere. After more staring at the code, I have

Re: [HACKERS] Disaster!

2004-01-29 Thread Tom Lane
Christoph Haller [EMAIL PROTECTED] writes: Tom was referring to close(), not fclose(). I once had an awful time searching for a memory leak caused by a typo using close instead of fclose. So adding checks for both is probably a good idea. Already done. regards,

Re: [HACKERS] Disaster!

2004-01-29 Thread Jeroen Ruigrok/asmodai
-On [20040125 03:52], Tom Lane ([EMAIL PROTECTED]) wrote: Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC at close(). From Tru64's write(2): [ENOSPC] [XSH4.2] No free space is left on the file system containing the file. [Tru64 UNIX] An attempt was made

Re: [HACKERS] Disaster!

2004-01-27 Thread Gaetano Mendola
Tom Lane wrote: Okay ... Chris was kind enough to let me examine the WAL logs and postmaster stderr log for his recent problem, and I believe that I have now achieved a full understanding of what happened. The true bug is indeed somewhere else than slru.c, and we would not have found it if

Re: [HACKERS] Disaster!

2004-01-26 Thread Christopher Kings-Lynne
Awesome Tom :) I'm glad I happened to have all the data required on hand to fully analyze the problem. Let's hope this makes this failure condition go away for all future postgresql users :) Chris On Mon, 26 Jan 2004, Tom Lane wrote: Okay ... Chris was kind enough to let me examine the WAL

Re: [HACKERS] Disaster!

2004-01-26 Thread Bruce Momjian
Tom Lane wrote: I said: If there wasn't disk space enough to hold the clog page, the checkpoint attempt should have failed. So it may be that allowing a short read in slru.c would be patching the symptom of a bug that is really elsewhere. After more staring at the code, I have a theory.

Re: [HACKERS] Disaster!

2004-01-26 Thread Bruce Momjian
Excellent analysis. Thanks. Are there any other cases like this? --- Tom Lane wrote: Okay ... Chris was kind enough to let me examine the WAL logs and postmaster stderr log for his recent problem, and I believe that I

Re: [HACKERS] Disaster!

2004-01-26 Thread Alvaro Herrera
On Mon, Jan 26, 2004 at 02:52:58PM +0900, Michael Glaesemann wrote: I don't know if the 'canaveral' prompt had anything to do with it (maybe it was just the subject line), but I kept thinking of shuttle disasters, o-rings, and plane crashes reading through this. I won't claim to understand

Re: [HACKERS] Disaster!

2004-01-26 Thread Christopher Kings-Lynne
Just for the record, the Canaveral you are thinking about is derived from the Spanish word Cañaveral, which is a place where cañas grow (canes or stems, according to my dictionary -- some sort of vegetal living form anyway). I suppose Cape Kennedy was filled with those plants and that's what the

Re: [HACKERS] Disaster!

2004-01-25 Thread Manfred Spraul
Greg Stark wrote: I do know that AFS returns quota failures on close. This was unusual enough that when AFS was deployed at school unix tools failed left and right over precisely this issue. Though it mostly just meant they returned the wrong exit status. That means open(); write();

Re: [HACKERS] Disaster!

2004-01-25 Thread Tom Lane
Okay ... Chris was kind enough to let me examine the WAL logs and postmaster stderr log for his recent problem, and I believe that I have now achieved a full understanding of what happened. The true bug is indeed somewhere else than slru.c, and we would not have found it if slru.c had had

Re: [HACKERS] Disaster!

2004-01-25 Thread Michael Glaesemann
Tom, I don't know if the 'canaveral' prompt had anything to do with it (maybe it was just the subject line), but I kept thinking of shuttle disasters, o-rings, and plane crashes reading through this. I won't claim to understand everything in huge detail, but from this newbie's point of view,

Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes: It seems that by adding the following to SlruPhysicalReadPage() we can recover in a reasonable way here. Instead of: [ add non-error check to lseek() ] But it's not the lseek() that's gonna fail. What we'll actually see, and did see in Chris' report, is

Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes: FreeBSD 4.7/4.9 and the UFS filesystem Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC at close(). We need to fix the code to check close's return value, probably, but it seems we still lack a clear explanation of what

Re: [HACKERS] Disaster!

2004-01-24 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes: Christopher Kings-Lynne [EMAIL PROTECTED] writes: FreeBSD 4.7/4.9 and the UFS filesystem Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC at close(). We need to fix the code to check close's return value, probably, but it seems we

Re: [HACKERS] Disaster!

2004-01-24 Thread Christopher Kings-Lynne
That request to look at your WAL files is still open ... I've sent you it privately - let me know how it goes. Chris

[HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne
We ran out of disk space on our main server, and now I've freed up space, we cannot start postgres! Jan 23 12:18:50 canaveral postgres[563]: [2-1] LOG: checkpoint record is at 2/96500B94 Jan 23 12:18:50 canaveral postgres[563]: [3-1] LOG: redo record is at 2/964BD23C; undo record is at 0/0;

Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne
I'd suggest extending that file with 8K of zeroes (might need more than that, but probably not). How do I do that? Sorry - I'm not sure of the quickest way, and I'm reading man pages as we speak! Thanks Tom, Chris

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes: I'd suggest extending that file with 8K of zeroes (might need more than that, but probably not). How do I do that? Sorry - I'm not sure of the quickest way, and I'm reading man pages as we speak! Something like dd if=/dev/zero bs=8k count=1
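
As a rough illustration of "extending that file with 8K of zeroes", the same effect can be sketched in C. This assumes a POSIX system and that "8K" means 8192 bytes; the name append_zero_page and the PAGE_BYTES constant are hypothetical, and the sketch is not the command suggested in the message.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define PAGE_BYTES 8192   /* "8K" from the thread, assumed to be 8192 bytes */

/* Hypothetical helper: append PAGE_BYTES zero bytes to an existing file.
 * Returns 0 on success, -1 with errno set. */
int append_zero_page(const char *path)
{
    static const char zeros[PAGE_BYTES] = {0};

    assert(path != NULL);

    /* O_APPEND makes every write land at the current end of the file. */
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < sizeof zeros) {
        ssize_t n = write(fd, zeros + done, sizeof zeros - done);
        if (n < 0 && errno == EINTR)
            continue;                       /* interrupted, retry */
        if (n <= 0) {
            /* A zero-byte result is treated as an I/O error (assumption). */
            int saved = (n < 0) ? errno : EIO;
            close(fd);
            errno = saved;
            return -1;
        }
        done += (size_t) n;
    }

    /* Flush to stable storage and check close() as well. */
    if (fsync(fd) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

The file is opened without O_CREAT on purpose, so a mistyped path fails instead of creating a stray file.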

Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne
I'd suggest extending that file with 8K of zeroes (might need more than that, but probably not). OK, I've done dd if=/dev/zero of=zeros count=16 Then cat zero 000D Now I can start it up! Thanks! What should I do now? Chris

Re: [HACKERS] Disaster!

2004-01-23 Thread Martín Marqués
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]: I'd suggest extending that file with 8K of zeroes (might need more than that, but probably not). How do I do that? Sorry - I'm not sure of the quickest way, and I'm reading man pages as we speak! # dd if=/dev/zeros

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes: Now I can start it up! Thanks! What should I do now? Go home and get some sleep ;-). If the WAL replay succeeded, you're up and running, nothing else to do. regards, tom lane

Re: [HACKERS] Disaster!

2004-01-23 Thread Martín Marqués
Quoting Tom Lane [EMAIL PROTECTED]: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Now I can start it up! Thanks! What should I do now? Go home and get some sleep ;-). If the WAL replay succeeded, you're up and running, nothing else to do. Tom, could you give a small

Re: [HACKERS] Disaster!

2004-01-23 Thread Dann Corbit
-Original Message- From: Tom Lane [mailto:[EMAIL PROTECTED] Sent: Friday, January 23, 2004 1:01 PM To: Christopher Kings-Lynne Cc: PostgreSQL-development Subject: Re: [HACKERS] Disaster! Christopher Kings-Lynne [EMAIL PROTECTED] writes: Now I can start it up! Thanks

Re: [HACKERS] Disaster!

2004-01-23 Thread Rod Taylor
On Fri, 2004-01-23 at 16:00, Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Now I can start it up! Thanks! What should I do now? Go home and get some sleep ;-). If the WAL replay succeeded, you're up and running, nothing else to do. Granted, running out of

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Christopher Kings-Lynne [EMAIL PROTECTED] writes: Are you interested in real backtraces, any of the old data directory, etc. to debug the problem? If you could recompile with debug support and get a backtrace from the panic, it would be helpful. I suspect what we need to do is make the clog

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Martín Marqués [EMAIL PROTECTED] writes: Tom, could you give a small insight on what occurred here, why those 8k of zeros fixed it, and what is a WAL replay? I think what happened is that there was insufficient space to write out a new page of the

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Rod Taylor [EMAIL PROTECTED] writes: Granted, running out of diskspace is a bad idea, but can (has?) something be put into place to prevent manual intervention from being required in restarting the database? See subsequent discussion. I do want to modify the code to avoid this problem in

Re: [HACKERS] Disaster!

2004-01-23 Thread Alvaro Herrera
On Fri, Jan 23, 2004 at 04:21:04PM -0500, Tom Lane wrote: But the clog access code evidently got confused by being asked to read a page that didn't exist in the file. I'm not sure yet how that sequence of events occurred, which is why I asked Chris for a stack trace. There was a very

Re: [HACKERS] Disaster!

2004-01-23 Thread Bruce Momjian
Tom Lane wrote: Christopher Kings-Lynne [EMAIL PROTECTED] writes: Are you interested in real backtraces, any of the old data directory, etc. to debug the problem? If you could recompile with debug support and get a backtrace from the panic, it would be helpful. I suspect what we need to

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes: Tom's answer will be undoubtedly better ... Nope, I think you got all the relevant points. The only thing I'd add after having had more time to think about it is that this seems very much like the problem we noticed recently with recovery-from-WAL being

Re: [HACKERS] Disaster!

2004-01-23 Thread Gavin Sherry
On Fri, 23 Jan 2004, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Tom's answer will be undoubtedly better ... Nope, I think you got all the relevant points. The only thing I'd add after having had more time to think about it is that this seems very much like the problem we