
Re: [HACKERS] Disaster!

2004-01-31 Thread Greg Stark

Manfred Spraul <[EMAIL PROTECTED]> writes: > The checkpoint code uses sync() right now. Actually sync();sleep(2);sync(). > Win32 has no sync() call, therefore it will use fsyncs. Perhaps platforms with > deferred errors on close must use fsync, too. Hopefully parallel fsyncs - > sequential fsyncs

Re: [HACKERS] Disaster!

2004-01-31 Thread Tom Lane

Randolf Richardson <[EMAIL PROTECTED]> writes: > "[EMAIL PROTECTED] (Greg Stark)" stated in > comp.databases.postgresql.hackers: >> The traditional Unix filesystems certainly don't return errors at close. > Why shouldn't the close() function return an error? If an invalid > file handle wa

Re: [HACKERS] Disaster!

2004-01-30 Thread Randolf Richardson

"[EMAIL PROTECTED] (Greg Stark)" stated in comp.databases.postgresql.hackers: > Tom Lane <[EMAIL PROTECTED]> writes: > >> Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: >> > FreeBSD 4.7/4.9 and the UFS filesystem >> >> Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC

Re: [HACKERS] Disaster!

2004-01-30 Thread Manfred Spraul

Greg Stark wrote: Manfred Spraul <[EMAIL PROTECTED]> writes: That means open(); write(); sync(); could succeed, but the data is not stored on disk, correct? That would be true on any filesystem. Unless you throw an fsync() call in. The checkpoint code uses sync() right now. Ac

Re: [HACKERS] Disaster!

2004-01-30 Thread Greg Stark

Manfred Spraul <[EMAIL PROTECTED]> writes: > That means > open(); > write(); > sync(); > > could succeed, but the data is not stored on disk, correct? That would be true on any filesystem. Unless you throw an fsync() call in. With sync replaced by fsync then any filesystem ought to
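
The difference Greg points at can be sketched in a few lines. This is a minimal illustration in Python, whose os module wraps the same POSIX calls; it is not PostgreSQL's checkpoint code, and write_durably is a hypothetical helper name. The assumption is that sync() only schedules writes system-wide and has no per-file error return, whereas fsync() on the file's own descriptor blocks until that file is flushed and reports failure.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # write() may accept fewer bytes than requested; keep going.
            written = os.write(fd, view)
            if written == 0:
                raise OSError("write() made no progress")
            view = view[written:]
        # Raises OSError (for example ENOSPC or EIO) if the flush fails.
        os.fsync(fd)
    finally:
        # os.close() raises OSError if close() itself reports a failure.
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "page.bin")
        write_durably(target, b"\0" * 8192)
        print(os.path.getsize(target))
```

With this shape, a caller learns about a failed flush from the exception rather than assuming that a later sync() took care of it.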

Re: [HACKERS] Disaster!

2004-01-29 Thread Jeroen Ruigrok/asmodai

-On [20040125 03:52], Tom Lane ([EMAIL PROTECTED]) wrote: >Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC >at close(). >From Tru64's write(2): [ENOSPC] [XSH4.2] No free space is left on the file system containing the file. [Tru64 UNIX] An attempt was ma

Re: [HACKERS] Disaster!

2004-01-29 Thread Tom Lane

Christoph Haller <[EMAIL PROTECTED]> writes: > Tom was referring to close(), not fclose(). > I once had an awful time searching for a memory leak caused > by a typo using close instead of fclose. > So adding checks for both is probably a good idea. Already done. regard

Re: [HACKERS] Disaster!

2004-01-29 Thread Christoph Haller

> > Tom Lane wrote: > > I said: > > > If there wasn't disk space enough to hold the clog page, the checkpoint > > > attempt should have failed. So it may be that allowing a short read in > > > slru.c would be patching the symptom of a bug that is really elsewhere. > > > > After more staring at t

Re: [HACKERS] Disaster!

2004-01-27 Thread Gaetano Mendola

Tom Lane wrote: Okay ... Chris was kind enough to let me examine the WAL logs and postmaster stderr log for his recent problem, and I believe that I have now achieved a full understanding of what happened. The true bug is indeed somewhere else than slru.c, and we would not have found it if slru.c

Re: [HACKERS] Disaster!

2004-01-26 Thread Christopher Kings-Lynne

Just for the record, the Canaveral you are thinking about is derived from the Spanish word "Cañaveral", which is a place where "cañas" grow (canes or stems, according to my dictionary -- some sort of vegetal living form anyway). I suppose Cape Kennedy was filled with those plants and that's what t

Re: [HACKERS] Disaster!

2004-01-26 Thread Alvaro Herrera

On Mon, Jan 26, 2004 at 02:52:58PM +0900, Michael Glaesemann wrote: > I don't know if the 'canaveral' prompt had anything to do with it > (maybe it was just the subject line), but I kept thinking of shuttle > disasters, o-rings, and plane crashes reading through this. I won't > claim to underst

Re: [HACKERS] Disaster!

2004-01-26 Thread Bruce Momjian

Excellent analysis. Thanks. Are there any other cases like this? --- Tom Lane wrote: > Okay ... Chris was kind enough to let me examine the WAL logs and > postmaster stderr log for his recent problem, and I believe that >

Re: [HACKERS] Disaster!

2004-01-26 Thread Bruce Momjian

Tom Lane wrote: > I said: > > If there wasn't disk space enough to hold the clog page, the checkpoint > > attempt should have failed. So it may be that allowing a short read in > > slru.c would be patching the symptom of a bug that is really elsewhere. > > After more staring at the code, I have a

Re: [HACKERS] Disaster!

2004-01-26 Thread Christopher Kings-Lynne

Awesome Tom :) I'm glad I happened to have all the data required on hand to fully analyze the problem. Let's hope this makes this failure condition go away for all future postgresql users :) Chris On Mon, 26 Jan 2004, Tom Lane wrote: > Okay ... Chris was kind enough to let me examine the WAL l

Re: [HACKERS] Disaster!

2004-01-25 Thread Michael Glaesemann

Tom, I don't know if the 'canaveral' prompt had anything to do with it (maybe it was just the subject line), but I kept thinking of shuttle disasters, o-rings, and plane crashes reading through this. I won't claim to understand everything in huge detail, but from this newbie's point of view, w

Re: [HACKERS] Disaster!

2004-01-25 Thread Tom Lane

Okay ... Chris was kind enough to let me examine the WAL logs and postmaster stderr log for his recent problem, and I believe that I have now achieved a full understanding of what happened. The true bug is indeed somewhere else than slru.c, and we would not have found it if slru.c had had less-par

Re: [HACKERS] Disaster!

2004-01-25 Thread Manfred Spraul

Greg Stark wrote: I do know that AFS returns quota failures on close. This was unusual enough that when AFS was deployed at school unix tools failed left and right over precisely this issue. Though it mostly just meant they returned the wrong exit status. That means open(); write(); sync(

Re: [HACKERS] Disaster!

2004-01-24 Thread Christopher Kings-Lynne

> That request to look at your WAL files is still open ... I've sent it to you privately - let me know how it goes. Chris

Re: [HACKERS] Disaster!

2004-01-24 Thread Greg Stark

Tom Lane <[EMAIL PROTECTED]> writes: > Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > > FreeBSD 4.7/4.9 and the UFS filesystem > > Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC > at close(). We need to fix the code to check close's return value, > probably, but it

Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane

Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > FreeBSD 4.7/4.9 and the UFS filesystem Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC at close(). We need to fix the code to check close's return value, probably, but it seems we still lack a clear explanation of what h

Re: [HACKERS] Disaster!

2004-01-24 Thread Christopher Kings-Lynne

After more staring at the code, I have a theory. SlruPhysicalWritePage and SlruPhysicalReadPage are coded on the assumption that close() can never return any interesting failure. However, it now occurs to me that there are some filesystem implementations wherein ENOSPC could be returned at close(

Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane

I said: > If there wasn't disk space enough to hold the clog page, the checkpoint > attempt should have failed. So it may be that allowing a short read in > slru.c would be patching the symptom of a bug that is really elsewhere. After more staring at the code, I have a theory. SlruPhysicalWriteP
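
The theory above, that a write path must not assume close() always succeeds, can be illustrated with a small sketch. This is Python rather than the C of slru.c, and write_page_checked, PAGE_SIZE and the "short write means out of space" convention are assumptions made for illustration, not the actual fix.

```python
import errno
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size, matching the 8K figure used in the thread


def write_page_checked(path: str, page_no: int, page: bytes) -> None:
    """Write one page and treat a failure at any step, close() included, as fatal."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.pwrite(fd, page, page_no * PAGE_SIZE)
        if written != len(page):
            # Assumption: a short write with no error set means the disk is full.
            raise OSError(errno.ENOSPC, "short write, assuming no space left")
    except BaseException:
        try:
            os.close(fd)
        except OSError:
            pass  # the earlier failure is the one worth reporting
        raise
    # Deliberately unguarded: if close() reports an error, the write has failed.
    os.close(fd)
```

On filesystems that defer errors until close(), the last line is where an ENOSPC would surface, so swallowing it would hide a lost page.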

Re: [HACKERS] Disaster!

2004-01-24 Thread Tom Lane

Gavin Sherry <[EMAIL PROTECTED]> writes: > It seems that by adding the following to SlruPhysicalReadPage() we can > recover in a reasonable way here. Instead of: > [ add non-error check to lseek() ] But it's not the lseek() that's gonna fail. What we'll actually see, and did see in Chris' report,
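
As a general POSIX fact (not a reconstruction of the truncated sentence above), lseek() to an offset beyond end of file succeeds, and it is the following read() that returns fewer bytes than requested. A minimal Python sketch of detecting that short read; read_page and PAGE_SIZE are hypothetical names, not the slru.c code:

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size


def read_page(fd: int, page_no: int) -> bytes:
    """Read one full page or raise; a short read is reported, not ignored."""
    # Seeking past end of file is not an error, so this call tells us nothing.
    os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
    data = os.read(fd, PAGE_SIZE)
    if len(data) != PAGE_SIZE:
        raise EOFError(
            f"short read of page {page_no}: got {len(data)} of {PAGE_SIZE} bytes"
        )
    return data
```

Whether a short read should be an error or be treated as a page of zeroes is exactly the design question debated in this thread; the sketch only shows where the condition becomes visible.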

Re: [HACKERS] Disaster!

2004-01-23 Thread Gavin Sherry

On Fri, 23 Jan 2004, Tom Lane wrote: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > Tom's answer will be undoubtedly better ... > > Nope, I think you got all the relevant points. > > The only thing I'd add after having had more time to think about it is > that this seems very much like the problem

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Alvaro Herrera <[EMAIL PROTECTED]> writes: > Tom's answer will be undoubtedly better ... Nope, I think you got all the relevant points. The only thing I'd add after having had more time to think about it is that this seems very much like the problem we noticed recently with recovery-from-WAL being

Re: [HACKERS] Disaster!

2004-01-23 Thread Bruce Momjian

Tom Lane wrote: > Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > > Are you interested in real backtraces, any of the old data directory, > > etc. to debug the problem? > > If you could recompile with debug support and get a backtrace from the > panic, it would be helpful. I suspect what w

Re: [HACKERS] Disaster!

2004-01-23 Thread Alvaro Herrera

On Fri, Jan 23, 2004 at 04:21:04PM -0500, Tom Lane wrote: > But the clog access code evidently got confused by being asked to read > a page that didn't exist in the file. I'm not sure yet how that > sequence of events occurred, which is why I asked Chris for a stack > trace. There was a very sim

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Rod Taylor <[EMAIL PROTECTED]> writes: > Granted, running out of diskspace is a bad idea, but can (has?) > something be put into place to prevent manual intervention from being > required in restarting the database? See subsequent discussion. I do want to modify the code to avoid this problem in

Re: [HACKERS] Disaster!

2004-01-23 Thread Alvaro Herrera

On Fri, Jan 23, 2004 at 05:58:33PM -0300, Martín Marqués wrote: > Tom, could you give a small insight on what occurred here, why those 8k of zeros > fixed it, and what is a "WAL replay"? If I may ... - the disk filled up - Postgres registered something in WAL that required some commit status (

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Martín Marqués <[EMAIL PROTECTED]> writes: > Tom, could you give a small insight on what occurred here, why those > 8k of zeros fixed it, and what is a "WAL replay"? I think what happened is that there was insufficient space to write out a new page of t

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > Are you interested in real backtraces, any of the old data directory, > etc. to debug the problem? If you could recompile with debug support and get a backtrace from the panic, it would be helpful. I suspect what we need to do is make the clo

Re: [HACKERS] Disaster!

2004-01-23 Thread Rod Taylor

On Fri, 2004-01-23 at 16:00, Tom Lane wrote: > Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > > Now I can start it up! Thanks! > > > What should I do now? > > Go home and get some sleep ;-). If the WAL replay succeeded, you're up > and running, nothing else to do. Granted, running out o

Re: [HACKERS] Disaster!

2004-01-23 Thread Dann Corbit

> -Original Message- > From: Tom Lane [mailto:[EMAIL PROTECTED] > Sent: Friday, January 23, 2004 1:01 PM > To: Christopher Kings-Lynne > Cc: PostgreSQL-development > Subject: Re: [HACKERS] Disaster! > > > Christopher Kings-Lynne <[EMAIL PROTECTED]>

Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne

What should I do now? Go home and get some sleep ;-). If the WAL replay succeeded, you're up and running, nothing else to do. Cool, thanks heaps Tom. Are you interested in real backtraces, any of the old data directory, etc. to debug the problem? Obviously it ran out of disk space, but surely

Re: [HACKERS] Disaster!

2004-01-23 Thread Martín Marqués

Quoting Tom Lane <[EMAIL PROTECTED]>: > Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > > Now I can start it up! Thanks! > > > What should I do now? > > Go home and get some sleep ;-). If the WAL replay succeeded, you're up > and running, nothing else to do. Tom, could you gi

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > Now I can start it up! Thanks! > What should I do now? Go home and get some sleep ;-). If the WAL replay succeeded, you're up and running, nothing else to do. regards, tom lane

Re: [HACKERS] Disaster!

2004-01-23 Thread Martín Marqués

Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>: > > I'd suggest extending that file with 8K of zeroes (might need more than > > that, but probably not). > > How do I do that? Sorry - I'm not sure of the quickest way, and I'm > reading man pages as we speak! # dd if=/dev/zero o

Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne

I'd suggest extending that file with 8K of zeroes (might need more than that, but probably not). OK, I've done dd if=/dev/zero of=zeros count=16 Then cat zeros >> 000D Now I can start it up! Thanks! What should I do now? Chris

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: >> I'd suggest extending that file with 8K of zeroes (might need more than >> that, but probably not). > How do I do that? Sorry - I'm not sure of the quickest way, and I'm > reading man pages as we speak! Something like "dd if=/dev/zero bs=8k
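
The suggested repair, appending 8K of zero bytes to the short file, can also be sketched in Python. This is not a completion of the truncated dd command above; append_zero_pages and PAGE_SIZE are hypothetical names, and the sketch assumes the server is stopped and the file has been backed up first.

```python
import os
import tempfile

PAGE_SIZE = 8192  # the "8K" from the suggestion above


def append_zero_pages(path: str, pages: int = 1) -> int:
    """Append whole pages of zero bytes to path and return the new file size."""
    with open(path, "ab") as f:
        f.write(b"\0" * (PAGE_SIZE * pages))
        f.flush()
        # Make sure the extension reaches disk before the server is restarted.
        os.fsync(f.fileno())
    return os.path.getsize(path)
```

Opening in append mode leaves the existing contents untouched, which is the same effect as creating a zero-filled scratch file and concatenating it onto the end.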

Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne

I'd suggest extending that file with 8K of zeroes (might need more than that, but probably not). How do I do that? Sorry - I'm not sure of the quickest way, and I'm reading man pages as we speak! Thanks Tom, Chris

Re: [HACKERS] Disaster!

2004-01-23 Thread Tom Lane

Christopher Kings-Lynne <[EMAIL PROTECTED]> writes: > We ran out of disk space on our main server, and now I've freed up > space, we cannot start postgres! > Jan 23 12:18:51 canaveral postgres[563]: [7-1] PANIC: could not access > status of transaction 14286850 > Jan 23 12:18:51 canaveral postg
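
To see which pg_clog file and page the PANIC refers to, the transaction number can be mapped to a location by simple arithmetic. The constants below are assumptions not stated in this thread: 2 status bits per transaction, 8192-byte pages, 32 pages per segment file (which agrees with the 262144-byte files in the directory listing, since 32 * 8192 = 262144), and segment files named with four hex digits. locate_xact is a hypothetical helper.

```python
PAGE_SIZE = 8192                       # assumed bytes per clog page
BITS_PER_XACT = 2                      # assumed status bits per transaction
XACTS_PER_BYTE = 8 // BITS_PER_XACT    # 4
XACTS_PER_PAGE = PAGE_SIZE * XACTS_PER_BYTE
PAGES_PER_SEGMENT = 32                 # assumed; 32 * 8192 = 262144 bytes per file


def locate_xact(xid: int) -> dict:
    """Map a transaction id to its segment file, page and bit position."""
    page = xid // XACTS_PER_PAGE
    page_in_segment = page % PAGES_PER_SEGMENT
    index_in_page = xid % XACTS_PER_PAGE
    return {
        "segment_file": f"{page // PAGES_PER_SEGMENT:04X}",
        "page_in_segment": page_in_segment,
        "page_offset_bytes": page_in_segment * PAGE_SIZE,
        "byte_in_page": index_in_page // XACTS_PER_BYTE,
        "bit_shift": (index_in_page % XACTS_PER_BYTE) * BITS_PER_XACT,
    }


if __name__ == "__main__":
    print(locate_xact(14286850))
    # segment_file '000D', page_in_segment 20, page_offset_bytes 163840
```

Under these assumptions, transaction 14286850 lives near the start of page 20 of file 000D, so the lookup needs that file to be at least 163840 + 8192 = 172032 bytes long. That is consistent with 000D being the file extended with zeroes elsewhere in the thread.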

Re: [HACKERS] Disaster!

2004-01-23 Thread Dann Corbit

> -Original Message- > From: Christopher Kings-Lynne [mailto:[EMAIL PROTECTED] > Sent: Friday, January 23, 2004 12:29 PM > To: PostgreSQL-development > Cc: Tom Lane > Subject: [HACKERS] Disaster! > > > We ran out of disk space on our main server, and now

Re: [HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne

pg_clog information: # cd pg_clog # ls -al total 3602 drwx-- 2 pgsql pgsql 512 Jan 23 03:49 . drwx-- 6 pgsql pgsql 512 Jan 23 12:30 .. -rw--- 1 pgsql pgsql 262144 Jan 18 19:43 -rw--- 1 pgsql pgsql 262144 Jan 18 19:43 0001 -rw--- 1 pgsql pgsql 262144 Ja

[HACKERS] Disaster!

2004-01-23 Thread Christopher Kings-Lynne

We ran out of disk space on our main server, and now I've freed up space, we cannot start postgres! Jan 23 12:18:50 canaveral postgres[563]: [2-1] LOG: checkpoint record is at 2/96500B94 Jan 23 12:18:50 canaveral postgres[563]: [3-1] LOG: redo record is at 2/964BD23C; undo record is at 0/0; s

44 matches
