Randolf Richardson [EMAIL PROTECTED] writes:
[EMAIL PROTECTED] (Greg Stark) stated in
comp.databases.postgresql.hackers:
The traditional Unix filesystems certainly don't return errors at close.
Why shouldn't the close() function return an error? If an invalid
file handle was passed
Manfred Spraul [EMAIL PROTECTED] writes:
The checkpoint code uses sync() right now. Actually sync();sleep(2);sync().
Win32 has no sync() call, therefore it will use fsyncs. Perhaps platforms with
deferred errors on close must use fsync, too. Hopefully parallel fsyncs -
sequential fsyncs
Manfred Spraul [EMAIL PROTECTED] writes:
That means
open();
write();
sync();
could succeed, but the data is not stored on disk, correct?
That would be true on any filesystem. Unless you throw an fsync() call in.
With sync replaced by fsync then any filesystem ought to
[EMAIL PROTECTED] (Greg Stark) stated in
comp.databases.postgresql.hackers:
Tom Lane [EMAIL PROTECTED] writes:
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
FreeBSD 4.7/4.9 and the UFS filesystem
Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close(). We
Tom Lane wrote:
I said:
If there wasn't disk space enough to hold the clog page, the checkpoint
attempt should have failed. So it may be that allowing a short read in
slru.c would be patching the symptom of a bug that is really elsewhere.
After more staring at the code, I have
Christoph Haller [EMAIL PROTECTED] writes:
Tom was referring to close(), not fclose().
I once had an awful time searching for a memory leak caused
by a typo using close instead of fclose.
So adding checks for both is probably a good idea.
Already done.
regards,
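For readers following the close()/fsync() discussion, here is a minimal sketch of checking every step of a write path. It is written in Python for brevity, and the helper name `durable_write` is made up for illustration; it is not the PostgreSQL code and only assumes the standard `os` module, whose calls raise `OSError` when the underlying system call fails.

```python
import os


def durable_write(path, data):
    """Write data to path and report failures from write, fsync, or close.

    Hypothetical helper for illustration only; not PostgreSQL code.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write may accept fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        # fsync forces this file's data to disk; raises OSError on failure.
        os.fsync(fd)
    except BaseException:
        # Best-effort cleanup; keep the original error as the one reported.
        try:
            os.close(fd)
        except OSError:
            pass
        raise
    # Some filesystems report deferred errors only at close, so check it too.
    os.close(fd)
    return offset
```

The point of the sketch is only that an error such as ENOSPC can surface at any of the three calls, so none of the return values should be discarded.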
-On [20040125 03:52], Tom Lane ([EMAIL PROTECTED]) wrote:
Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close().
From Tru64's write(2):
[ENOSPC]
[XSH4.2] No free space is left on the file system containing the
file.
[Tru64 UNIX] An attempt was made
Tom Lane wrote:
Okay ... Chris was kind enough to let me examine the WAL logs and
postmaster stderr log for his recent problem, and I believe that
I have now achieved a full understanding of what happened. The true
bug is indeed somewhere else than slru.c, and we would not have found
it if
Awesome Tom :)
I'm glad I happened to have all the data required on hand to fully analyze
the problem. Let's hope this makes this failure condition go away for all
future postgresql users :)
Chris
On Mon, 26 Jan 2004, Tom Lane wrote:
Okay ... Chris was kind enough to let me examine the WAL
Tom Lane wrote:
I said:
If there wasn't disk space enough to hold the clog page, the checkpoint
attempt should have failed. So it may be that allowing a short read in
slru.c would be patching the symptom of a bug that is really elsewhere.
After more staring at the code, I have a theory.
Excellent analysis. Thanks. Are there any other cases like this?
---
Tom Lane wrote:
Okay ... Chris was kind enough to let me examine the WAL logs and
postmaster stderr log for his recent problem, and I believe that
I
On Mon, Jan 26, 2004 at 02:52:58PM +0900, Michael Glaesemann wrote:
I don't know if the 'canaveral' prompt had anything to do with it
(maybe it was just the subject line), but I kept thinking of shuttle
disasters, o-rings, and plane crashes reading through this. I won't
claim to understand
Just for the record, the Canaveral you are thinking about is derived
from the Spanish word Cañaveral, which is a place where cañas grow
(canes or stems, according to my dictionary -- some sort of vegetal
living form anyway). I suppose Cape Kennedy was filled with those
plants and that's what the
Greg Stark wrote:
I do know that AFS returns quota failures on close. This was unusual enough
that when AFS was deployed at school unix tools failed left and right over
precisely this issue. Though it mostly just meant they returned the wrong exit
status.
That means
open();
write();
Okay ... Chris was kind enough to let me examine the WAL logs and
postmaster stderr log for his recent problem, and I believe that
I have now achieved a full understanding of what happened. The true
bug is indeed somewhere else than slru.c, and we would not have found
it if slru.c had had
Tom,
I don't know if the 'canaveral' prompt had anything to do with it
(maybe it was just the subject line), but I kept thinking of shuttle
disasters, o-rings, and plane crashes reading through this. I won't
claim to understand everything in huge detail, but from this newbie's
point of view,
Gavin Sherry [EMAIL PROTECTED] writes:
It seems that by adding the following to SlruPhysicalReadPage() we can
recover in a reasonable way here. Instead of:
[ add non-error check to lseek() ]
But it's not the lseek() that's gonna fail. What we'll actually see,
and did see in Chris' report, is
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
FreeBSD 4.7/4.9 and the UFS filesystem
Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close(). We need to fix the code to check close's return value,
probably, but it seems we still lack a clear explanation of what
Tom Lane [EMAIL PROTECTED] writes:
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
FreeBSD 4.7/4.9 and the UFS filesystem
Hm, okay, I'm pretty sure that that combination wouldn't report ENOSPC
at close(). We need to fix the code to check close's return value,
probably, but it seems we
That request to look at your WAL files is still open ...
I've sent you it privately - let me know how it goes.
Chris
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
We ran out of disk space on our main server, and now I've freed up
space, we cannot start postgres!
Jan 23 12:18:50 canaveral postgres[563]: [2-1] LOG: checkpoint record
is at 2/96500B94
Jan 23 12:18:50 canaveral postgres[563]: [3-1] LOG: redo record is at
2/964BD23C; undo record is at 0/0;
I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).
How do I do that? Sorry - I'm not sure of the quickest way, and I'm
reading man pages as we speak!
Thanks Tom,
Chris
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).
How do I do that? Sorry - I'm not sure of the quickest way, and I'm
reading man pages as we speak!
Something like dd if=/dev/zero bs=8k count=1
I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).
OK, I've done
dd if=/dev/zero of=zeros count=16
Then cat zeros >> 000D
Now I can start it up! Thanks!
What should I do now?
Chris
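The recovery step above (appending 8K of zeroes to the short file) can also be sketched in Python. This is a hedged illustration, not a recommended procedure: the function name `append_zero_pages` and the `PAGE_SIZE` constant are assumptions made here, and the file path is whatever file needs extending. Note that `dd` with `count=16` and its default 512-byte block size also yields 8192 bytes.

```python
import os

PAGE_SIZE = 8192  # 8K, the amount suggested in the thread


def append_zero_pages(path, pages=1, page_size=PAGE_SIZE):
    """Append pages * page_size zero bytes to path; return the new file size.

    Hypothetical helper for illustration only.
    """
    # "ab" opens for appending in binary mode, so existing bytes are untouched.
    with open(path, "ab") as f:
        f.write(b"\0" * (pages * page_size))
        f.flush()
        # Push the new bytes to disk before reporting success.
        os.fsync(f.fileno())
    return os.path.getsize(path)
```

Taking a copy of the file before changing it would be a sensible precaution.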
Quoting Christopher Kings-Lynne [EMAIL PROTECTED]:
I'd suggest extending that file with 8K of zeroes (might need more than
that, but probably not).
How do I do that? Sorry - I'm not sure of the quickest way, and I'm
reading man pages as we speak!
# dd if=/dev/zeros
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
Now I can start it up! Thanks!
What should I do now?
Go home and get some sleep ;-). If the WAL replay succeeded, you're up
and running, nothing else to do.
regards, tom lane
Quoting Tom Lane [EMAIL PROTECTED]:
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
Now I can start it up! Thanks!
What should I do now?
Go home and get some sleep ;-). If the WAL replay succeeded, you're up
and running, nothing else to do.
Tom, could you give a small
-Original Message-
From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: Friday, January 23, 2004 1:01 PM
To: Christopher Kings-Lynne
Cc: PostgreSQL-development
Subject: Re: [HACKERS] Disaster!
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
Now I can start it up! Thanks
On Fri, 2004-01-23 at 16:00, Tom Lane wrote:
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
Now I can start it up! Thanks!
What should I do now?
Go home and get some sleep ;-). If the WAL replay succeeded, you're up
and running, nothing else to do.
Granted, running out of
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
Are you interested in real backtraces, any of the old data directory,
etc. to debug the problem?
If you could recompile with debug support and get a backtrace from the
panic, it would be helpful. I suspect what we need to do is make the
clog
Martín Marqués [EMAIL PROTECTED] writes:
Tom, could you give a small insight on what occurred here, why those
8k of zeros fixed it, and what is a WAL replay?
I think what happened is that there was insufficient space to write out
a new page of the
Rod Taylor [EMAIL PROTECTED] writes:
Granted, running out of diskspace is a bad idea, but can (has?)
something be put into place to prevent manual intervention from being
required in restarting the database?
See subsequent discussion. I do want to modify the code to avoid this
problem in
On Fri, Jan 23, 2004 at 04:21:04PM -0500, Tom Lane wrote:
But the clog access code evidently got confused by being asked to read
a page that didn't exist in the file. I'm not sure yet how that
sequence of events occurred, which is why I asked Chris for a stack
trace.
There was a very
Tom Lane wrote:
Christopher Kings-Lynne [EMAIL PROTECTED] writes:
Are you interested in real backtraces, any of the old data directory,
etc. to debug the problem?
If you could recompile with debug support and get a backtrace from the
panic, it would be helpful. I suspect what we need to
Alvaro Herrera [EMAIL PROTECTED] writes:
Tom's answer will be undoubtedly better ...
Nope, I think you got all the relevant points.
The only thing I'd add after having had more time to think about it is
that this seems very much like the problem we noticed recently with
recovery-from-WAL being
On Fri, 23 Jan 2004, Tom Lane wrote:
Alvaro Herrera [EMAIL PROTECTED] writes:
Tom's answer will be undoubtedly better ...
Nope, I think you got all the relevant points.
The only thing I'd add after having had more time to think about it is
that this seems very much like the problem we