Excellent analysis.  Thanks.  Are there any other cases like this?

---------------------------------------------------------------------------

Tom Lane wrote:
> Okay ... Chris was kind enough to let me examine the WAL logs and
> postmaster stderr log for his recent problem, and I believe that I have
> now achieved a full understanding of what happened.  The true bug is
> indeed somewhere else than slru.c, and we would not have found it if
> slru.c had had less-paranoid error checking.
>
> The WAL log shows that checkpoints were happening every five minutes up
> to 2004-01-23 10:13:10, but no checkpoint completion record appears
> after that.  However, the system remained up, with plenty of activity,
> until 10:45:24, when it was finally taken down by a panic.  The last
> transaction commit records in the WAL log are
>
>     commit: 14286807 at 2004-01-23 10:45:23
>     commit: 14286811 at 2004-01-23 10:45:24
>     commit: 14286814 at 2004-01-23 10:45:24
>     commit: 14286824 at 2004-01-23 10:45:24
>     commit: 14286825 at 2004-01-23 10:45:24
>     commit: 14286836 at 2004-01-23 10:45:24
>     commit: 14286838 at 2004-01-23 10:45:24
>     commit: 14286850 at 2004-01-23 10:45:24
>     commit: 14286851 at 2004-01-23 10:45:24
>
> Over in the postmaster log, the first sign of trouble is
>
>     Jan 23 10:18:07 canaveral postgres[20039]: [879-1] LOG: could not close temporary statistics file "/usr/local/pgsql/data/global/pgstat.tmp.20035": No space left on device
>
> and there is a steady stream of transactions failing with out-of-space
> errors over the next half hour, but none of the failures are worse than
> a transaction abort.  Finally we see
>
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>     Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>     Jan 23 10:45:24 canaveral postgres[20035]: [5-1] LOG: server process (PID 57237) was terminated by signal 6
>     Jan 23 10:45:24 canaveral postgres[20035]: [6-1] LOG: terminating any other active server processes
>
> after which the postmaster's recovery attempts fail, as Chris already
> detailed.  (Note: the reference to "transaction 0" is not significant;
> that just happens because SimpleLruWritePage doesn't have a specific
> transaction number to blame its write failures on.)
>
> Those are the observed facts; what's the interpretation?  I think it
> shows that Postgres is pretty darn robust, actually.  We were able to
> stay up and do useful work for quite a long time with zero free space;
> what's more, we lost no transactions that were successfully committed.
> The data was successfully stored in preallocated WAL space.  (If things
> had gone on this way for a while longer, we would have panicked for lack
> of WAL space, but Chris was actually not anywhere near there; he'd only
> filled about two WAL segments in the half hour of operations.)  Note
> also that checkpoints were attempted several times during that interval,
> and they all failed gracefully --- no panic, no incorrect WAL update.
>
> But why did this panic finally happen?  The key observation is that the
> first nonexistent page of pg_clog was the page beginning with
> transaction 14286848.  Neither this xact nor the following one has any
> commit or abort record in WAL, but we do see entries for 14286850 and
> 14286851.  It is also notable that there is no WAL entry for extension
> of pg_clog to include this page --- normally a WAL entry is made each
> time a page of zeroes is added to pg_clog.

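It is easy to check that arithmetic.  The sketch below is not part of
Tom's message; it simply redoes the xid-to-page math with the usual
7.4-era clog layout (two status bits per transaction, 8192-byte pages,
32 pages per segment file) and illustrative macro names:

    #include <stdio.h>

    /* Assumed 7.4-era pg_clog layout constants. */
    #define BLCKSZ                 8192
    #define CLOG_XACTS_PER_BYTE    4                              /* 2 bits per xact */
    #define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE) /* 32768 */
    #define SLRU_PAGES_PER_SEGMENT 32

    int
    main(void)
    {
        unsigned int xid  = 14286848;
        unsigned int page = xid / CLOG_XACTS_PER_PAGE;                /* 436 */
        unsigned int seg  = page / SLRU_PAGES_PER_SEGMENT;            /* 13 = 0x0D */
        unsigned int off  = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ; /* 163840 */

        printf("xid %u is %s on clog page %u (segment %04X, offset %u)\n",
               xid, (xid % CLOG_XACTS_PER_PAGE == 0) ? "first" : "not first",
               page, seg, off);

        /* The failing writes were at offset 147456 in pg_clog/000D, i.e.
         * page 13*32 + 147456/8192 = 434 --- two pages before page 436. */
        return 0;
    }

So 14286848 really is the first xid of a clog page (page 436, at offset
163840 in segment file 000D), and offset 147456 in that same file is
page 434, two pages earlier, as the note further down says.
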
> My interpretation of the sequence of events is:
>
> Transaction 14286848 started, and since it was the first for its
> pg_clog page, it tried to do ZeroCLOGPage() for that page (see
> ExtendCLOG).  This required making room in the in-memory clog buffers,
> which required dumping one of the previously-buffered clog pages, which
> failed for lack of disk space, leading to this log entry:
>
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>     Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
>
> (Note: page offset 147456 is the page two before the one containing
> xid 14286848.  This page had been allocated in clog buffers but never
> yet successfully written to disk.  Ditto for the page in between.)
> The next thing that happened was that transaction xids 14286849 and
> 14286850 were assigned (i.e., those xacts started), and then 14286850
> tried to commit.  This again led to a failed attempt to write out a
> clog page, but this time the error was promoted to a panic because it
> happened inside the transaction commit critical section:
>
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>
> The final commit record in WAL, from xid 14286851, must have come from
> a different backend that was able to get that far in its commit
> sequence before hearing the all-hands-abandon-ship signal from the
> postmaster.  (AFAICT it was just chance that the same backend process
> was responsible for both 14286848 and 14286850.  Presumably 14286849
> was taken out by yet another backend that hadn't gotten as far as
> trying to commit.)
>
> After Chris had freed some disk space, WAL replay was able to create
> the clog page at offset 147456, because there was a clog-extension WAL
> entry for it within the WAL entries since the last successful
> checkpoint.  It was also able to correctly fill that page using the
> transaction commit data in WAL.  Likewise for the page after that.
> But when it got to the commit record for 14286850, the error checks in
> slru.c barfed because there was no such page, thus exposing the real
> problem: there wasn't a clog-extension WAL record for that page.
>
> In short, the bug is in GetNewTransactionId(), which, shorn of
> extraneous details, looks like
>
>     LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
>
>     xid = ShmemVariableCache->nextXid;
>
>     TransactionIdAdvance(ShmemVariableCache->nextXid);
>
>     ExtendCLOG(xid);
>
>     LWLockRelease(XidGenLock);
>
> and the correct fix is to swap the order of the TransactionIdAdvance
> and ExtendCLOG lines.  Because these lines are out of order, a failure
> occurring down inside ExtendCLOG leaves the shared-memory copy of
> nextXid already advanced, and so subsequent transactions coming
> through this bit of code will see that they are not the first
> transaction in their page and conclude that they need not do any work
> to extend clog.  With the operations in the correct order, ExtendCLOG
> failure will leave the counter unchanged, so every subsequent
> transaction will try to do ExtendCLOG and will fail until some disk
> space becomes available.  (Note that this code is *not* in a critical
> section; indeed, it's not yet inside a transaction at all, so failure
> here does not mean a panic.)

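For concreteness, the reordered sequence Tom describes would look
roughly like this.  This is my paraphrase of the fix, not a patch from
this thread, with the same surrounding details of GetNewTransactionId()
elided as in the snippet above:

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;

    /* Make sure the clog page for this xid exists (and that its
     * extension has been WAL-logged) BEFORE advancing the counter.
     * If this errors out for lack of disk space, nextXid is left
     * unchanged, so the next transaction will retry the extension. */
    ExtendCLOG(xid);

    /* Only now is it safe to advance the shared counter. */
    TransactionIdAdvance(ShmemVariableCache->nextXid);

    LWLockRelease(XidGenLock);
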
> If you like, you can think of the real problem with this code as being
> that it violates the basic WAL rule: "make a WAL entry before making
> permanent changes".  nextXid isn't on disk, but it's persistent
> shared-memory state, and it mustn't be updated until any WAL entries
> associated with that action have been made.
>
> Any questions?
>
>         regards, tom lane

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                 |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073

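One piece of background that makes the failure mode easier to follow:
the "not the first transaction in their page" test Tom refers to is the
guard at the top of ExtendCLOG().  Roughly, from memory of the 7.4-era
clog.c (exact names and details may differ), it looks like:

    void
    ExtendCLOG(TransactionId newestXact)
    {
        int     pageno;

        /* Nothing to do unless this xid is the first one on its clog page. */
        if (TransactionIdToPgIndex(newestXact) != 0 &&
            !TransactionIdEquals(newestXact, FirstNormalTransactionId))
            return;

        pageno = TransactionIdToPage(newestXact);

        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

        /* Zero the new page in the clog buffers and emit the
         * clog-extension WAL record for it. */
        ZeroCLOGPage(pageno, true);

        LWLockRelease(CLogControlLock);
    }

Once the failed ExtendCLOG() call for 14286848 had left nextXid already
advanced, xids 14286849 and 14286850 failed this first-on-page test, so
the extension (and its WAL record) was never attempted again --- which
is exactly why replay later found a commit record for 14286850 with no
clog page to put it in.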