Excellent analysis.  Thanks.  Are there any other cases like this?

---------------------------------------------------------------------------

Tom Lane wrote:
> Okay ... Chris was kind enough to let me examine the WAL logs and
> postmaster stderr log for his recent problem, and I believe that I have
> now achieved a full understanding of what happened.  The true bug is
> indeed somewhere else than slru.c, and we would not have found it if
> slru.c had had less-paranoid error checking.
>
> The WAL log shows that checkpoints were happening every five minutes up
> to 2004-01-23 10:13:10, but no checkpoint completion record appears
> after that.  However, the system remained up, with plenty of activity,
> until 10:45:24, when it was finally taken down by a panic.  The last
> transaction commit records in the WAL log are
>
>     commit: 14286807 at 2004-01-23 10:45:23
>     commit: 14286811 at 2004-01-23 10:45:24
>     commit: 14286814 at 2004-01-23 10:45:24
>     commit: 14286824 at 2004-01-23 10:45:24
>     commit: 14286825 at 2004-01-23 10:45:24
>     commit: 14286836 at 2004-01-23 10:45:24
>     commit: 14286838 at 2004-01-23 10:45:24
>     commit: 14286850 at 2004-01-23 10:45:24
>     commit: 14286851 at 2004-01-23 10:45:24
>
> Over in the postmaster log, the first sign of trouble is
>
>     Jan 23 10:18:07 canaveral postgres[20039]: [879-1] LOG: could not close temporary statistics file "/usr/local/pgsql/data/global/pgstat.tmp.20035": No space left on device
>
> and there is a steady stream of transactions failing with out-of-space
> errors over the next half hour, but none of the failures are worse than
> a transaction abort.  Finally we see
>
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>     Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>     Jan 23 10:45:24 canaveral postgres[20035]: [5-1] LOG: server process (PID 57237) was terminated by signal 6
>     Jan 23 10:45:24 canaveral postgres[20035]: [6-1] LOG: terminating any other active server processes
>
> after which the postmaster's recovery attempts fail, as Chris already
> detailed.  (Note: the reference to "transaction 0" is not significant;
> that just happens because SimpleLruWritePage doesn't have a specific
> transaction number to blame its write failures on.)
>
> Those are the observed facts; what's the interpretation?  I think it
> shows that Postgres is pretty darn robust, actually.  We were able to
> stay up and do useful work for quite a long time with zero free space;
> what's more, we lost no transactions that were successfully committed.
> The data was successfully stored in preallocated WAL space.  (If things
> had gone on this way for a while longer, we would have panicked for lack
> of WAL space, but Chris was actually not anywhere near there; he'd only
> filled about two WAL segments in the half hour of operations.)  Note
> also that checkpoints were attempted several times during that interval,
> and they all failed gracefully --- no panic, no incorrect WAL update.
>
> But why did this panic finally happen?  The key observation is that the
> first nonexistent page of pg_clog was the page beginning with
> transaction 14286848.  Neither this xact nor the following one has any
> commit or abort record in WAL, but we do see entries for 14286850 and
> 14286851.  It is also notable that there is no WAL entry for extension
> of pg_clog to include this page --- normally a WAL entry is made each
> time a page of zeroes is added to pg_clog.

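It is easy to check that arithmetic.  The sketch below is not part of
Tom's message; it simply redoes the xid-to-page math with the usual
7.4-era clog layout (two status bits per transaction, 8192-byte pages,
32 pages per segment file) and illustrative macro names:

    #include <stdio.h>

    /* Assumed 7.4-era pg_clog layout constants. */
    #define BLCKSZ                 8192
    #define CLOG_XACTS_PER_BYTE    4                              /* 2 bits per xact */
    #define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE) /* 32768 */
    #define SLRU_PAGES_PER_SEGMENT 32

    int
    main(void)
    {
        unsigned int xid  = 14286848;
        unsigned int page = xid / CLOG_XACTS_PER_PAGE;                /* 436 */
        unsigned int seg  = page / SLRU_PAGES_PER_SEGMENT;            /* 13 = 0x0D */
        unsigned int off  = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ; /* 163840 */

        printf("xid %u is %s on clog page %u (segment %04X, offset %u)\n",
               xid, (xid % CLOG_XACTS_PER_PAGE == 0) ? "first" : "not first",
               page, seg, off);

        /* The failing writes were at offset 147456 in pg_clog/000D, i.e.
         * page 13*32 + 147456/8192 = 434 --- two pages before page 436. */
        return 0;
    }

So 14286848 really is the first xid of a clog page (page 436, at offset
163840 in segment file 000D), and offset 147456 in that same file is
page 434, two pages earlier, as the note further down says.
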
> My interpretation of the sequence of events is:
>
> Transaction 14286848 started, and since it was the first for its
> pg_clog page, it tried to do ZeroCLOGPage() for that page (see
> ExtendCLOG).  This required making room in the in-memory clog buffers,
> which required dumping one of the previously-buffered clog pages, which
> failed for lack of disk space, leading to this log entry:
>
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>     Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
>
> (Note: page offset 147456 is the page two before the one containing
> xid 14286848.  This page had been allocated in clog buffers but never
> yet successfully written to disk.  Ditto for the page in between.)
> The next thing that happened was that transaction xids 14286849 and
> 14286850 were assigned (i.e., those xacts started), and then 14286850
> tried to commit.  This again led to a failed attempt to write out a
> clog page, but this time the error was promoted to a panic because it
> happened inside the transaction commit critical section:
>
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
>     Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>
> The final commit record in WAL, from xid 14286851, must have come from
> a different backend that was able to get that far in its commit
> sequence before hearing the all-hands-abandon-ship signal from the
> postmaster.  (AFAICT it was just chance that the same backend process
> was responsible for both 14286848 and 14286850.  Presumably 14286849
> was taken out by yet another backend that hadn't gotten as far as
> trying to commit.)
>
> After Chris had freed some disk space, WAL replay was able to create
> the clog page at offset 147456, because there was a clog-extension WAL
> entry for it within the WAL entries since the last successful
> checkpoint.  It was also able to correctly fill that page using the
> transaction commit data in WAL.  Likewise for the page after that.
> But when it got to the commit record for 14286850, the error checks in
> slru.c barfed because there was no such page, thus exposing the real
> problem: there wasn't a clog-extension WAL record for that page.
>
> In short, the bug is in GetNewTransactionId(), which, shorn of
> extraneous details, looks like
>
>     LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
>
>     xid = ShmemVariableCache->nextXid;
>
>     TransactionIdAdvance(ShmemVariableCache->nextXid);
>
>     ExtendCLOG(xid);
>
>     LWLockRelease(XidGenLock);
>
> and the correct fix is to swap the order of the TransactionIdAdvance
> and ExtendCLOG lines.  Because these lines are out of order, a failure
> occurring down inside ExtendCLOG leaves the shared-memory copy of
> nextXid already advanced, and so subsequent transactions coming
> through this bit of code will see that they are not the first
> transaction in their page and conclude that they need not do any work
> to extend clog.  With the operations in the correct order, ExtendCLOG
> failure will leave the counter unchanged, so every subsequent
> transaction will try to do ExtendCLOG and will fail until some disk
> space becomes available.  (Note that this code is *not* in a critical
> section; indeed, it's not yet inside a transaction at all, so failure
> here does not mean a panic.)

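For concreteness, the reordered sequence Tom describes would look
roughly like this.  This is my paraphrase of the fix, not a patch from
this thread, with the same surrounding details of GetNewTransactionId()
elided as in the snippet above:

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;

    /* Make sure the clog page for this xid exists (and that its
     * extension has been WAL-logged) BEFORE advancing the counter.
     * If this errors out for lack of disk space, nextXid is left
     * unchanged, so the next transaction will retry the extension. */
    ExtendCLOG(xid);

    /* Only now is it safe to advance the shared counter. */
    TransactionIdAdvance(ShmemVariableCache->nextXid);

    LWLockRelease(XidGenLock);
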
> If you like, you can think of the real problem with this code as being
> that it violates the basic WAL rule: "make a WAL entry before making
> permanent changes".  nextXid isn't on disk, but it's persistent
> shared-memory state, and it mustn't be updated until any WAL entries
> associated with that action have been made.
>
> Any questions?
>
>         regards, tom lane

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                 |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073

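One piece of background that makes the failure mode easier to follow:
the "not the first transaction in their page" test Tom refers to is the
guard at the top of ExtendCLOG().  Roughly, from memory of the 7.4-era
clog.c (exact names and details may differ), it looks like:

    void
    ExtendCLOG(TransactionId newestXact)
    {
        int     pageno;

        /* Nothing to do unless this xid is the first one on its clog page. */
        if (TransactionIdToPgIndex(newestXact) != 0 &&
            !TransactionIdEquals(newestXact, FirstNormalTransactionId))
            return;

        pageno = TransactionIdToPage(newestXact);

        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

        /* Zero the new page in the clog buffers and emit the
         * clog-extension WAL record for it. */
        ZeroCLOGPage(pageno, true);

        LWLockRelease(CLogControlLock);
    }

Once the failed ExtendCLOG() call for 14286848 had left nextXid already
advanced, xids 14286849 and 14286850 failed this first-on-page test, so
the extension (and its WAL record) was never attempted again --- which
is exactly why replay later found a commit record for 14286850 with no
clog page to put it in.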