On 11/06/10 07:18, Fujii Masao wrote:
> On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
> <heikki.linnakan...@enterprisedb.com>  wrote:
>> We're talking about a corrupt record (incorrect CRC, incorrect backlink
>> etc.), not errors within redo functions. During crash recovery, a corrupt
>> record means you've reached end of WAL. In standby mode, when streaming WAL
>> from master, that shouldn't happen, and it's not clear what to do if it
>> does. PANIC is not a good idea, at least if the server uses hot standby,
>> because that only makes the situation worse from availability point of view.
>> So we log the error as a WARNING, and keep retrying. It's unlikely that the
>> problem will just go away, but we keep retrying anyway in the hope that it
>> does. However, it seems that we're too aggressive with the retries.
>
> Right. The attached patch calms down the retries: if we found an invalid
> record while streaming WAL from master, we sleep for 5 seconds (needs to
> be reduced?) before retrying to replay the record which is in the same
> location where the invalid one was found. Comments?

Hmm, right now it doesn't even reconnect when it sees a corrupt record streamed from the master. Retrying in that case is pointless: reapplying the exact same piece of WAL surely won't work. I think it should disconnect, then retry reading from the archive and pg_xlog, and then retry streaming again. That's pretty hopeless too, but it's at least theoretically possible that something went wrong in transmission and the file in the archive is fine.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
--- src/backend/access/transam/xlog.c	10 Jun 2010 08:13:50 -0000	1.422
+++ src/backend/access/transam/xlog.c	11 Jun 2010 12:30:36 -0000
@@ -9271,6 +9271,22 @@
 				if (WalRcvInProgress())
 				{
 					/*
+					 * If we find an invalid record in the WAL streamed from
+					 * master, something is seriously wrong. There's little
+					 * chance that the problem will just go away, but PANIC
+					 * is not good for availability either, especially in
+					 * hot standby mode. Disconnect, and retry from
+					 * archive/pg_xlog again. The WAL in the archive should
+					 * be identical to what was streamed, so it's unlikely
+					 * that it helps, but one can hope...
+					 */
+					if (failedSources & XLOG_FROM_STREAM)
+					{
+						ShutdownWalRcv();
+						continue;
+					}
+
+					/*
 					 * While walreceiver is active, wait for new WAL to arrive
 					 * from primary.
 					 */