Re: [HACKERS] warning message in standby

Heikki Linnakangas Fri, 11 Jun 2010 05:32:54 -0700

On 11/06/10 07:18, Fujii Masao wrote:

On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
<[email protected]>  wrote:

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a corrupt
record means you've reached end of WAL. In standby mode, when streaming WAL
from master, that shouldn't happen, and it's not clear what to do if it
does. PANIC is not a good idea, at least if the server uses hot standby,
because that only makes the situation worse from an availability point of view.
So we log the error as a WARNING, and keep retrying. It's unlikely that the
problem will just go away, but we keep retrying anyway in the hope that it
does. However, it seems that we're too aggressive with the retries.


Right. The attached patch calms down the retries: if we find an invalid
record while streaming WAL from master, we sleep for 5 seconds (needs to
be reduced?) before retrying to replay the record at the location where
the invalid one was found. Comments?

Hmm, right now it doesn't even reconnect when it sees a corrupt record
streamed from the master. It's really pointless to retry in that case,
reapplying the exact same piece of WAL surely won't work. I think it
should disconnect, and then retry reading from archive and pg_xlog, and
then retry streaming again. That's pretty hopeless too, but it's at
least theoretically possible that something went wrong in the
transmission and the file in the archive is fine.
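
To make the proposed order concrete, here is a minimal standalone sketch
of one retry cycle: disconnect, try the archive, try pg_xlog, and only
then go back to streaming. It is not the code in xlog.c. The names
StandbyState, SRC_* and recover_from_corrupt_stream() are hypothetical,
and the good_sources bitmask merely simulates which sources happen to
hold an intact copy of the record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical source flags, loosely modeled on the XLOG_FROM_* idea. */
enum
{
	SRC_ARCHIVE = 1 << 0,
	SRC_PG_XLOG = 1 << 1,
	SRC_STREAM = 1 << 2
};

/* Hypothetical stand-in for the standby's view of the walreceiver. */
typedef struct
{
	bool		streaming;		/* is the simulated walreceiver connected? */
	int			disconnects;	/* how many times we have shut it down */
} StandbyState;

/*
 * Run one retry cycle after a corrupt record arrived via streaming.
 *
 * good_sources is an assumption of this sketch: a bitmask of the sources
 * that would return a valid record if asked.  Returns the source that
 * produced a valid record, or 0 if every source failed.
 */
int
recover_from_corrupt_stream(StandbyState *st, int good_sources)
{
	static const int fallback[] = {SRC_ARCHIVE, SRC_PG_XLOG};
	size_t		i;

	assert(st != NULL);

	/* Step 1: drop the connection rather than re-reading the same bytes. */
	if (st->streaming)
	{
		st->streaming = false;
		st->disconnects++;
	}

	/* Step 2: the archived or local copy may be fine. */
	for (i = 0; i < sizeof(fallback) / sizeof(fallback[0]); i++)
	{
		if (good_sources & fallback[i])
			return fallback[i];
	}

	/* Step 3: reconnect and stream the record again. */
	st->streaming = true;
	if (good_sources & SRC_STREAM)
		return SRC_STREAM;

	return 0;					/* still corrupt everywhere; caller retries later */
}
```

With good_sources set to SRC_ARCHIVE, the sketch returns SRC_ARCHIVE and
leaves the receiver disconnected. With good_sources set to 0, it returns
0 with streaming restarted, which is the "pretty hopeless" case above.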


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--- src/backend/access/transam/xlog.c	10 Jun 2010 08:13:50 -0000	1.422
+++ src/backend/access/transam/xlog.c	11 Jun 2010 12:30:36 -0000
@@ -9271,6 +9271,22 @@
 				if (WalRcvInProgress())
 				{
 					/*
+					 * If we find an invalid record in the WAL streamed from
+					 * master, something is seriously wrong. There's little
+					 * chance that the problem will just go away, but PANIC
+					 * is not good for availability either, especially in
+					 * hot standby mode. Disconnect, and retry from
+					 * archive/pg_xlog again. The WAL in the archive should
+					 * be identical to what was streamed, so it's unlikely
+					 * that it helps, but one can hope...
+					 */
+					if (failedSources & XLOG_FROM_STREAM)
+					{
+						ShutdownWalRcv();
+						continue;
+					}
+
+					/*
 					 * While walreceiver is active, wait for new WAL to arrive
 					 * from primary.
 					 */
