On 10/06/10 17:38, Tom Lane wrote:
Robert Haas<robertmh...@gmail.com>  writes:
On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao<masao.fu...@gmail.com>  wrote:
When an error is found in the WAL streamed from the master, a warning
message is repeated forever in the standby, with no interval between
repeats. This consumes a lot of CPU, and would interfere with read-only
queries. To fix this problem, should we add a sleep into
emode_for_corrupt_record() or somewhere? Or should we stop walreceiver
and retry reading WAL from pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover?  Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?

What "warning message" are we talking about?  All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

We're talking about a corrupt record (incorrect CRC, incorrect backlink, etc.), not errors within redo functions. During crash recovery, a corrupt record means you've reached the end of WAL. In standby mode, when streaming WAL from the master, that shouldn't happen, and it's not clear what to do if it does.

PANIC is not a good idea, at least if the server uses hot standby, because that only makes the situation worse from an availability point of view. So we log the error as a WARNING, and keep retrying. It's unlikely that the problem will just go away, but we keep retrying anyway in the hope that it does. However, it seems that we're too aggressive with the retries.
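
To make "less aggressive" concrete, here is a rough sketch of a retry loop that sleeps between attempts, with a delay that doubles up to a cap. This is not the actual PostgreSQL code or a proposed patch: the names retry_delay_ms(), read_with_backoff(), the callback types, and both constants are hypothetical, and the attempt limit exists only so the example terminates (the standby behaviour described above retries indefinitely).

```c
#include <stdbool.h>

/* Hypothetical tuning constants, not PostgreSQL settings. */
#define RETRY_BASE_DELAY_MS 100
#define RETRY_MAX_DELAY_MS  5000

/*
 * Delay to wait after `failures` consecutive failed reads.
 * Starts at RETRY_BASE_DELAY_MS, doubles per failure, and is
 * capped at RETRY_MAX_DELAY_MS.
 */
int
retry_delay_ms(int failures)
{
    int delay = RETRY_BASE_DELAY_MS;

    if (failures <= 0)
        return 0;
    while (--failures > 0 && delay < RETRY_MAX_DELAY_MS)
        delay *= 2;
    return delay < RETRY_MAX_DELAY_MS ? delay : RETRY_MAX_DELAY_MS;
}

/* Hypothetical callbacks: try to read one record; sleep for ms. */
typedef bool (*read_record_fn) (void *state);
typedef void (*sleep_fn) (int ms);

/*
 * Retry read_record until it succeeds, sleeping between attempts.
 * Returns the number of attempts used, or -1 once max_attempts
 * attempts have all failed.
 */
int
read_with_backoff(read_record_fn read_record, void *state,
                  sleep_fn do_sleep, int max_attempts)
{
    int failures = 0;

    while (failures < max_attempts)
    {
        if (read_record(state))
            return failures + 1;
        failures++;
        /* A real server would log its WARNING here, then back off. */
        do_sleep(retry_delay_ms(failures));
    }
    return -1;
}
```

Even a fixed short sleep would stop the busy loop; the doubling is just one way to keep the first retry quick while limiting log and CPU noise if the corruption persists.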

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
