Re: [HACKERS] warning message in standby

2010-07-02 Thread Robert Haas
On Tue, Jun 29, 2010 at 10:58 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 If someone is sloppy about how they copy the WAL files around,
 they could temporarily have a truncated file.

 Can you explain the scenario you're concerned about in more
 detail?

 If someone uses cp or scp to copy a WAL file from the pg_xlog
 directory to an archive directory, there will be a window of time
 where the file exists and is not complete.  If you wait a while
 (that being a duration which could be highly variable, depending on
 the specifics of the environment and copy techniques used), the
 missing part of the file will materialize.

Drat.  It looks like we even recommend that configuration in the docs.

Well, I don't know what to do about this, then, but I predict
somebody's going to get bitten by it... hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Fujii Masao
On Tue, Jun 15, 2010 at 11:35 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On the other hand, I like immediate-panicking. And I don't want the standby
 to retry reconnecting the master infinitely.

On second thought, the peremptory PANIC is not good for HA system. If the
master unfortunately has written an invalid record because of its crash,
the standby would exit with PANIC before performing a failover.

So when an invalid record is found in streamed WAL file, we should keep
the standby running and leave the decision whether the standby retries to
connect to the master forever or shuts down right now, up to the user
(actually, it may be a clusterware)?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Robert Haas
On Tue, Jun 29, 2010 at 3:55 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 11:35 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On the other hand, I like immediate-panicking. And I don't want the standby
 to retry reconnecting the master infinitely.

 On second thought, the peremptory PANIC is not good for HA system. If the
 master unfortunately has written an invalid record because of its crash,
 the standby would exit with PANIC before performing a failover.

I don't think that should ever happen.  The master only streams WAL
that it has fsync'd.  Presumably there's no reason for the master to
ever fsync a partial WAL record (which is usually how a corrupt record
gets into the stream).

 So when an invalid record is found in streamed WAL file, we should keep
 the standby running and leave the decision whether the standby retries to
 connect to the master forever or shuts down right now, up to the user
 (actually, it may be a clusterware)?

Well, if we want to leave it up to the user/clusterware, the current
code is possibly adequate, although there are many different log
messages that could signal this situation, so coding it up might not
be too trivial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Robert Haas
On Tue, Jun 29, 2010 at 6:59 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 29, 2010 at 3:55 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 11:35 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On the other hand, I like immediate-panicking. And I don't want the standby
 to retry reconnecting the master infinitely.

 On second thought, the peremptory PANIC is not good for HA system. If the
 master unfortunately has written an invalid record because of its crash,
 the standby would exit with PANIC before performing a failover.

 I don't think that should ever happen.  The master only streams WAL
 that it has fsync'd.  Presumably there's no reason for the master to
 ever fsync a partial WAL record (which is usually how a corrupt record
 gets into the stream).

 So when an invalid record is found in streamed WAL file, we should keep
 the standby running and leave the decision whether the standby retries to
 connect to the master forever or shuts down right now, up to the user
 (actually, it may be a clusterware)?

 Well, if we want to leave it up to the user/clusterware, the current
 code is possibly adequate, although there are many different log
 messages that could signal this situation, so coding it up might not
 be too trivial.

So here's a patch that seems to implement the behavior I'm thinking of
- if we repeatedly retrieve the same WAL record from the master, and
we never succeed in replaying it, then give up.

It seems we don't have 100% consensus on this, but I thought posting
the patch might inspire some further thoughts.  I'm really
uncomfortable with the idea that if the slave gets out of sync with
the master we'll just do this forever:

FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638
FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638
FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638
FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638

...with this patch, following the above, you get:

FATAL:  invalid record in WAL stream
HINT:  Take a new base backup, or remove recovery.conf and restart in
read-write mode.
LOG:  startup process (PID 6126) exited with exit code 1
LOG:  terminating any other active server processes

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


bound-corrupt-record-retries.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 ...with this patch, following the above, you get:
 
 FATAL:  invalid record in WAL stream
 HINT:  Take a new base backup, or remove recovery.conf and restart
 in read-write mode.
 LOG:  startup process (PID 6126) exited with exit code 1
 LOG:  terminating any other active server processes
 
If someone is sloppy about how they copy the WAL files around, they
could temporarily have a truncated file.  If we want to be tolerant
of straight file copies, without a temporary name or location with a
move on completion, we would need some kind of retry or timeout.  It
appears that you have this hard-coded to five retries.  I'm not
saying this is a bad setting, but I always wonder about hard-coded
magic numbers like this.  What's the delay between retries?  How did
you arrive at five as the magic number?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Robert Haas
On Tue, Jun 29, 2010 at 10:21 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 ...with this patch, following the above, you get:

 FATAL:  invalid record in WAL stream
 HINT:  Take a new base backup, or remove recovery.conf and restart
 in read-write mode.
 LOG:  startup process (PID 6126) exited with exit code 1
 LOG:  terminating any other active server processes

 If someone is sloppy about how they copy the WAL files around, they
 could temporarily have a truncated file.

Can you explain the scenario you're concerned about in more detail?

 If we want to be tolerant
 of straight file copies, without a temporary name or location with a
 move on completion, we would need some kind of retry or timeout.  It
 appears that you have this hard-coded to five retries.  I'm not
 saying this is a bad setting, but I always wonder about hard-coded
 magic numbers like this.  What's the delay between retries?  How did
 you arrive at five as the magic number?

It's approximately the number Josh Berkus suggested in an email
upthread.  In other words, SWAG.

There's not a fixed delay between retries - it represents the number
of times that we have either (a) streamed the relevant chunk from the
master via WALSender, or (b) retrieved the segment from the archive
with restore_command.  The first retry CAN help, if WAL streaming from
master to standby was interrupted unexpectedly.  AFAIK, the additional
retries after that are just paranoia, but I can't rule out the
possibility that I'm missing something, in which case we might have to
rethink the whole approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 If someone is sloppy about how they copy the WAL files around,
 they could temporarily have a truncated file.
 
 Can you explain the scenario you're concerned about in more
 detail?
 
If someone uses cp or scp to copy a WAL file from the pg_xlog
directory to an archive directory, there will be a window of time
where the file exists and is not complete.  If you wait a while
(that being a duration which could be highly variable, depending on
the specifics of the environment and copy techniques used), the
missing part of the file will materialize.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Fujii Masao
On Tue, Jun 29, 2010 at 7:59 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 29, 2010 at 3:55 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 11:35 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On the other hand, I like immediate-panicking. And I don't want the standby
 to retry reconnecting the master infinitely.

 On second thought, the peremptory PANIC is not good for HA system. If the
 master unfortunately has written an invalid record because of its crash,
 the standby would exit with PANIC before performing a failover.

 I don't think that should ever happen.  The master only streams WAL
 that it has fsync'd.  Presumably there's no reason for the master to
 ever fsync a partial WAL record (which is usually how a corrupt record
 gets into the stream).

This is true. But what I'm concerned about is:

1. Backend writes and fsyncs the WAL to the disk
2. The WAL on the disk gets corrupted
3. Walsender reads and sends that corrupted WAL image
4. The master crashes because of the corruption of the disk
5. The standby attempts to replay the corrupted WAL... PANIC

 So when an invalid record is found in streamed WAL file, we should keep
 the standby running and leave the decision whether the standby retries to
 connect to the master forever or shuts down right now, up to the user
 (actually, it may be a clusterware)?

 Well, if we want to leave it up to the user/clusterware, the current
 code is possibly adequate, although there are many different log
 messages that could signal this situation, so coding it up might not
 be too trivial.

So the current code + user-settable-retry-count seems good to me.
If the retry-count is set to 0, we will not see the repeated log
messages. And we might need to provide the parameter specifying
how the standby should behave after exceeding the retry-count:
PANIC or stay-alive-without-retries.

Choosing PANIC and using the retry-count = 5 would cover your proposed
patch.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-29 Thread Robert Haas
On Tue, Jun 29, 2010 at 10:03 PM, Fujii Masao masao.fu...@gmail.com wrote:
 This is true. But what I'm concerned about is:

 1. Backend writes and fsyncs the WAL to the disk
 2. The WAL on the disk gets corrupted
 3. Walsender reads and sends that corrupted WAL image
 4. The master crashes because of the corruption of the disk
 5. The standby attempts to replay the corrupted WAL... PANIC

That sounds like design behavior to me.

 Well, if we want to leave it up to the user/clusterware, the current
 code is possibly adequate, although there are many different log
 messages that could signal this situation, so coding it up might not
 be too trivial.

 So the current code + user-settable-retry-count seems good to me.
 If the retry-count is set to 0, we will not see the repeated log
 messages. And we might need to provide the parameter specifying
 how the standby should behave after exceeding the retry-count:
 PANIC or stay-alive-without-retries.

 Choosing PANIC and using the retry-count = 5 would cover your proposed
 patch.

I'm still having a hard time understanding why anyone would want to
configure this value as infinity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Heikki Linnakangas

On 12/06/10 04:19, Bruce Momjian wrote:

Robert Haas wrote:

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message


+1.  I was making this same argument (less eloquently) upthread.
I particularly like the errhint().


I am wondering what action would be most likely to get the
administrator's attention.


I've committed the patch to disconnect the SR connection in that case. 
If the message needs improvement, let's do that separately once we 
figure out what to do.


Seems like we need something like WARNING that doesn't cause the process 
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a 
hint to the warning will do. How about


WARNING:  invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of 
corruption in the master, or inconsistency between master and standby 
state. The record will be re-fetched, but that is unlikely to fix the 
problem. You may have to restore standby from base backup.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Bruce Momjian
Heikki Linnakangas wrote:
 On 12/06/10 04:19, Bruce Momjian wrote:
  Robert Haas wrote:
  If my streaming replication stops working, I want to know about it as
  soon as possible. WARNING just doesn't cut it.
 
  This needs some better thought.
 
  If we PANIC, then surely it will PANIC again when we restart unless we
  do something. So we can't do that. But we need to do something better
  than
 
  WARNING there is a bug that will likely cause major data loss
  HINT you'll be sacked if you miss this message
 
  +1.  I was making this same argument (less eloquently) upthread.
  I particularly like the errhint().
 
  I am wondering what action would be most likely to get the
  administrator's attention.
 
 I've committed the patch to disconnect the SR connection in that case. 
 If the message needs improvement, let's do that separately once we 
 figure out what to do.
 
 Seems like we need something like WARNING that doesn't cause the process 
 to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a 
 hint to the warning will do. How about
 
 WARNING:  invalid record length at 0/4005330
 HINT: An invalid record was streamed from master. That can be a sign of 
 corruption in the master, or inconsistency between master and standby 
 state. The record will be re-fetched, but that is unlikely to fix the 
 problem. You may have to restore standby from base backup.

I am thinking about log monitoring tools like Nagios.  I am afraid
they are never going to pick up something tagged WARNING, no matter
what the wording is.  Crazy idea, but can we force a fatal error line
into the logs with something like WARNING ...\nFATAL: 

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Magnus Hagander
On Mon, Jun 14, 2010 at 12:16, Bruce Momjian br...@momjian.us wrote:
 Heikki Linnakangas wrote:
 On 12/06/10 04:19, Bruce Momjian wrote:
  Robert Haas wrote:
  If my streaming replication stops working, I want to know about it as
  soon as possible. WARNING just doesn't cut it.
 
  This needs some better thought.
 
  If we PANIC, then surely it will PANIC again when we restart unless we
  do something. So we can't do that. But we need to do something better
  than
 
  WARNING there is a bug that will likely cause major data loss
  HINT you'll be sacked if you miss this message
 
  +1.  I was making this same argument (less eloquently) upthread.
  I particularly like the errhint().
 
  I am wondering what action would be most likely to get the
  administrator's attention.

 I've committed the patch to disconnect the SR connection in that case.
 If the message needs improvement, let's do that separately once we
 figure out what to do.

 Seems like we need something like WARNING that doesn't cause the process
 to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
 hint to the warning will do. How about

 WARNING:  invalid record length at 0/4005330
 HINT: An invalid record was streamed from master. That can be a sign of
 corruption in the master, or inconsistency between master and standby
 state. The record will be re-fetched, but that is unlikely to fix the
 problem. You may have to restore standby from base backup.

 I am thinking about log monitoring tools like Nagios.  I am afraid
 they are never going to pick up something tagged WARNING, no matter

If they are properly configured, I imagine they would. And if they're
not, well, there's not much for us to do.

(What would be more usful then would be to separate user-warnings
like warnings about cast from actual system-warnings like this, but
that's a whole different story)

 what the wording is.  Crazy idea, but can we force a fatal error line
 into the logs with something like WARNING ...\nFATAL: 

That's way too crazy :P And btw, randomly sticking newlines into that
will mess up *most* log displayers and I bet a lot of the log
monitoring tools as well...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Bruce Momjian
Magnus Hagander wrote:
  Seems like we need something like WARNING that doesn't cause the process
  to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
  hint to the warning will do. How about
 
  WARNING: ?invalid record length at 0/4005330
  HINT: An invalid record was streamed from master. That can be a sign of
  corruption in the master, or inconsistency between master and standby
  state. The record will be re-fetched, but that is unlikely to fix the
  problem. You may have to restore standby from base backup.
 
  I am thinking about log monitoring tools like Nagios. ?I am afraid
  they are never going to pick up something tagged WARNING, no matter
 
 If they are properly configured, I imagine they would. And if they're
 not, well, there's not much for us to do.

What does that mean?

 (What would be more usful then would be to separate user-warnings
 like warnings about cast from actual system-warnings like this, but
 that's a whole different story)
 
  what the wording is. ?Crazy idea, but can we force a fatal error line
  into the logs with something like WARNING ...\nFATAL: 
 
 That's way too crazy :P And btw, randomly sticking newlines into that
 will mess up *most* log displayers and I bet a lot of the log
 monitoring tools as well...

Yeah, it would mess up CSV log output, for example?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Magnus Hagander
On Mon, Jun 14, 2010 at 13:11, Bruce Momjian br...@momjian.us wrote:
 Magnus Hagander wrote:
  Seems like we need something like WARNING that doesn't cause the process
  to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
  hint to the warning will do. How about
 
  WARNING: ?invalid record length at 0/4005330
  HINT: An invalid record was streamed from master. That can be a sign of
  corruption in the master, or inconsistency between master and standby
  state. The record will be re-fetched, but that is unlikely to fix the
  problem. You may have to restore standby from base backup.
 
  I am thinking about log monitoring tools like Nagios. ?I am afraid
  they are never going to pick up something tagged WARNING, no matter

 If they are properly configured, I imagine they would. And if they're
 not, well, there's not much for us to do.

 What does that mean?

It means that we can't prevent people from configuring their tools to
ignore important warning. We can't prevent them rom ignoring ERROR or
FATAL either...


 (What would be more usful then would be to separate user-warnings
 like warnings about cast from actual system-warnings like this, but
 that's a whole different story)

  what the wording is. ?Crazy idea, but can we force a fatal error line
  into the logs with something like WARNING ...\nFATAL: 

 That's way too crazy :P And btw, randomly sticking newlines into that
 will mess up *most* log displayers and I bet a lot of the log
 monitoring tools as well...

 Yeah, it would mess up CSV log output, for example?

Possibly. And your general syslog output would become pretty hard to
read once this kind of stuff shows up.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Bruce Momjian
Magnus Hagander wrote:
 On Mon, Jun 14, 2010 at 13:11, Bruce Momjian br...@momjian.us wrote:
  Magnus Hagander wrote:
   Seems like we need something like WARNING that doesn't cause the process
   to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
   hint to the warning will do. How about
  
   WARNING: ?invalid record length at 0/4005330
   HINT: An invalid record was streamed from master. That can be a sign of
   corruption in the master, or inconsistency between master and standby
   state. The record will be re-fetched, but that is unlikely to fix the
   problem. You may have to restore standby from base backup.
  
   I am thinking about log monitoring tools like Nagios. ?I am afraid
   they are never going to pick up something tagged WARNING, no matter
 
  If they are properly configured, I imagine they would. And if they're
  not, well, there's not much for us to do.
 
  What does that mean?
 
 It means that we can't prevent people from configuring their tools to
 ignore important warning. We can't prevent them rom ignoring ERROR or
 FATAL either...

My point is that most tools are going to look at the tag first to
determine the severity of the message, and might even have
log_min_messages set to ignore warnings.  Esentially this message would
have to be special-cased for log filtering, and considering translation
issues, etc, it seems pretty odd.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 7:18 AM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Jun 14, 2010 at 13:11, Bruce Momjian br...@momjian.us wrote:
 Magnus Hagander wrote:
  Seems like we need something like WARNING that doesn't cause the process
  to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
  hint to the warning will do. How about
 
  WARNING: ?invalid record length at 0/4005330
  HINT: An invalid record was streamed from master. That can be a sign of
  corruption in the master, or inconsistency between master and standby
  state. The record will be re-fetched, but that is unlikely to fix the
  problem. You may have to restore standby from base backup.
 
  I am thinking about log monitoring tools like Nagios. ?I am afraid
  they are never going to pick up something tagged WARNING, no matter

 If they are properly configured, I imagine they would. And if they're
 not, well, there's not much for us to do.

 What does that mean?

 It means that we can't prevent people from configuring their tools to
 ignore important warning. We can't prevent them rom ignoring ERROR or
 FATAL either...

Right.  Certainly, ERROR would be better than WARNING, though, because
someone, somewhere out there has a log-fitering tool that extracts
ERRORs but ignore WARNINGs.

What still bugs me about this situation is that we're essentially
trying futilely to recover from what's really a fatal error.  There is
no manner of proceeding that has any hope of success, yet we just keep
hopelessly retrying.  Why do we do that here and not elsewhere?  By
the logic we're using here, we ought to retry when we hit a division
by zero error.  Maybe the next time we read the second input value it
will have some bits set...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Heikki Linnakangas

On 14/06/10 13:16, Bruce Momjian wrote:

Heikki Linnakangas wrote:

On 12/06/10 04:19, Bruce Momjian wrote:

Robert Haas wrote:

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message


+1.  I was making this same argument (less eloquently) upthread.
I particularly like the errhint().


I am wondering what action would be most likely to get the
administrator's attention.


I've committed the patch to disconnect the SR connection in that case.
If the message needs improvement, let's do that separately once we
figure out what to do.

Seems like we need something like WARNING that doesn't cause the process
to die, but more alarming like ERROR/FATAL/PANIC. Or maybe just adding a
hint to the warning will do. How about

WARNING:  invalid record length at 0/4005330
HINT: An invalid record was streamed from master. That can be a sign of
corruption in the master, or inconsistency between master and standby
state. The record will be re-fetched, but that is unlikely to fix the
problem. You may have to restore standby from base backup.


I am thinking about log monitoring tools like Nagios.  I am afraid
they are never going to pick up something tagged WARNING, no matter
what the wording is.


One idea is for the startup process to signal walreceiver process to 
commit suicide with FATAL, instead of just dying silently like it does 
now. So you'd get a WARNING explaining how the record was corrupt, 
followed by a FATAL from the walreceiver process:


WARNING:  invalid record length at 0/4005330
FATAL: walreceiver killed because of error in WAL stream


 Crazy idea, but can we force a fatal error line
into the logs with something like WARNING ...\nFATAL: 


Yeah, that's crazy.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Magnus Hagander wrote:
 It means that we can't prevent people from configuring their tools to
 ignore important warning. We can't prevent them rom ignoring ERROR or
 FATAL either...

 My point is that most tools are going to look at the tag first to
 determine the severity of the message, and might even have
 log_min_messages set to ignore warnings.

Why is this discussion based on the idea that we have to cater to
incorrectly written log-filtering apps?

The correct log level for this message is LOG.  End of discussion.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bruce Momjian br...@momjian.us writes:
 Magnus Hagander wrote:
 It means that we can't prevent people from configuring their tools to
 ignore important warning. We can't prevent them rom ignoring ERROR or
 FATAL either...

 My point is that most tools are going to look at the tag first to
 determine the severity of the message, and might even have
 log_min_messages set to ignore warnings.

 Why is this discussion based on the idea that we have to cater to
 incorrectly written log-filtering apps?

 The correct log level for this message is LOG.  End of discussion.

Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 14, 2010 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 The correct log level for this message is LOG.  End of discussion.

 Why?

Because it's not being issued in a user's session.  The only place it
can go is to the system log, and if you use a level of WARNING or less,
it's likely to get filtered out by log_min_messages.

I'm totally unimpressed by the argument that log-filtering applications
don't know enough to pay attention to LOG messages.  There are already a
lot of those that are quite important to notice.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 10:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 14, 2010 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 The correct log level for this message is LOG.  End of discussion.

 Why?

 Because it's not being issued in a user's session.  The only place it
 can go is to the system log, and if you use a level of WARNING or less,
 it's likely to get filtered out by log_min_messages.

 I'm totally unimpressed by the argument that log-filtering applications
 don't know enough to pay attention to LOG messages.  There are already a
 lot of those that are quite important to notice.

I'm willing to buy the above, but nobody has explained to my
satisfaction why it's remotely sane to go into an infinite retry loop
on an unrecoverable error.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Bruce Momjian
Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Mon, Jun 14, 2010 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  The correct log level for this message is LOG. ?End of discussion.
 
  Why?
 
 Because it's not being issued in a user's session.  The only place it
 can go is to the system log, and if you use a level of WARNING or less,
 it's likely to get filtered out by log_min_messages.
 
 I'm totally unimpressed by the argument that log-filtering applications
 don't know enough to pay attention to LOG messages.  There are already a
 lot of those that are quite important to notice.

My point was that log filtering applications might ignore WARNING.  They
don't usually ignore LOG.
8
-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I'm willing to buy the above, but nobody has explained to my
 satisfaction why it's remotely sane to go into an infinite retry loop
 on an unrecoverable error.

That's a different question altogether ;-).  I assume you're not
satisfied by the change Heikki committed a couple hours ago?
It will at least try to do something to recover.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I'm willing to buy the above, but nobody has explained to my
 satisfaction why it's remotely sane to go into an infinite retry loop
 on an unrecoverable error.

 That's a different question altogether ;-).  I assume you're not
 satisfied by the change Heikki committed a couple hours ago?
 It will at least try to do something to recover.

Yeah, I'm not satisfied by that.  It's an improvement in the technical
sense - it replaces an infinite retry that spins at top speed with a
slower retry that won't flog your CPU quite so badly, but the chances
that it will actually succeed in correcting the underlying problem
seem infinitesimal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Simon Riggs
On Mon, 2010-06-14 at 10:30 -0400, Tom Lane wrote:

 I'm totally unimpressed by the argument that log-filtering
 applications don't know enough to pay attention to LOG messages.
 There are already a lot of those that are quite important to notice.

We have a log level where 1 log entry in a million is something serious
and all the rest are generally ignorable. That situation leads to the
cognitive phenomena is known as normalization of deviance, a term
coined in the wake of the Challenger disaster.

Should I be downgrading Hot Standby breakages to LOG? That will
certainly help high availability as well.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 14, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That's a different question altogether ;-).  I assume you're not
 satisfied by the change Heikki committed a couple hours ago?
 It will at least try to do something to recover.

 Yeah, I'm not satisfied by that.  It's an improvement in the technical
 sense - it replaces an infinite retry that spins at top speed with a
 slower retry that won't flog your CPU quite so badly, but the chances
 that it will actually succeed in correcting the underlying problem
 seem infinitesimal.

I'm not sure about that.  walreceiver will refetch from the start of the
current WAL page, so there's at least some chance of getting a good copy
when we didn't have one before.

However, I do agree that it's not helpful to loop forever.  If we can
easily make it retry once and then PANIC, I'd be for that --- otherwise
I tend to agree that the best thing is just to PANIC immediately.  There
are many many situations where a slave resync will be necessary; a
transmission error on the WAL data is just one more.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 10:57 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 14, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That's a different question altogether ;-).  I assume you're not
 satisfied by the change Heikki committed a couple hours ago?
 It will at least try to do something to recover.

 Yeah, I'm not satisfied by that.  It's an improvement in the technical
 sense - it replaces an infinite retry that spins at top speed with a
 slower retry that won't flog your CPU quite so badly, but the chances
 that it will actually succeed in correcting the underlying problem
 seem infinitesimal.

 I'm not sure about that.  walreceiver will refetch from the start of the
 current WAL page, so there's at least some chance of getting a good copy
 when we didn't have one before.

The testing that I have been doing while we've been discussing this
reveals that you are correct.  I set up an HS/SR master and slave
(running on the same machine), ran pgbench on the master, and then
started randomly sending SIGSEGV to one of the master's backends.  It
seems that complaints about the WAL are possible on both master and
slave.  Here are a couple from the slave:

LOG:  unexpected pageaddr 0/89B7A000 in log file 0, segment 152, offset 12034048
WARNING:  there is no contrecord flag in log file 0, segment 136, offset 2523136
LOG:  invalid magic number  in log file 0, segment 136, offset 2531328

The slave reconnects and then things get better.  So I think your idea
of retrying once and then panicking is probably best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Should I be downgrading Hot Standby breakages to LOG? That will
 certainly help high availability as well.

If a message is being issued in a non-user-connected session, there
is basically not a lot of point in WARNING or below.  It should either
be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
thing in the startup process...)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Should I be downgrading Hot Standby breakages to LOG? That will
 certainly help high availability as well.

 If a message is being issued in a non-user-connected session, there
 is basically not a lot of point in WARNING or below.  It should either
 be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
 thing in the startup process...)

I think Simon's point here is the same as mine - LOG isn't too high -
it's too low.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Simon Riggs
On Mon, 2010-06-14 at 11:14 -0400, Robert Haas wrote:
 On Mon, Jun 14, 2010 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Should I be downgrading Hot Standby breakages to LOG? That will
  certainly help high availability as well.
 
  If a message is being issued in a non-user-connected session, there
  is basically not a lot of point in WARNING or below.  It should either
  be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
  thing in the startup process...)
 
 I think Simon's point here is the same as mine - LOG isn't too high -
 it's too low.

Yes, *and* how do we decide which this is?

Should I downgrade all of my code to throwing LOGs rather than ERRORs,
because (following the same argument) doing so would be better for high
availability? It's not a facetious question.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 14, 2010 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 If a message is being issued in a non-user-connected session, there
 is basically not a lot of point in WARNING or below.  It should either
 be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
 thing in the startup process...)

 I think Simon's point here is the same as mine - LOG isn't too high -
 it's too low.

If he wants to throw ERROR, that's fine with me.  If your point is that
you think WARNING is more severe than LOG, the answer is that you need
to readjust your thinking.  It's not.  See the sort order for
log_min_messages values.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 11:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2010-06-14 at 11:14 -0400, Robert Haas wrote:
 On Mon, Jun 14, 2010 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Should I be downgrading Hot Standby breakages to LOG? That will
  certainly help high availability as well.
 
  If a message is being issued in a non-user-connected session, there
  is basically not a lot of point in WARNING or below.  It should either
  be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
  thing in the startup process...)

 I think Simon's point here is the same as mine - LOG isn't too high -
 it's too low.

 Yes, *and* how do we decide which this is?

 Should I downgrade all of my code to throwing LOGs rather than ERRORs,
 because (following the same argument) doing so would be better for high
 availability? It's not a facetious question.

I would say - definitely not.  High availability is all well and good,
but if things are broken under the hood, it's not really so good after
all.  We've traditionally put data integrity and system reliability
ahead of (for example) giving some answer to a query, even if it's the
wrong answer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 14, 2010 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Should I be downgrading Hot Standby breakages to LOG? That will
 certainly help high availability as well.

 If a message is being issued in a non-user-connected session, there
 is basically not a lot of point in WARNING or below.  It should either
 be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
 thing in the startup process...)

 I think Simon's point here is the same as mine - LOG isn't too high -
 it's too low.

log_min_messages = warning  # values in order of decreasing detail:
#   notice
#   warning
#   error
#   log
#   fatal
#   panic

I've left out some lines, but the ones I left are in the right order and
there's nothing missing in the range. So WARNING  ERROR  LOG  FATAL,
right?

If that's the case, I guess Tom's right, once more, saying that LOG is
fine here. If we want to be more subtle than that, we'd need to revise
each and every error message and attribute it the right level, which it
probably have already anyway.

Regards.
-- 
dim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Simon Riggs
On Mon, 2010-06-14 at 18:11 +0200, Dimitri Fontaine wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Mon, Jun 14, 2010 at 11:09 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Should I be downgrading Hot Standby breakages to LOG? That will
  certainly help high availability as well.
 
  If a message is being issued in a non-user-connected session, there
  is basically not a lot of point in WARNING or below.  It should either
  be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
  thing in the startup process...)
 
  I think Simon's point here is the same as mine - LOG isn't too high -
  it's too low.
 
 log_min_messages = warning# values in order of decreasing detail:
   #   notice
   #   warning
   #   error
   #   log
   #   fatal
   #   panic
 
 I've left out some lines, but the ones I left are in the right order and
 there's nothing missing in the range. So WARNING  ERROR  LOG  FATAL,
 right?
 
 If that's the case, I guess Tom's right, once more, saying that LOG is
 fine here. If we want to be more subtle than that, we'd need to revise
 each and every error message and attribute it the right level, which it
 probably have already anyway.

Nobody is arguing with what Tom has said about log levels.

The problem is that LOG already has many things like performance logging
which aren't a problem as all. So we need a level between LOG and FATAL
to draw anyone's attention.

@Robert - I'd point out that the behaviour of archive_cleanup_command
and recovery_end_command is broken as a result of this discussion.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 12:31 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If that's the case, I guess Tom's right, once more, saying that LOG is
 fine here. If we want to be more subtle than that, we'd need to revise
 each and every error message and attribute it the right level, which it
 probably have already anyway.

 Nobody is arguing with what Tom has said about log levels.

Agreed.

 The problem is that LOG already has many things like performance logging
 which aren't a problem as all. So we need a level between LOG and FATAL
 to draw anyone's attention.

Not sure I agree with this - what I think the problem is here is we
need to make a clear distinction between recoverable errors and
unrecoverable errors.

 @Robert - I'd point out that the behaviour of archive_cleanup_command
 and recovery_end_command is broken as a result of this discussion.

:-(

How so?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Not sure I agree with this - what I think the problem is here is we
 need to make a clear distinction between recoverable errors and
 unrecoverable errors.

Um, if it's recoverable, it's not really an error ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 1:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Not sure I agree with this - what I think the problem is here is we
 need to make a clear distinction between recoverable errors and
 unrecoverable errors.

 Um, if it's recoverable, it's not really an error ...

Gah.  This thread is a terminological mess.  I'm using the term
error generally, not in the PG-specific sense of ERROR.

problems after which there is a sane way to continue and problems
after which there isn't

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Simon Riggs
On Mon, 2010-06-14 at 11:09 -0400, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  Should I be downgrading Hot Standby breakages to LOG? That will
  certainly help high availability as well.
 
 If a message is being issued in a non-user-connected session, there
 is basically not a lot of point in WARNING or below.  It should either
 be LOG, or ERROR/FATAL/PANIC (which are probably all about the same
 thing in the startup process...)

This looks like a significant issue to me. 

Code has been written for 9.0 that assumes WARNING  LOG. I've pointed
out two places in SR code, I'm pretty sure there are others.

There is also lots of pre-existing code where we run the same code in a
backend and in a server process, for example autovacuum. In those cases,
all the WARNINGs issued in vacuum.c, for example, will all be filtered
by the current log-level scheme.

LOG is already over-used and so anything said at that level is drowned.
In many areas of code we cannot use a higher level without trauma. That
is a problem since we have no way to separate the truly important from
the barely interesting.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 LOG is already over-used and so anything said at that level is drowned.

This is nonsense.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 
 LOG is already over-used and so anything said at that level is
 drowned.  In many areas of code we cannot use a higher level
 without trauma. That is a problem since we have no way to separate
 the truly important from the barely interesting.
 
The fact that LOG is categorized the same as INFO has led me to
believe that they are morally equivalent -- that the only reason
both exist is that one has entries of interest to system
administrators and the other has interest to clients.
 
http://www.postgresql.org/docs/8.4/interactive/runtime-config-logging.html#RUNTIME-CONFIG-SEVERITY-LEVELS
 
Our shop chooses to log all connections and disconnections.  That's
mixed with such as which clients broke their connections without
proper handshaking.  I am surprised to hear that any time-critical
alerts would be logged at this level, versus the sort of information
you might want for forensic purposes.  Perhaps anything which
shouldn't be categorized as INFO to syslog should have some other
(new?) level.  ALERT, maybe?  Mapping to ERR?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Simon Riggs si...@2ndquadrant.com wrote:
 LOG is already over-used and so anything said at that level is
 drowned.  In many areas of code we cannot use a higher level
 without trauma. That is a problem since we have no way to separate
 the truly important from the barely interesting.
 
 The fact that LOG is categorized the same as INFO has led me to
 believe that they are morally equivalent --

They are not morally equivalent.  INFO is for output that the user
has explicitly requested appear on his console (eg, via VACUUM VERBOSE).
So it's high priority for client output, not so much for log output.
LOG is for information that is considered high priority to log, but
not so much for client output (indeed maybe there is no client to output
it to).

If LOG is over-used then the problem is that we have LOG messages that
ought to be downgraded to DEBUG.  A normally functioning system should
not be emitting *any* LOG messages during routine business, other than
ones that the user explicitly requested (like log_connections).  We have
had this discussion in the past, in response to complaints that there
was too much log traffic, and have downgraded things that were poorly
categorized as LOG.  Maybe there are some more of them in the HS/SR
code.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Josh Berkus
On 6/14/10 7:57 AM, Tom Lane wrote:
 However, I do agree that it's not helpful to loop forever.  If we can
 easily make it retry once and then PANIC, I'd be for that --- otherwise
 I tend to agree that the best thing is just to PANIC immediately.  There
 are many many situations where a slave resync will be necessary; a
 transmission error on the WAL data is just one more

I'd vote for 3 or 5 times.  Whatever is the same as we do with trying to
get the log files.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Magnus Hagander
On Mon, Jun 14, 2010 at 20:22, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 LOG is already over-used and so anything said at that level is drowned.

 This is nonsense.

Whether it's over-used or not may be, but that doesn't make the
general issue nonsense.

But the fact is that having LOG at a higher priority than WARNING and
ERROR is un-intuitive in the extreme, and I bet there aren't many
non-experts who realize this. I don't know of any other software that
does this. Heck, *experienced* postgresql users keep getting this
wrong. Perhaps we need to re-think the naming there?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 
 The fact that LOG is categorized the same as INFO has led me to
 believe that they are morally equivalent --
 
 They are not morally equivalent.  INFO is for output that the user
 has explicitly requested appear on his console (eg, via VACUUM
 VERBOSE).  So it's high priority for client output, not so much
 for log output.  LOG is for information that is considered high
 priority to log, but not so much for client output (indeed maybe
 there is no client to output it to).
 
I didn't say I thought they meant the same thing.  You omitted the
part where I said
 
 -- that the only reason
 both exist is that one has entries of interest to system
 administrators and the other has interest to clients.
 
Which I think is saying the same thing you just said.
 
 If LOG is over-used then the problem is that we have LOG
 messages that ought to be downgraded to DEBUG.  A normally
 functioning system should not be emitting *any* LOG messages
 during routine business, other than ones that the user explicitly
 requested (like log_connections).
 
Current LOG level entries may be something which DBAs find as useful
as you find the xmin/xmax breadcrumbs in tuples -- we normally don't
want to look at all of these LOG entries, but we want them there to
review in case of questions or problems.  If anything of a
time-critical nature is being written at the LOG level, I view that
as a serious issue.  We *should* have some logging level available
to server processes which maps to something more serious than INFO
in syslog and INFORMATION in the Windows event log but which doesn't
indicate that something is being terminated.  Perhaps we are
currently using FATAL for these and I'm misunderstanding the issue. 
Even so, there seems to be a lot of room between LOG and FATAL,
conceptually -- at least in the connotations those words have for
me.
 
ALERT keeps coming to mind.  Maybe best mapped to NOTICE for syslog
and WARNING for Windows event log.  That seems to fill a gap in the
logging levels.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Fujii Masao
On Tue, Jun 15, 2010 at 12:09 AM, Robert Haas robertmh...@gmail.com wrote:
 The testing that I have been doing while we've been discussing this
 reveals that you are correct.  I set up an HS/SR master and slave
 (running on the same machine), ran pgbench on the master, and then
 started randomly sending SIGSEGV to one of the master's backends.  It
 seems that complaints about the WAL are possible on both master and
 slave.  Here are a couple from the slave:

 LOG:  unexpected pageaddr 0/89B7A000 in log file 0, segment 152, offset 
 12034048
 WARNING:  there is no contrecord flag in log file 0, segment 136, offset 
 2523136
 LOG:  invalid magic number  in log file 0, segment 136, offset 2531328

 The slave reconnects and then things get better.  So I think your idea
 of retrying once and then panicking is probably best.

AFAIR, in the previous discussion, some people think that it's better to
keep the standby open for read-only queries even if an error is found.
Panicking would be undesirable for them.

On the other hand, I like immediate-panicking. And I don't want the standby
to retry reconnecting the master infinitely.

To cover all the use cases, how about introducing new parameter specifying
the maximum number of times to retry reconnecting? If we like the retry-once-
then-panicking idea, we can set the parameter to one. If we'd like to keep
the standby open infinitely, we can set it to the very large value (or -1
meaning infinite retrying). If we think that immediate-panicking is the best,
we can set it to zero. Thought?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 10:35 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 12:09 AM, Robert Haas robertmh...@gmail.com wrote:
 The testing that I have been doing while we've been discussing this
 reveals that you are correct.  I set up an HS/SR master and slave
 (running on the same machine), ran pgbench on the master, and then
 started randomly sending SIGSEGV to one of the master's backends.  It
 seems that complaints about the WAL are possible on both master and
 slave.  Here are a couple from the slave:

 LOG:  unexpected pageaddr 0/89B7A000 in log file 0, segment 152, offset 
 12034048
 WARNING:  there is no contrecord flag in log file 0, segment 136, offset 
 2523136
 LOG:  invalid magic number  in log file 0, segment 136, offset 2531328

 The slave reconnects and then things get better.  So I think your idea
 of retrying once and then panicking is probably best.

 AFAIR, in the previous discussion, some people think that it's better to
 keep the standby open for read-only queries even if an error is found.
 Panicking would be undesirable for them.

 On the other hand, I like immediate-panicking. And I don't want the standby
 to retry reconnecting the master infinitely.

 To cover all the use cases, how about introducing new parameter specifying
 the maximum number of times to retry reconnecting? If we like the retry-once-
 then-panicking idea, we can set the parameter to one. If we'd like to keep
 the standby open infinitely, we can set it to the very large value (or -1
 meaning infinite retrying). If we think that immediate-panicking is the best,
 we can set it to zero. Thought?

I think that would be overkill.  If the user wants to use the slave to
answer queries even though recovery can't continue, they can always
remove recovery.conf.

It seems pretty optimistic to assume that this will be useful very
often, though.  Most people will need their slave to be up to date
(otherwise why bother with Hot Standby in the first place?).  I feel
pretty safe in predicting that if the master and slave get out of
sync, most administrators are going to want to take a new base backup
immediately.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Simon Riggs
On Thu, 2010-06-10 at 09:57 -0400, Robert Haas wrote:
 On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com
 wrote:
  When an error is found in the WAL streamed from the master, a
 warning
  message is repeated without interval forever in the standby. This
  consumes CPU load very much, and would interfere with read-only
 queries.
  To fix this problem, we should add a sleep into
 emode_for_corrupt_record()
  or somewhere? Or we should stop walreceiver and retry to read WAL
 from
  pg_xlog or the archive?
 
 I ran into this problem at one point, too, but was in the middle of
 trying to investigate a different bug and didn't have time to track
 down what was causing it.

So you saw a bug, but didn't report it? Why was that?

I find it amazing that such an obvious problem wasn't covered in
testing, nor do we have a clear plan in the design for handling it.

How many other things are known about but not reported?

If anybody has bugs, report them now.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Heikki Linnakangas

On 11/06/10 07:18, Fujii Masao wrote:

On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

We're talking about a corrupt record (incorrect CRC, incorrect backlink
etc.), not errors within redo functions. During crash recovery, a corrupt
record means you've reached end of WAL. In standby mode, when streaming WAL
from master, that shouldn't happen, and it's not clear what to do if it
does. PANIC is not a good idea, at least if the server uses hot standby,
because that only makes the situation worse from availability point of view.
So we log the error as a WARNING, and keep retrying. It's unlikely that the
problem will just go away, but we keep retrying anyway in the hope that it
does. However, it seems that we're too aggressive with the retries.


Right. The attached patch calms down the retries: if we found an invalid
record while streaming WAL from master, we sleep for 5 seconds (needs to
be reduced?) before retrying to replay the record which is in the same
location where the invalid one was found. Comments?


Hmm, right now it doesn't even reconnect when it sees a corrupt record 
streamed from the master. It's really pointless to retry in that case, 
reapplying the exact same piece of WAL surely won't work. I think it 
should disconnect, and then retry reading from archive and pg_xlog, and 
then retry streaming again. That's pretty hopeless too, but it's at 
least theoretically possible that something went wrong in the 
transmission and the file in the archive is fine.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
--- src/backend/access/transam/xlog.c	10 Jun 2010 08:13:50 -	1.422
+++ src/backend/access/transam/xlog.c	11 Jun 2010 12:30:36 -
@@ -9271,6 +9271,22 @@
 if (WalRcvInProgress())
 {
 	/*
+	 * If we find an invalid record in the WAL streamed from
+	 * master, something is seriously wrong. There's little
+	 * chance that the problem will just go away, but PANIC
+	 * is not good for availability either, especially in
+	 * hot standby mode. Disconnect, and retry from
+	 * archive/pg_xlog again. The WAL in the archive should
+	 * be identical to what was streamed, so it's unlikely
+	 * that it helps, but one can hope...
+	 */
+	if (failedSources  XLOG_FROM_STREAM)
+	{
+		ShutdownWalRcv();
+		continue;
+	}
+
+	/*
 	 * While walreceiver is active, wait for new WAL to arrive
 	 * from primary.
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Robert Haas
On Fri, Jun 11, 2010 at 8:19 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-06-10 at 09:57 -0400, Robert Haas wrote:
 On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com
 wrote:
  When an error is found in the WAL streamed from the master, a
 warning
  message is repeated without interval forever in the standby. This
  consumes CPU load very much, and would interfere with read-only
 queries.
  To fix this problem, we should add a sleep into
 emode_for_corrupt_record()
  or somewhere? Or we should stop walreceiver and retry to read WAL
 from
  pg_xlog or the archive?

 I ran into this problem at one point, too, but was in the middle of
 trying to investigate a different bug and didn't have time to track
 down what was causing it.

 So you saw a bug, but didn't report it? Why was that?

Because it happened while I was in the middle of trying to investigate
a different bug, and I didn't know precisely what I had done to cause
it.  I could have reported it anyway, but I thought it would not have
been that helpful to report a bug that I didn't know for sure how to
reproduce.  Like everyone else here, I am working very hard and do not
always have time to follow up on every issue.  I had something else
weird happen yesterday but I'm not sure I know exactly what I did and
I'm not at all sure that it wasn't just pilot error.  If I start
reporting every pilot error as a bug, I'm going to get laughed out of
town, because I make a pilot error *every time* I tried to set up SR
w/HS.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Fujii Masao
On Fri, Jun 11, 2010 at 9:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Hmm, right now it doesn't even reconnect when it sees a corrupt record
 streamed from the master. It's really pointless to retry in that case,
 reapplying the exact same piece of WAL surely won't work. I think it should
 disconnect, and then retry reading from archive and pg_xlog, and then retry
 streaming again. That's pretty hopeless too, but it's at least theoretically
 possible that something went wrong in the transmission and the file in the
 archive is fine.

Yep, this makes more sense.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Simon Riggs
On Thu, 2010-06-10 at 19:01 +0300, Heikki Linnakangas wrote:
 
  What warning message are we talking about?  All the error cases I can
  think of in WAL-application are ERROR, or likely even PANIC.
 
 We're talking about a corrupt record (incorrect CRC, incorrect backlink 
 etc.), not errors within redo functions. During crash recovery, a 
 corrupt record means you've reached end of WAL. In standby mode, when 
 streaming WAL from master, that shouldn't happen, and it's not clear 
 what to do if it does. PANIC is not a good idea, at least if the server 
 uses hot standby, because that only makes the situation worse from 
 availability point of view. So we log the error as a WARNING, and keep 
 retrying. It's unlikely that the problem will just go away, but we keep 
 retrying anyway in the hope that it does. However, it seems that we're 
 too aggressive with the retries.

If my streaming replication stops working, I want to know about it as
soon as possible. WARNING just doesn't cut it.

This needs some better thought.

If we PANIC, then surely it will PANIC again when we restart unless we
do something. So we can't do that. But we need to do something better
than 

WARNING there is a bug that will likely cause major data loss
HINT you'll be sacked if you miss this message

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Robert Haas
On Fri, Jun 11, 2010 at 9:43 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-06-10 at 19:01 +0300, Heikki Linnakangas wrote:
 
  What warning message are we talking about?  All the error cases I can
  think of in WAL-application are ERROR, or likely even PANIC.

 We're talking about a corrupt record (incorrect CRC, incorrect backlink
 etc.), not errors within redo functions. During crash recovery, a
 corrupt record means you've reached end of WAL. In standby mode, when
 streaming WAL from master, that shouldn't happen, and it's not clear
 what to do if it does. PANIC is not a good idea, at least if the server
 uses hot standby, because that only makes the situation worse from
 availability point of view. So we log the error as a WARNING, and keep
 retrying. It's unlikely that the problem will just go away, but we keep
 retrying anyway in the hope that it does. However, it seems that we're
 too aggressive with the retries.

 If my streaming replication stops working, I want to know about it as
 soon as possible. WARNING just doesn't cut it.

 This needs some better thought.

 If we PANIC, then surely it will PANIC again when we restart unless we
 do something. So we can't do that. But we need to do something better
 than

 WARNING there is a bug that will likely cause major data loss
 HINT you'll be sacked if you miss this message

+1.  I was making this same argument (less eloquently) upthread.

I particularly like the errhint().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-11 Thread Bruce Momjian
Robert Haas wrote:
  If my streaming replication stops working, I want to know about it as
  soon as possible. WARNING just doesn't cut it.
 
  This needs some better thought.
 
  If we PANIC, then surely it will PANIC again when we restart unless we
  do something. So we can't do that. But we need to do something better
  than
 
  WARNING there is a bug that will likely cause major data loss
  HINT you'll be sacked if you miss this message
 
 +1.  I was making this same argument (less eloquently) upthread.
 I particularly like the errhint().

I am wondering what action would be most likely to get the
administrator's attention.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Robert Haas
On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com wrote:
 When an error is found in the WAL streamed from the master, a warning
 message is repeated without interval forever in the standby. This
 consumes CPU load very much, and would interfere with read-only queries.
 To fix this problem, we should add a sleep into emode_for_corrupt_record()
 or somewhere? Or we should stop walreceiver and retry to read WAL from
 pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover?  Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com wrote:
 When an error is found in the WAL streamed from the master, a warning
 message is repeated without interval forever in the standby. This
 consumes CPU load very much, and would interfere with read-only queries.
 To fix this problem, we should add a sleep into emode_for_corrupt_record()
 or somewhere? Or we should stop walreceiver and retry to read WAL from
 pg_xlog or the archive?

 I ran into this problem at one point, too, but was in the middle of
 trying to investigate a different bug and didn't have time to track
 down what was causing it.

 I think the basic question here is - if there's an error in the WAL,
 how do we expect to EVER recover?  Even if we can read from the
 archive or pg_xlog, presumably it's the same WAL - why should we be
 any more successful the second time?

What warning message are we talking about?  All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 17:38, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masaomasao.fu...@gmail.com  wrote:

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?



I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.



I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover?  Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?


What warning message are we talking about?  All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.


We're talking about a corrupt record (incorrect CRC, incorrect backlink 
etc.), not errors within redo functions. During crash recovery, a 
corrupt record means you've reached end of WAL. In standby mode, when 
streaming WAL from master, that shouldn't happen, and it's not clear 
what to do if it does. PANIC is not a good idea, at least if the server 
uses hot standby, because that only makes the situation worse from 
availability point of view. So we log the error as a WARNING, and keep 
retrying. It's unlikely that the problem will just go away, but we keep 
retrying anyway in the hope that it does. However, it seems that we're 
too aggressive with the retries.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Robert Haas
On Thu, Jun 10, 2010 at 12:01 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 We're talking about a corrupt record (incorrect CRC, incorrect backlink
 etc.), not errors within redo functions. During crash recovery, a corrupt
 record means you've reached end of WAL. In standby mode, when streaming WAL
 from master, that shouldn't happen, and it's not clear what to do if it
 does. PANIC is not a good idea, at least if the server uses hot standby,
 because that only makes the situation worse from availability point of view.
 So we log the error as a WARNING, and keep retrying. It's unlikely that the
 problem will just go away, but we keep retrying anyway in the hope that it
 does. However, it seems that we're too aggressive with the retries.

You can reproduce this problem by doing the following.

1. initdb
2. edit postgresql.conf, set wal_level=hot_standby, max_wal_senders=1;
edit pg_hba.conf, trust local replication connections
3. pg_ctl start
4. make some changes to the database
5. take a hot backup to another directory (call it pgslave)
6. create pgslave/recovery.conf with standby_mode='on',
primary_conninfo=whatever, edit pgslave/postgresql.conf change the
port number, set hot_standby=on
7. pg_ctl start -D pgslave

At this point you should have a working HS/SR setup.  Now:

8. shut the slave down
9. move recovery.conf out of the way
10. restart the slave - it will do recovery and enter normal running
11. make some database changes
12. stop the slave
13. put recovery.conf back
14. restart the slave
15. make a bunch of changes on the master

When the slave then tries to replay, you then get something like:

WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330

...ad infinitum.

Obviously there are other ways this could occur - the WAL could really
be corrupted, for example - but the current handling is not too
graceful.  I'm actually thinking it might be better to trigger a
shutdown if this happens.  Probably something has gone haywire and
manual intervention is required.  Retrying when there's no hope of
success isn't really that helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Greg Stark
On Thu, Jun 10, 2010 at 5:13 PM, Robert Haas robertmh...@gmail.com wrote:
 At this point you should have a working HS/SR setup.  Now:

 8. shut the slave down
 9. move recovery.conf out of the way
 10. restart the slave - it will do recovery and enter normal running
 11. make some database changes
 12. stop the slave
 13. put recovery.conf back
 14. restart the slave
 15. make a bunch of changes on the master

 When the slave then tries to replay, you then get something like:

 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330


Woah, why does this procedure lead to this situation? I would hope
there's nothing a user could do which would cause it short of invoking
dd to corrupt the WAL files.

At precisely which step of the procedure did the user do something
wrong? Is there any reason we can't detect that they've done it and
throw a specific error message saying the configuration is invalid?

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Robert Haas
On Thu, Jun 10, 2010 at 12:49 PM, Greg Stark gsst...@mit.edu wrote:
 On Thu, Jun 10, 2010 at 5:13 PM, Robert Haas robertmh...@gmail.com wrote:
 At this point you should have a working HS/SR setup.  Now:

 8. shut the slave down
 9. move recovery.conf out of the way
 10. restart the slave - it will do recovery and enter normal running
 11. make some database changes
 12. stop the slave
 13. put recovery.conf back
 14. restart the slave
 15. make a bunch of changes on the master

 When the slave then tries to replay, you then get something like:

 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330


 Woah, why does this procedure lead to this situation? I would hope
 there's nothing a user could do which would cause it short of invoking
 dd to corrupt the WAL files.

 At precisely which step of the procedure did the user do something
 wrong?

13.

 Is there any reason we can't detect that they've done it and
 throw a specific error message saying the configuration is invalid?

I'm not sure how we'd go about doing that, but I agree it would be nice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Fujii Masao
On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 We're talking about a corrupt record (incorrect CRC, incorrect backlink
 etc.), not errors within redo functions. During crash recovery, a corrupt
 record means you've reached end of WAL. In standby mode, when streaming WAL
 from master, that shouldn't happen, and it's not clear what to do if it
 does. PANIC is not a good idea, at least if the server uses hot standby,
 because that only makes the situation worse from availability point of view.
 So we log the error as a WARNING, and keep retrying. It's unlikely that the
 problem will just go away, but we keep retrying anyway in the hope that it
 does. However, it seems that we're too aggressive with the retries.

Right. The attached patch calms down the retries: if we found an invalid
record while streaming WAL from master, we sleep for 5 seconds (needs to
be reduced?) before retrying to replay the record which is in the same
location where the invalid one was found. Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


calm_down_retries_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] warning message in standby

2010-06-07 Thread Fujii Masao
Hi,

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers