Re: [HACKERS] Cascade replication

2011-07-19 Thread Simon Riggs
On Mon, Jul 11, 2011 at 7:28 AM, Fujii Masao masao.fu...@gmail.com wrote:

 Attached is the updated version which addresses all the issues raised by
 Simon.

Is there any reason why we disallow cascading unless hot standby is enabled?

ISTM we can just alter the postmaster path for walsenders, patch attached.

Some people might be happier if a sync standby were not HS enabled,
yet able to cascade to other standbys for reading.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


allow_cascading_without_hot_standby.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cascade replication

2011-07-19 Thread Fujii Masao
On Tue, Jul 19, 2011 at 5:58 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, Jul 11, 2011 at 7:28 AM, Fujii Masao masao.fu...@gmail.com wrote:

 Attached is the updated version which addresses all the issues raised by
 Simon.

 Is there any reason why we disallow cascading unless hot standby is enabled?

 ISTM we can just alter the postmaster path for walsenders, patch attached.

 Some people might be happier if a sync standby were not HS enabled,
 yet able to cascade to other standbys for reading.

-       return CAC_STARTUP; /* normal startup */
+   {
+       if (am_walsender)
+           return CAC_OK;
+       else
+           return CAC_STARTUP; /* normal startup */
+   }

In canAcceptConnections(), am_walsender is always false, so the above CAC_OK
is never returned. You should change ProcessStartupPacket() as follows, instead.

    switch (port->canAcceptConnections)
    {
        case CAC_STARTUP:
+           if (am_walsender)
+           {
+               port->canAcceptConnections = CAC_OK;
+               break;
+           }
            ereport(FATAL,

After I fixed the above, compiled the code, and set up a cascading replication
environment (with hot_standby disabled), I got the following assertion failure:

TRAP: FailedAssertion("!(slot > 0 && slot <= PMSignalState->num_child_flags)", File: "pmsignal.c", Line: 227)

So we would still have some code to change.
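
The ordering problem described above can be sketched in isolation. What follows is a simplified, hypothetical model, not PostgreSQL source: the enum is only a subset of the connection states named in this thread, and `resolve_startup_state()` is an invented helper standing in for the proposed branch in ProcessStartupPacket(). It assumes the walsender flag is only known once the startup packet has been read, which would explain why the check cannot live in canAcceptConnections().

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of the connection-state values named in this thread. */
typedef enum
{
    CAC_OK,
    CAC_STARTUP
} CacState;

/*
 * Hypothetical stand-in for the proposed branch: the postmaster's earlier
 * verdict is passed in, and a CAC_STARTUP verdict is relaxed to CAC_OK
 * only when the connection turns out to be a walsender.
 */
CacState
resolve_startup_state(CacState from_postmaster, bool am_walsender)
{
    if (from_postmaster == CAC_STARTUP && am_walsender)
        return CAC_OK;
    return from_postmaster;
}
```

This only models the decision itself. As reported above, making that change alone still tripped an assertion in pmsignal.c, so it is not a complete fix.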

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Cascade replication

2011-07-19 Thread Simon Riggs
On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao masao.fu...@gmail.com wrote:

 So we would still have some code to change.

Sigh, yes, of course.

The question was whether there is any reason we need to disallow cascading?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-19 Thread Fujii Masao
On Tue, Jul 19, 2011 at 9:09 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao masao.fu...@gmail.com wrote:

 So we would still have some code to change.

 Sigh, yes, of course.

 The question was whether there is any reason we need to disallow cascading?

No, at least I have no clear reason for now.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Cascade replication

2011-07-19 Thread Simon Riggs
On Tue, Jul 19, 2011 at 1:38 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Jul 19, 2011 at 9:09 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao masao.fu...@gmail.com
 wrote:

 So we would still have some code to change.

 Sigh, yes, of course.

 The question was whether there is any reason we need to disallow
 cascading?

 No, at least I have no clear reason for now.

I'll work up a proper patch. Thanks for your earlier review,

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-11 Thread Fujii Masao
On Mon, Jul 11, 2011 at 10:26 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jul 11, 2011 at 3:30 AM, Josh Berkus j...@agliodbs.com wrote:
 Do you think you'll submit a new version of the patch this commitfest?

 Yes. I'm now updating the patch according to Simon's comments.
 I will submit it today.

Attached is the updated version which addresses all the issues raised by Simon.

 The risk you describe already exists in current code.

 I regard it as a non-risk. The unlink() and the rename() are executed
 consecutively, so the gap between them is small, so the chance of a
 SIGKILL in that gap at the same time as losing the archive seems low,
 and we can always get that file from the master again if we are
 streaming. Any code you add to fix this will get executed so rarely
 it probably won't work when we need it to.

 In the current scheme we restart archiving from the last restartpoint,
 which exists only on the archive. This new patch improves upon this by
 keeping the most recent files locally, so we are less exposed in the
 case of archive unavailability. So this patch already improves things
 and we don't need any more than that. No extra code please, IMHO.

Yes, I added no extra code for the risk I raised upthread.

 In #2, there is another problem; walsender might have the pre-existing file
 open, so the startup process would need to request walsenders to close the
 file before removing (or renaming) it, wait for new file to appear and open it
 again.

I implemented this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1949,1954  SET ENABLE_SEQSCAN TO OFF;
--- 1949,1956 
The values of these parameters on standby servers are irrelevant,
although you may wish to set them there in preparation for the
possibility of a standby becoming the master.
+   Some of them need to be set in the standby for cascade replication
+   (see <xref linkend="cascade-replication">).
   </para>
  
   <variablelist>
***
*** 2019,2025  SET ENABLE_SEQSCAN TO OFF;
  doesn't keep any extra segments for standby purposes, so the number
  of old WAL segments available to standby servers is a function of
  the location of the previous checkpoint and status of WAL
! archiving.  This parameter has no effect on restartpoints.
  This parameter can only be set in the
  <filename>postgresql.conf</> file or on the server command line.
 </para>
--- 2021,2027 
  doesn't keep any extra segments for standby purposes, so the number
  of old WAL segments available to standby servers is a function of
  the location of the previous checkpoint and status of WAL
! archiving.
  This parameter can only be set in the
  <filename>postgresql.conf</> file or on the server command line.
 </para>
***
*** 2121,2127  SET ENABLE_SEQSCAN TO OFF;
  synchronous replication is enabled, individual transactions can be
  configured not to wait for replication by setting the
  <xref linkend="guc-synchronous-commit"> parameter to
! <literal>local</> or <literal>off</>.
 </para>
 <para>
  This parameter can only be set in the <filename>postgresql.conf</>
--- 2123,2130 
  synchronous replication is enabled, individual transactions can be
  configured not to wait for replication by setting the
  <xref linkend="guc-synchronous-commit"> parameter to
! <literal>local</> or <literal>off</>. This parameter has no effect on
! cascade replication.
 </para>
 <para>
  This parameter can only be set in the <filename>postgresql.conf</>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 877,884  primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 877,921 
   network delay, or that the standby is under heavy load.
  </para>
 </sect3>
+   </sect2>
+ 
+   <sect2 id="cascade-replication">
+    <title>Cascade Replication</title>
  
+    <indexterm zone="high-availability">
+     <primary>Cascade Replication</primary>
+    </indexterm>
+    <para>
+ Cascade replication feature allows the standby to accept the replication
+ connections and stream WAL records to another standbys. This is useful
+ for reducing the number of standbys connecting to the master and reducing
+ the overhead of the master, when you have many standbys.
+    </para>
+    <para>
+ The cascading standby sends not only WAL records received from the
+ master but also those restored from the archive. So even if the replication
+ connection in higher level is terminated, you can continue cascade replication.
+    </para>
+    <para>
+ Cascade replication is asynchronous. Note that synchronous replication
+ (see xref 

Re: [HACKERS] Cascade replication

2011-07-10 Thread Josh Berkus
Fujii,

 In the current scheme we restart archiving from the last restartpoint,
 which exists only on the archive. This new patch improves upon this by
 keeping the most recent files locally, so we are less exposed in the
 case of archive unavailability. So this patch already improves things
 and we don't need any more than that. No extra code please, IMHO.

Do you think you'll submit a new version of the patch this commitfest?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] Cascade replication

2011-07-10 Thread Fujii Masao
On Mon, Jul 11, 2011 at 3:30 AM, Josh Berkus j...@agliodbs.com wrote:
 Fujii,

 In the current scheme we restart archiving from the last restartpoint,
 which exists only on the archive. This new patch improves upon this by
 keeping the most recent files locally, so we are less exposed in the
 case of archive unavailability. So this patch already improves things
 and we don't need any more than that. No extra code please, IMHO.

 Do you think you'll submit a new version of the patch this commitfest?

Yes. I'm now updating the patch according to Simon's comments.
I will submit it today.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Cascade replication

2011-07-06 Thread Fujii Masao
On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs si...@2ndquadrant.com wrote:
 1. De-archive the file to RECOVERYXLOG
 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename
    RECOVERYXLOG to the correct name
 3. Replay the file with the correct name

 Yes please, that makes sense.

Will do.

 Those changes will make this code cleaner for the long term.

 I don't think we should simply shutdown a WALSender when we startup.
 That is indistinguishable from a failure, which is going to be very
 worrying if we do a switchover. Is there another way to do this? Or if
 not, at least a log message to explain it was normal that we requested
 this.

 What about outputing something like the following message in that case?

    if (walsender receives SIGUSR2)
        ereport(LOG, terminating walsender process due to
 administrator command);

 ...which doesn't explain the situation because we don't know why
 SIGUSR2 was sent.

 I was thinking of something like this

 LOG:  requesting walsenders for cascading replication reconnect to
 update timeline

Looks better than my proposal.

 but then I ask: Why not simply send a new message type saying new
 timeline id is X and that way we don't need to restart the connection
 at all?

Yeah, that's very useful. But I'd like to implement that as a separate patch.

I'm thinking that in that case walsender should send the timeline history file,
and walreceiver should write it to disk, instead of just sending the
timeline ID. Otherwise, when the standby on the receiving side restarts, it
cannot calculate the latest timeline ID correctly.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Cascade replication

2011-07-06 Thread Simon Riggs
On Wed, Jul 6, 2011 at 8:53 AM, Fujii Masao masao.fu...@gmail.com wrote:

 What about outputing something like the following message in that case?

    if (walsender receives SIGUSR2)
        ereport(LOG, terminating walsender process due to
 administrator command);

 ...which doesn't explain the situation because we don't know why
 SIGUSR2 was sent.

 I was thinking of something like this

 LOG:  requesting walsenders for cascading replication reconnect to
 update timeline

 Looks better than my proposal.

 but then I ask: Why not simply send a new message type saying new
 timeline id is X and that way we don't need to restart the connection
 at all?

 Yeah, that's very useful. But I'd like to implement that as a separate patch.

 I'm thinking that in that case walsender should send the timeline history file
 and walreceiver should write it down to the disk, instead of just sending
 timeline ID. Otherwise, when the standby in receive side restarts, it cannot
 calculate the latest timeline ID correctly.

OK, happy to do that part as an additional patch. That way we can get
this committed soon - in this CF.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-06 Thread Fujii Masao
On Wed, Jul 6, 2011 at 4:53 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs si...@2ndquadrant.com wrote:
 1. De-archive the file to RECOVERYXLOG
 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename
    RECOVERYXLOG to the correct name
 3. Replay the file with the correct name

 Yes please, that makes sense.

In #2, if the server is killed with SIGKILL just after removing a pre-existing
file and before renaming RECOVERYXLOG, we lose the file with the correct name.
Even in this case, we would be able to restore it from the archive, but what if
the archive is unfortunately unavailable? We would lose the file permanently.
So should we introduce the following safeguard?

2'. If RECOVERYXLOG is valid, move a pre-existing file to pg_xlog/backup,
    rename RECOVERYXLOG to the correct name, and remove the pre-existing
    file from pg_xlog/backup

    Currently we give up recovery if the target file is in neither the
    archive nor pg_xlog. But if we adopt the above safeguard, in that case
    we should also try to read the file from pg_xlog/backup.

In #2, there is another problem: walsender might have the pre-existing file
open, so the startup process would need to request walsenders to close the
file before removing (or renaming) it, wait for the new file to appear, and
open it again. This might make the code complicated. Does anyone have a better
approach?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Cascade replication

2011-07-06 Thread Simon Riggs
On Wed, Jul 6, 2011 at 12:27 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Jul 6, 2011 at 4:53 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs si...@2ndquadrant.com wrote:
 1. De-archive the file to RECOVERYXLOG
 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename
    RECOVERYXLOG to the correct name
 3. Replay the file with the correct name

 Yes please, that makes sense.

 In #2, if the server is killed with SIGKILL just after removing a pre-existing
 file and before renaming RECOVERYXLOG, we lose the file with correct name.
 Even in this case, we would be able to restore it from the archive, but what 
 if
 unfortunately the archive is unavailable? We would lose the file infinitely. 
 So
 we should introduce the following safeguard?

    2'. If RECOVERYXLOG is valid, move a pre-existing file to pg_xlog/backup,
        rename RECOVERYXLOG to the correct name, and remove the pre-existing
        file from pg_xlog/backup

        Currently we give up a recovery if there is the target file in
 neither the
        archive nor pg_xlog. But, if we adopt the above safeguard, in that 
 case,
        we should try to read the file from also pg_xlog/backup.

 In #2, there is another problem; walsender might have the pre-existing file
 open, so the startup process would need to request walsenders to close the
 file before removing (or renaming) it, wait for new file to appear and open it
 again. This might make the code complicated. Does anyone have better
 approach?

The risk you describe already exists in current code.

I regard it as a non-risk. The unlink() and the rename() are executed
consecutively, so the gap between them is small, so the chance of a
SIGKILL in that gap at the same time as losing the archive seems low,
and we can always get that file from the master again if we are
streaming. Any code you add to fix this will get executed so rarely
it probably won't work when we need it to.

In the current scheme we restart archiving from the last restartpoint,
which exists only on the archive. This new patch improves upon this by
keeping the most recent files locally, so we are less exposed in the
case of archive unavailability. So this patch already improves things
and we don't need any more than that. No extra code please, IMHO.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-05 Thread Simon Riggs
On Tue, Jul 5, 2011 at 4:34 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jul 4, 2011 at 6:24 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao masao.fu...@gmail.com wrote:

 The standby must not accept replication connection from that standby 
 itself.
 Otherwise, since any new WAL data would not appear in that standby,
 replication cannot advance any more. As a safeguard against this, I 
 introduced
 new ID to identify each instance. The walsender sends that ID as the fourth
 field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether
 the IDs are the same between two servers. If they are the same, which means
 that the standby is just connecting to that standby itself, so walreceiver
 emits ERROR.

 Thanks for waiting for review.

 Thanks for the review!

 I agree to focus on the main problem first. I removed that. Attached
 is the updated version.

Now for the rest of the review...

I'd rather not include another chunk of code related to
wal_keep_segments. The existing code in CreateCheckPoint() should be
refactored so that we call the same code from both CreateCheckPoint()
and CreateRestartPoint().

IMHO it's time to get rid of RECOVERYXLOG as an initial target for
de-archived files. That made sense once, but now we have streaming it
makes more sense for us to de-archive straight onto the correct file
name and let the file be cleaned up later. So de-archiving it and then
copying to the new location doesn't seem the right thing to do
(especially not to copy rather than rename). RECOVERYXLOG allowed us
to de-archive the file without removing a pre-existing file, so we
must handle that still - the current patch would fail if a
pre-existing WAL file were there.

Those changes will make this code cleaner for the long term.

I don't think we should simply shutdown a WALSender when we startup.
That is indistinguishable from a failure, which is going to be very
worrying if we do a switchover. Is there another way to do this? Or if
not, at least a log message to explain it was normal that we requested
this.

It would be possible to have synchronous cascaded replication but that
is probably another patch :-)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-05 Thread Fujii Masao
On Tue, Jul 5, 2011 at 8:08 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Now for the rest of the review...

Thanks!

 I'd rather not include another chunk of code related to
 wal_keep_segments. The existing code in CreateCheckPoint() should be
 refactored so that we call the same code from both CreateCheckPoint()
 and CreateRestartPoint().

This makes sense. Will do.

 IMHO it's time to get rid of RECOVERYXLOG as an initial target for
 de-archived files. That made sense once, but now we have streaming it
 makes more sense for us to de-archive straight onto the correct file
 name and let the file be cleaned up later. So de-archiving it and then
 copying to the new location doesn't seem the right thing to do
 (especially not to copy rather than rename). RECOVERYXLOG allowed us
 to de-archive the file without removing a pre-existing file, so we
 must handle that still - the current patch would fail if a
 pre-existing WAL file were there.

You mean that we must keep a pre-existing file? If so, why do we need to
do that? Since it's older than an archived file, it seems OK to overwrite
it with an archived file. Or is there a corner case in which an archived file
is older than a pre-existing one?

If we don't need to keep a pre-existing file, I'll change the way we de-archive
according to your suggestion, as follows:

1. Rename a pre-existing file to EXISTINGXLOG
2. De-archive the file onto the correct name
3. If the de-archived file is invalid (i.e., its size is not 16MB),
   remove it and rename EXISTINGXLOG to the correct name
4. If the de-archived file is valid, remove EXISTINGXLOG
5. Replay the file with the correct name

Or

1. De-archive the file to RECOVERYXLOG
2. If RECOVERYXLOG is valid, remove a pre-existing one and rename
   RECOVERYXLOG to the correct name
3. Replay the file with the correct name
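
For illustration, step 2 of the second sequence might look roughly like the sketch below. It uses only POSIX calls; `install_restored_segment()` and its parameters are hypothetical names, not functions from the patch, and "valid" is reduced to the size check mentioned above (a caller would pass 16MB as `expected_size`).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move a de-archived file (tmp_path, e.g. RECOVERYXLOG)
 * onto its correct segment name (final_path).
 * Returns 0 on success, -1 if the restored file is invalid or a call fails.
 */
int
install_restored_segment(const char *tmp_path, const char *final_path,
                         off_t expected_size)
{
    struct stat st;

    /* Validity check: the restored file must exist and have the full size. */
    if (stat(tmp_path, &st) != 0 || st.st_size != expected_size)
    {
        unlink(tmp_path);       /* discard it; any pre-existing file is kept */
        return -1;
    }

    /* Remove a pre-existing file; it is fine if there was none. */
    if (unlink(final_path) != 0 && errno != ENOENT)
        return -1;

    /* Give the restored file its correct name. */
    return rename(tmp_path, final_path);
}
```

On POSIX systems rename() already replaces an existing destination atomically, so the explicit unlink() mirrors the steps as listed rather than being strictly required there.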

 Those changes will make this code cleaner for the long term.

 I don't think we should simply shutdown a WALSender when we startup.
 That is indistinguishable from a failure, which is going to be very
 worrying if we do a switchover. Is there another way to do this? Or if
 not, at least a log message to explain it was normal that we requested
 this.

What about outputting something like the following message in that case?

    if (walsender receives SIGUSR2)
        ereport(LOG, "terminating walsender process due to administrator command");

 It would be possible to have synchronous cascaded replication but that
 is probably another patch :-)

Yeah, right. You'll try? ;)

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Cascade replication

2011-07-05 Thread Simon Riggs
On Wed, Jul 6, 2011 at 2:13 AM, Fujii Masao masao.fu...@gmail.com wrote:

 IMHO it's time to get rid of RECOVERYXLOG as an initial target for
 de-archived files. That made sense once, but now we have streaming it
 makes more sense for us to de-archive straight onto the correct file
 name and let the file be cleaned up later. So de-archiving it and then
 copying to the new location doesn't seem the right thing to do
 (especially not to copy rather than rename). RECOVERYXLOG allowed us
 to de-archive the file without removing a pre-existing file, so we
 must handle that still - the current patch would fail if a
 pre-existing WAL file were there.


<snip>

 If we don't need to keep a pre-existing file, I'll change the way to 
 de-archive
 according to your suggestion, as follows;

 1. Rename a pre-existing file to EXISTINGXLOG
 2. De-archive the file onto the correct name
 3. If the de-archived file is invalid (i.e., its size is not 16MB),
 remove it and
   rename EXISTINGXLOG to the correct name
 4. If the de-archived file is valid, remove EXISTINGXLOG
 5. Replay the file with the correct name

I'm laughing quite hard here... :-)

 Or

 1. De-archive the file to RECOVERYXLOG
 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename
    RECOVERYXLOG to the correct name
 3. Replay the file with the correct name

Yes please, that makes sense.

 Those changes will make this code cleaner for the long term.

 I don't think we should simply shutdown a WALSender when we startup.
 That is indistinguishable from a failure, which is going to be very
 worrying if we do a switchover. Is there another way to do this? Or if
 not, at least a log message to explain it was normal that we requested
 this.

 What about outputing something like the following message in that case?

    if (walsender receives SIGUSR2)
        ereport(LOG, terminating walsender process due to
 administrator command);

...which doesn't explain the situation because we don't know why
SIGUSR2 was sent.

I was thinking of something like this

LOG:  requesting walsenders for cascading replication reconnect to
update timeline

but then I ask: Why not simply send a new message type saying new
timeline id is X and that way we don't need to restart the connection
at all?

 It would be possible to have synchronous cascaded replication but that
 is probably another patch :-)

 Yeah, right. You'll try? ;)

I'll wait for a request...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-04 Thread Simon Riggs
On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao masao.fu...@gmail.com wrote:

 The standby must not accept replication connection from that standby itself.
 Otherwise, since any new WAL data would not appear in that standby,
 replication cannot advance any more. As a safeguard against this, I 
 introduced
 new ID to identify each instance. The walsender sends that ID as the fourth
 field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether
 the IDs are the same between two servers. If they are the same, which means
 that the standby is just connecting to that standby itself, so walreceiver
 emits ERROR.

Thanks for waiting for review.

This part of the patch is troubling me. I think you have identified an
important problem, but this solution doesn't work fully.

If we allow standbys to connect to other standbys then we have
problems with standbys not being connected to master. This can occur
with a 1-step connection, as you point out, but it could also occur
with a 2-step, 3-step or more connection, where a circle of standbys
are all depending upon each other. Your solution only works for 1-step
connections. Solving that problem in a general sense might be more
dangerous than leaving it alone. I think we should think some more
about the issues there and approach them as a separate problem.
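
To make the 1-step versus multi-step distinction concrete, here is a toy model of the topology. It is only a sketch: the `upstream` array (each node's upstream, or `NO_UPSTREAM` for the master) is a global view that no real standby has, and both function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_UPSTREAM (-1)        /* marks the master: it streams from nobody */

/* The 1-step case: a node whose upstream is the node itself. */
bool
connects_to_itself(const int *upstream, int node)
{
    return upstream[node] == node;
}

/*
 * Follow upstream links from node. With n nodes, a chain that ends at the
 * master is found within n steps; anything longer must be a loop.
 */
bool
reaches_master(const int *upstream, int n, int node)
{
    for (int hops = 0; hops <= n; hops++)
    {
        if (upstream[node] == NO_UPSTREAM)
            return true;
        node = upstream[node];
    }
    return false;
}
```

With `upstream = {NO_UPSTREAM, 2, 1}`, nodes 1 and 2 feed each other: neither connects to itself, so a same-ID comparison between the two ends of one connection raises no error, yet neither can ever reach the master.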

I think we should remove that and just focus on the main problem, for
now. That will make it a simpler patch and easier to commit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Cascade replication

2011-07-04 Thread Fujii Masao
On Mon, Jul 4, 2011 at 6:24 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao masao.fu...@gmail.com wrote:

 The standby must not accept replication connection from that standby itself.
 Otherwise, since any new WAL data would not appear in that standby,
 replication cannot advance any more. As a safeguard against this, I 
 introduced
 new ID to identify each instance. The walsender sends that ID as the fourth
 field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether
 the IDs are the same between two servers. If they are the same, which means
 that the standby is just connecting to that standby itself, so walreceiver
 emits ERROR.

 Thanks for waiting for review.

Thanks for the review!

 This part of the patch is troubling me. I think you have identified an
 important problem, but this solution doesn't work fully.

 If we allow standbys to connect to other standbys then we have
 problems with standbys not being connected to master. This can occur
 with a 1-step connection, as you point out, but it could also occur
 with a 2-step, 3-step or more connection, where a circle of standbys
 are all depending upon each other. Your solution only works for 1-step
 connections. Solving that problem in a general sense might be more
 dangerous than leaving it alone. I think we should think some more
 about the issues there and approach them as a separate problem.

 I think we should remove that and just focus on the main problem, for
 now. That will make it a simpler patch and easier to commit.

I agree to focus on the main problem first. I removed that. Attached
is the updated version.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1998,2004  SET ENABLE_SEQSCAN TO OFF;
  doesn't keep any extra segments for standby purposes, and the number
  of old WAL segments available to standby servers is a function of
  the location of the previous checkpoint and status of WAL
! archiving.  This parameter has no effect on restartpoints.
  This parameter can only be set in the
  <filename>postgresql.conf</> file or on the server command line.
 </para>
--- 1998,2004 
  doesn't keep any extra segments for standby purposes, and the number
  of old WAL segments available to standby servers is a function of
  the location of the previous checkpoint and status of WAL
! archiving.
  This parameter can only be set in the
  <filename>postgresql.conf</> file or on the server command line.
 </para>
***
*** 2105,2111  SET ENABLE_SEQSCAN TO OFF;
  synchronous replication is enabled, individual transactions can be
  configured not to wait for replication by setting the
  <xref linkend="guc-synchronous-commit"> parameter to
! <literal>local</> or <literal>off</>.
 </para>
 <para>
  This parameter can only be set in the <filename>postgresql.conf</>
--- 2105,2112 
  synchronous replication is enabled, individual transactions can be
  configured not to wait for replication by setting the
  <xref linkend="guc-synchronous-commit"> parameter to
! <literal>local</> or <literal>off</>. This parameter has no effect on
! cascade replication.
 </para>
 <para>
  This parameter can only be set in the <filename>postgresql.conf</>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 877,884  primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 877,921 
   network delay, or that the standby is under heavy load.
  </para>
 </sect3>
+   </sect2>
+ 
+   <sect2 id="cascade-replication">
+    <title>Cascade Replication</title>
  
+    <indexterm zone="high-availability">
+     <primary>Cascade Replication</primary>
+    </indexterm>
+    <para>
+ Cascade replication feature allows the standby to accept the replication
+ connections and stream WAL records to another standbys. This is useful
+ for reducing the number of standbys connecting to the master and reducing
+ the overhead of the master, when you have many standbys.
+    </para>
+    <para>
+ The cascading standby sends not only WAL records received from the
+ master but also those restored from the archive. So even if the replication
+ connection in higher level is terminated, you can continue cascade replication.
+    </para>
+    <para>
+ Cascade replication is asynchronous. Note that synchronous replication
+ (see <xref linkend="synchronous-replication">) has no effect on cascade
+ replication.
+    </para>
+    <para>
+ Promoting the cascading standby terminates all the cascade replication
+ connections which it uses. This is because the timeline becomes different
+ between standbys, and they cannot 

[HACKERS] Cascade replication (WIP)

2011-05-24 Thread Fujii Masao
Hi,

I'd like to propose a cascade replication feature (i.e., allowing a standby
to accept replication connections from another standby) for 9.2. This feature
reduces the overhead on the master, because it decreases the number of
standbys connecting directly to the master.

I attached the WIP patch, which changes walsender so that it starts replication
even during recovery. The walsender then attempts to send all WAL that has
already been fsync'd to the standby's disk (i.e., WAL up to the larger of the
receive location and the replay location). When the standby is promoted, all
walsenders in that standby exit, because the timeline mismatch prevents them
from continuing replication.
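
As a minimal sketch of the "larger of the receive location and the replay
location" rule described above: the type WalPos and the function names below
are hypothetical stand-ins for illustration, not the identifiers used in the
patch, and a WAL location is assumed to be a (log id, byte offset) pair.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL location: log file id plus byte offset. */
typedef struct WalPos
{
	uint32_t	logid;
	uint32_t	offset;
} WalPos;

/* True if a is strictly before b in WAL order. */
static bool
wal_pos_lt(WalPos a, WalPos b)
{
	return a.logid < b.logid ||
		(a.logid == b.logid && a.offset < b.offset);
}

/*
 * Upper bound of what a cascading walsender may send: whichever of the
 * received and replayed positions is further ahead.
 */
static WalPos
standby_send_limit(WalPos received, WalPos replayed)
{
	return wal_pos_lt(received, replayed) ? replayed : received;
}
```

The replay position can be ahead of the receive position when WAL was
restored from the archive rather than streamed, which is why both are
compared.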

A standby must not accept a replication connection from itself. Otherwise,
since no new WAL data would ever appear in that standby, replication could
not advance. As a safeguard against this, I introduced a new ID to identify
each instance. The walsender sends that ID as the fourth field of the reply to
IDENTIFY_SYSTEM, and walreceiver then checks whether the IDs of the two
servers are the same. If they are, the standby is connecting to itself, so
walreceiver emits an ERROR.
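
The safeguard above could be sketched roughly as follows. This is only an
illustration: the struct IdentifyReply, its field names, and
connects_to_self() are hypothetical, and the only thing taken from the
proposal is that the instance ID travels as the fourth field of the
IDENTIFY_SYSTEM reply.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical view of an IDENTIFY_SYSTEM reply, with the proposed
 * identification key as the fourth field. All fields are kept as text.
 */
typedef struct IdentifyReply
{
	const char *systemid;
	const char *timeline;
	const char *xlogpos;
	const char *instance_key;	/* proposed fourth field */
} IdentifyReply;

/*
 * Return true if the upstream server reports the same instance key as
 * the local server, i.e. the standby is connecting to itself.
 */
static bool
connects_to_self(const char *local_key, const IdentifyReply *reply)
{
	return strcmp(local_key, reply->instance_key) == 0;
}
```

A walreceiver-like caller would raise an error and give up when this check
returns true, instead of starting to stream.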

One remaining problem which I'll have to tackle: even while walreceiver
is not running (i.e., the startup process is retrieving WAL files from the
archive), the cascading walsender should continue to send new WAL data.
This means that the walsender should send the WAL files restored from the
archive. The problem is that the name of such a restored WAL file is always
RECOVERYXLOG, and for now walsender cannot handle a WAL file with that
name.

To address the above problem, I'm thinking of making the startup process
restore the WAL file under its real name instead of RECOVERYXLOG. Then, as on
the master, the walsender can read and send the restored WAL file. A required
WAL file could still be recycled before being sent, so we might need to honor
the wal_keep_segments setting even on the standby.
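
To illustrate what a "real name" looks like, here is a small sketch that
builds a segment file name from a timeline ID, a log ID, and a segment number,
each printed as eight uppercase hex digits. The function name
wal_segment_name() is hypothetical and is not part of the patch; it only
assumes the usual 24-character WAL file naming.

```c
#include <stdint.h>
#include <stdio.h>

/* 24 hex digits plus the terminating NUL. */
#define WAL_NAME_BUFLEN 25

/*
 * Format a WAL segment file name as TTTTTTTTLLLLLLLLSSSSSSSS, where each
 * group is an eight-digit uppercase hex number (timeline, log, segment).
 */
static void
wal_segment_name(char *buf, size_t buflen,
				 uint32_t timeline, uint32_t logid, uint32_t segno)
{
	snprintf(buf, buflen, "%08X%08X%08X",
			 (unsigned int) timeline,
			 (unsigned int) logid,
			 (unsigned int) segno);
}
```

With a name like this, a restored file is indistinguishable from a streamed
one to anything that looks segments up by name, which is the property the
walsender would rely on.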

Comments? Objections?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***
*** 1357,1362  The commands accepted in walsender mode are:
--- 1357,1374 
/listitem
/varlistentry
  
+   varlistentry
+   term
+identificationkey
+   /term
+   listitem
+   para
+Identification key. Also useful to check that the standby is
+not connecting to that standby itself.
+   /para
+   /listitem
+   /varlistentry
+ 
/variablelist
   /para
  /listitem
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 9551,9556  GetXLogReplayRecPtr(void)
--- 9551,9572 
  }
  
  /*
+  * Get current standby flush position, ie, the last WAL position
+  * known to be fsync'd to disk in standby.
+  */
+ XLogRecPtr
+ GetStandbyFlushRecPtr(void)
+ {
+ 	XLogRecPtr	recvptr;
+ 	XLogRecPtr	redoptr;
+ 
+ 	recvptr = GetWalRcvWriteRecPtr(NULL);
+ 	redoptr = GetXLogReplayRecPtr();
+ 
+ 	return XLByteLT(recvptr, redoptr) ? redoptr : recvptr;
+ }
+ 
+ /*
   * Report the last WAL replay location (same format as pg_start_backup etc)
   *
   * This is useful for determining how much of WAL is visible to read-only
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***
*** 351,357  static void processCancelRequest(Port *port, void *pkt);
  static int	initMasks(fd_set *rmask);
  static void report_fork_failure_to_client(Port *port, int errnum);
  static CAC_state canAcceptConnections(void);
- static long PostmasterRandom(void);
  static void RandomSalt(char *md5Salt);
  static void signal_child(pid_t pid, int signal);
  static bool SignalSomeChildren(int signal, int targets);
--- 351,356 
***
*** 2410,2415  reaper(SIGNAL_ARGS)
--- 2409,2423 
  			pmState = PM_RUN;
  
  			/*
+ 			 * Kill the cascading walsender to urge the cascaded standby to
+ 			 * reread the timeline history file, adjust its timeline and
+ 			 * establish replication connection again. This is required
+ 			 * because the timeline of cascading standby is not consistent
+ 			 * with that of cascaded one just after failover.
+ 			 */
+ 			SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND);
+ 
+ 			/*
  			 * Crank up the background writer, if we didn't do that already
  			 * when we entered consistent recovery state.  It doesn't matter
  			 * if this fails, we'll just try again later.
***
*** 4369,4375  RandomSalt(char *md5Salt)
  /*
   * PostmasterRandom
   */
! static long
  PostmasterRandom(void)
  {
  	/*
--- 4377,4383 
  /*
   * PostmasterRandom
   */
! long
  PostmasterRandom(void)
  {
  	/*
*** a/src/backend/replication/basebackup.c
---