Re: [HACKERS] Cascade replication
On Mon, Jul 11, 2011 at 7:28 AM, Fujii Masao masao.fu...@gmail.com wrote: Attached is the updated version which addresses all the issues raised by Simon. Is there any reason why we disallow cascading unless hot standby is enabled? ISTM we can just alter the postmaster path for walsenders, patch attached. Some people might be happier if a sync standby were not HS enabled, yet able to cascade to other standbys for reading. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services allow_cascading_without_hot_standby.v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jul 19, 2011 at 5:58 PM, Simon Riggs si...@2ndquadrant.com wrote: On Mon, Jul 11, 2011 at 7:28 AM, Fujii Masao masao.fu...@gmail.com wrote: Attached is the updated version which addresses all the issues raised by Simon. Is there any reason why we disallow cascading unless hot standby is enabled? ISTM we can just alter the postmaster path for walsenders, patch attached. Some people might be happier if a sync standby were not HS enabled, yet able to cascade to other standbys for reading. - return CAC_STARTUP; /* normal startup */ + { + if (am_walsender) + return CAC_OK; + else + return CAC_STARTUP; /* normal startup */ + } In canAcceptConnections(), am_walsender is always false, so the above CAC_OK is never returned. You should change ProcessStartupPacket() as follows, instead. switch (port-canAcceptConnections) { case CAC_STARTUP: + if (am_walsender) + { + port-canAcceptConnections = CAC_OK; + break; + } ereport(FATAL, When I fixed the above, compile the code and set up the cascading replication environment (disable hot_standby), I got the following assertion error: TRAP: FailedAssertion(!(slot 0 slot = PMSignalState-num_child_flags), File: pmsignal.c, Line: 227) So we would still have some code to change. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao masao.fu...@gmail.com wrote: So we would still have some code to change. Sigh, yes, of course. The question was whether there is any reason we need to disallow cascading? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jul 19, 2011 at 9:09 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao masao.fu...@gmail.com wrote: So we would still have some code to change. Sigh, yes, of course. The question was whether there is any reason we need to disallow cascading? No, at least I have no clear reason for now. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jul 19, 2011 at 1:38 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Jul 19, 2011 at 9:09 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao masao.fu...@gmail.com wrote: So we would still have some code to change. Sigh, yes, of course. The question was whether there is any reason we need to disallow cascading? No, at least I have no clear reason for now. I'll work up a proper patch. Thanks for your earlier review, -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Mon, Jul 11, 2011 at 10:26 AM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Jul 11, 2011 at 3:30 AM, Josh Berkus j...@agliodbs.com wrote: Do you think you'll submit a new version of the patch this commitfest? Yes. I'm now updating the patch according to Simon's comments. I will submit it today. Attached is the updated version which addresses all the issues raised by Simon. The risk you describe already exists in current code. I regard it as a non-risk. The unlink() and the rename() are executed consecutively, so the gap between them is small, so the chance of a SIGKILL in that gap at the same time as losing the archive seems low, and we can always get that file from the master again if we are streaming. Any code you add to fix this will get executed so rarely it probably won't work when we need it to. In the current scheme we restart archiving from the last restartpoint, which exists only on the archive. This new patch improves upon this by keeping the most recent files locally, so we are less expose in the case of archive unavailability. So this patch already improves things and we don't need any more than that. No extra code please, IMHO. Yes, I added no extra code for the risk I raised upthread. In #2, there is another problem; walsender might have the pre-existing file open, so the startup process would need to request walsenders to close the file before removing (or renaming) it, wait for new file to appear and open it again. I implemented this. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1949,1954 SET ENABLE_SEQSCAN TO OFF; --- 1949,1956 The values of these parameters on standby servers are irrelevant, although you may wish to set them there in preparation for the possibility of a standby becoming the master. + Some of them need to be set in the standby for cascade replication + (see xref linkend=cascade-replication). /para variablelist *** *** 2019,2025 SET ENABLE_SEQSCAN TO OFF; doesn't keep any extra segments for standby purposes, so the number of old WAL segments available to standby servers is a function of the location of the previous checkpoint and status of WAL ! archiving. This parameter has no effect on restartpoints. This parameter can only be set in the filenamepostgresql.conf/ file or on the server command line. /para --- 2021,2027 doesn't keep any extra segments for standby purposes, so the number of old WAL segments available to standby servers is a function of the location of the previous checkpoint and status of WAL ! archiving. This parameter can only be set in the filenamepostgresql.conf/ file or on the server command line. /para *** *** 2121,2127 SET ENABLE_SEQSCAN TO OFF; synchronous replication is enabled, individual transactions can be configured not to wait for replication by setting the xref linkend=guc-synchronous-commit parameter to ! literallocal/ or literaloff/. /para para This parameter can only be set in the filenamepostgresql.conf/ --- 2123,2130 synchronous replication is enabled, individual transactions can be configured not to wait for replication by setting the xref linkend=guc-synchronous-commit parameter to ! literallocal/ or literaloff/. This parameter has no effect on ! cascade replication. /para para This parameter can only be set in the filenamepostgresql.conf/ *** a/doc/src/sgml/high-availability.sgml --- b/doc/src/sgml/high-availability.sgml *** *** 877,884 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' --- 877,921 network delay, or that the standby is under heavy load. /para /sect3 + /sect2 + + sect2 id=cascade-replication +titleCascade Replication/title +indexterm zone=high-availability + primaryCascade Replication/primary +/indexterm +para + Cascade replication feature allows the standby to accept the replication + connections and stream WAL records to another standbys. This is useful + for reducing the number of standbys connecting to the master and reducing + the overhead of the master, when you have many standbys. +/para +para + The cascading standby sends not only WAL records received from the + master but also those restored from the archive. So even if the replication + connection in higher level is terminated, you can continue cascade replication. +/para +para + Cascade replication is asynchronous. Note that synchronous replication + (see xref
Re: [HACKERS] Cascade replication
Fujii, In the current scheme we restart archiving from the last restartpoint, which exists only on the archive. This new patch improves upon this by keeping the most recent files locally, so we are less expose in the case of archive unavailability. So this patch already improves things and we don't need any more than that. No extra code please, IMHO. Do you think you'll submit a new version of the patch this commitfest? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Mon, Jul 11, 2011 at 3:30 AM, Josh Berkus j...@agliodbs.com wrote: Fujii, In the current scheme we restart archiving from the last restartpoint, which exists only on the archive. This new patch improves upon this by keeping the most recent files locally, so we are less expose in the case of archive unavailability. So this patch already improves things and we don't need any more than that. No extra code please, IMHO. Do you think you'll submit a new version of the patch this commitfest? Yes. I'm now updating the patch according to Simon's comments. I will submit it today. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs si...@2ndquadrant.com wrote: 1. De-archive the file to RECOVERYXLOG 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename RECOVERYXLOG to the correct name 3. Replay the file with the correct name Yes please, that makes sense. Will do. Those changes will make this code cleaner for the long term. I don't think we should simply shutdown a WALSender when we startup. That is indistinguishable from a failure, which is going to be very worrying if we do a switchover. Is there another way to do this? Or if not, at least a log message to explain it was normal that we requested this. What about outputing something like the following message in that case? if (walsender receives SIGUSR2) ereport(LOG, terminating walsender process due to administrator command); ...which doesn't explain the situation because we don't know why SIGUSR2 was sent. I was thinking of something like this LOG: requesting walsenders for cascading replication reconnect to update timeline Looks better than my proposal. but then I ask: Why not simply send a new message type saying new timeline id is X and that way we don't need to restart the connection at all? Yeah, that's very useful. But I'd like to implement that as a separate patch. I'm thinking that in that case walsender should send the timeline history file and walreceiver should write it down to the disk, instead of just sending timeline ID. Otherwise, when the standby in receive side restarts, it cannot calculate the latest timeline ID correctly. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Wed, Jul 6, 2011 at 8:53 AM, Fujii Masao masao.fu...@gmail.com wrote: What about outputing something like the following message in that case? if (walsender receives SIGUSR2) ereport(LOG, terminating walsender process due to administrator command); ...which doesn't explain the situation because we don't know why SIGUSR2 was sent. I was thinking of something like this LOG: requesting walsenders for cascading replication reconnect to update timeline Looks better than my proposal. but then I ask: Why not simply send a new message type saying new timeline id is X and that way we don't need to restart the connection at all? Yeah, that's very useful. But I'd like to implement that as a separate patch. I'm thinking that in that case walsender should send the timeline history file and walreceiver should write it down to the disk, instead of just sending timeline ID. Otherwise, when the standby in receive side restarts, it cannot calculate the latest timeline ID correctly. OK, happy to do that part as an additional patch. That way we can get this committed soon - in this CF. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Wed, Jul 6, 2011 at 4:53 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs si...@2ndquadrant.com wrote: 1. De-archive the file to RECOVERYXLOG 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename RECOVERYXLOG to the correct name 3. Replay the file with the correct name Yes please, that makes sense. In #2, if the server is killed with SIGKILL just after removing a pre-existing file and before renaming RECOVERYXLOG, we lose the file with correct name. Even in this case, we would be able to restore it from the archive, but what if unfortunately the archive is unavailable? We would lose the file infinitely. So we should introduce the following safeguard? 2'. If RECOVERYXLOG is valid, move a pre-existing file to pg_xlog/backup, rename RECOVERYXLOG to the correct name, and remove the pre-existing file from pg_xlog/backup Currently we give up a recovery if there is the target file in neither the archive nor pg_xlog. But, if we adopt the above safeguard, in that case, we should try to read the file from also pg_xlog/backup. In #2, there is another problem; walsender might have the pre-existing file open, so the startup process would need to request walsenders to close the file before removing (or renaming) it, wait for new file to appear and open it again. This might make the code complicated. Does anyone have better approach? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Wed, Jul 6, 2011 at 12:27 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 6, 2011 at 4:53 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs si...@2ndquadrant.com wrote: 1. De-archive the file to RECOVERYXLOG 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename RECOVERYXLOG to the correct name 3. Replay the file with the correct name Yes please, that makes sense. In #2, if the server is killed with SIGKILL just after removing a pre-existing file and before renaming RECOVERYXLOG, we lose the file with correct name. Even in this case, we would be able to restore it from the archive, but what if unfortunately the archive is unavailable? We would lose the file infinitely. So we should introduce the following safeguard? 2'. If RECOVERYXLOG is valid, move a pre-existing file to pg_xlog/backup, rename RECOVERYXLOG to the correct name, and remove the pre-existing file from pg_xlog/backup Currently we give up a recovery if there is the target file in neither the archive nor pg_xlog. But, if we adopt the above safeguard, in that case, we should try to read the file from also pg_xlog/backup. In #2, there is another problem; walsender might have the pre-existing file open, so the startup process would need to request walsenders to close the file before removing (or renaming) it, wait for new file to appear and open it again. This might make the code complicated. Does anyone have better approach? The risk you describe already exists in current code. I regard it as a non-risk. The unlink() and the rename() are executed consecutively, so the gap between them is small, so the chance of a SIGKILL in that gap at the same time as losing the archive seems low, and we can always get that file from the master again if we are streaming. Any code you add to fix this will get executed so rarely it probably won't work when we need it to. In the current scheme we restart archiving from the last restartpoint, which exists only on the archive. This new patch improves upon this by keeping the most recent files locally, so we are less expose in the case of archive unavailability. So this patch already improves things and we don't need any more than that. No extra code please, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jul 5, 2011 at 4:34 AM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Jul 4, 2011 at 6:24 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao masao.fu...@gmail.com wrote: The standby must not accept replication connection from that standby itself. Otherwise, since any new WAL data would not appear in that standby, replication cannot advance any more. As a safeguard against this, I introduced new ID to identify each instance. The walsender sends that ID as the fourth field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether the IDs are the same between two servers. If they are the same, which means that the standby is just connecting to that standby itself, so walreceiver emits ERROR. Thanks for waiting for review. Thanks for the review! I agree to focus on the main problem first. I removed that. Attached is the updated version. Now for the rest of the review... I'd rather not include another chunk of code related to wal_keep_segments. The existing code in CreateCheckPoint() should be refactored so that we call the same code from both CreateCheckPoint() and CreateRestartPoint(). IMHO it's time to get rid of RECOVERYXLOG as an initial target for de-archived files. That made sense once, but now we have streaming it makes more sense for us to de-archive straight onto the correct file name and let the file be cleaned up later. So de-archiving it and then copying to the new location doesn't seem the right thing to do (especially not to copy rather than rename). RECOVERYXLOG allowed us to de-archive the file without removing a pre-existing file, so we must handle that still - the current patch would fail if a pre-existing WAL file were there. Those changes will make this code cleaner for the long term. I don't think we should simply shutdown a WALSender when we startup. That is indistinguishable from a failure, which is going to be very worrying if we do a switchover. Is there another way to do this? Or if not, at least a log message to explain it was normal that we requested this. It would be possible to have synchronous cascaded replication but that is probably another patch :-) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jul 5, 2011 at 8:08 PM, Simon Riggs si...@2ndquadrant.com wrote: Now for the rest of the review... Thanks! I'd rather not include another chunk of code related to wal_keep_segments. The existing code in CreateCheckPoint() should be refactored so that we call the same code from both CreateCheckPoint() and CreateRestartPoint(). This makes sense. Will do. IMHO it's time to get rid of RECOVERYXLOG as an initial target for de-archived files. That made sense once, but now we have streaming it makes more sense for us to de-archive straight onto the correct file name and let the file be cleaned up later. So de-archiving it and then copying to the new location doesn't seem the right thing to do (especially not to copy rather than rename). RECOVERYXLOG allowed us to de-archive the file without removing a pre-existing file, so we must handle that still - the current patch would fail if a pre-existing WAL file were there. You mean that we must keep a pre-existing file? If so, why do we need to do that? Since it's older than an archived file, it seems to be OK to overwrite it with an archived file. Or is there corner case that an archived file is older than a pre-existing one? If we don't need to keep a pre-existing file, I'll change the way to de-archive according to your suggestion, as follows; 1. Rename a pre-existing file to EXISTINGXLOG 2. De-archive the file onto the correct name 3. If the de-archived file is invalid (i.e., its size is not 16MB), remove it and rename EXISTINGXLOG to the correct name 4. If the de-archived file is valid, remove EXISTINGXLOG 5. Replay the file with the correct name Or 1. De-archive the file to RECOVERYXLOG 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename RECOVERYXLOG to the correct name 3. Replay the file with the correct name Those changes will make this code cleaner for the long term. I don't think we should simply shutdown a WALSender when we startup. That is indistinguishable from a failure, which is going to be very worrying if we do a switchover. Is there another way to do this? Or if not, at least a log message to explain it was normal that we requested this. What about outputing something like the following message in that case? if (walsender receives SIGUSR2) ereport(LOG, terminating walsender process due to administrator command); It would be possible to have synchronous cascaded replication but that is probably another patch :-) Yeah, right. You'll try? ;) Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Wed, Jul 6, 2011 at 2:13 AM, Fujii Masao masao.fu...@gmail.com wrote: IMHO it's time to get rid of RECOVERYXLOG as an initial target for de-archived files. That made sense once, but now we have streaming it makes more sense for us to de-archive straight onto the correct file name and let the file be cleaned up later. So de-archiving it and then copying to the new location doesn't seem the right thing to do (especially not to copy rather than rename). RECOVERYXLOG allowed us to de-archive the file without removing a pre-existing file, so we must handle that still - the current patch would fail if a pre-existing WAL file were there. snip If we don't need to keep a pre-existing file, I'll change the way to de-archive according to your suggestion, as follows; 1. Rename a pre-existing file to EXISTINGXLOG 2. De-archive the file onto the correct name 3. If the de-archived file is invalid (i.e., its size is not 16MB), remove it and rename EXISTINGXLOG to the correct name 4. If the de-archived file is valid, remove EXISTINGXLOG 5. Replay the file with the correct name I'm laughing quite hard here... :-) Or 1. De-archive the file to RECOVERYXLOG 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename RECOVERYXLOG to the correct name 3. Replay the file with the correct name Yes please, that makes sense. Those changes will make this code cleaner for the long term. I don't think we should simply shutdown a WALSender when we startup. That is indistinguishable from a failure, which is going to be very worrying if we do a switchover. Is there another way to do this? Or if not, at least a log message to explain it was normal that we requested this. What about outputing something like the following message in that case? if (walsender receives SIGUSR2) ereport(LOG, terminating walsender process due to administrator command); ...which doesn't explain the situation because we don't know why SIGUSR2 was sent. I was thinking of something like this LOG: requesting walsenders for cascading replication reconnect to update timeline but then I ask: Why not simply send a new message type saying new timeline id is X and that way we don't need to restart the connection at all? It would be possible to have synchronous cascaded replication but that is probably another patch :-) Yeah, right. You'll try? ;) I'll wait for a request... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao masao.fu...@gmail.com wrote: The standby must not accept replication connection from that standby itself. Otherwise, since any new WAL data would not appear in that standby, replication cannot advance any more. As a safeguard against this, I introduced new ID to identify each instance. The walsender sends that ID as the fourth field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether the IDs are the same between two servers. If they are the same, which means that the standby is just connecting to that standby itself, so walreceiver emits ERROR. Thanks for waiting for review. This part of the patch is troubling me. I think you have identified an important problem, but this solution doesn't work fully. If we allow standbys to connect to other standbys then we have problems with standbys not being connected to master. This can occur with a 1-step connection, as you point out, but it could also occur with a 2-step, 3-step or more connection, where a circle of standbys are all depending upon each other. Your solution only works for 1-step connections. Solving that problem in a general sense might be more dangerous than leaving it alone. I think we should think some more about the issues there and approach them as a separate problem. I think we should remove that and just focus on the main problem, for now. That will make it a simpler patch and easier to commit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Cascade replication
On Mon, Jul 4, 2011 at 6:24 PM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao masao.fu...@gmail.com wrote: The standby must not accept replication connection from that standby itself. Otherwise, since any new WAL data would not appear in that standby, replication cannot advance any more. As a safeguard against this, I introduced new ID to identify each instance. The walsender sends that ID as the fourth field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether the IDs are the same between two servers. If they are the same, which means that the standby is just connecting to that standby itself, so walreceiver emits ERROR. Thanks for waiting for review. Thanks for the review! This part of the patch is troubling me. I think you have identified an important problem, but this solution doesn't work fully. If we allow standbys to connect to other standbys then we have problems with standbys not being connected to master. This can occur with a 1-step connection, as you point out, but it could also occur with a 2-step, 3-step or more connection, where a circle of standbys are all depending upon each other. Your solution only works for 1-step connections. Solving that problem in a general sense might be more dangerous than leaving it alone. I think we should think some more about the issues there and approach them as a separate problem. I think we should remove that and just focus on the main problem, for now. That will make it a simpler patch and easier to commit. I agree to focus on the main problem first. I removed that. Attached is the updated version. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1998,2004 SET ENABLE_SEQSCAN TO OFF; doesn't keep any extra segments for standby purposes, and the number of old WAL segments available to standby servers is a function of the location of the previous checkpoint and status of WAL ! archiving. This parameter has no effect on restartpoints. This parameter can only be set in the filenamepostgresql.conf/ file or on the server command line. /para --- 1998,2004 doesn't keep any extra segments for standby purposes, and the number of old WAL segments available to standby servers is a function of the location of the previous checkpoint and status of WAL ! archiving. This parameter can only be set in the filenamepostgresql.conf/ file or on the server command line. /para *** *** 2105,2111 SET ENABLE_SEQSCAN TO OFF; synchronous replication is enabled, individual transactions can be configured not to wait for replication by setting the xref linkend=guc-synchronous-commit parameter to ! literallocal/ or literaloff/. /para para This parameter can only be set in the filenamepostgresql.conf/ --- 2105,2112 synchronous replication is enabled, individual transactions can be configured not to wait for replication by setting the xref linkend=guc-synchronous-commit parameter to ! literallocal/ or literaloff/. This parameter has no effect on ! cascade replication. /para para This parameter can only be set in the filenamepostgresql.conf/ *** a/doc/src/sgml/high-availability.sgml --- b/doc/src/sgml/high-availability.sgml *** *** 877,884 primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' --- 877,921 network delay, or that the standby is under heavy load. /para /sect3 + /sect2 + + sect2 id=cascade-replication +titleCascade Replication/title +indexterm zone=high-availability + primaryCascade Replication/primary +/indexterm +para + Cascade replication feature allows the standby to accept the replication + connections and stream WAL records to another standbys. This is useful + for reducing the number of standbys connecting to the master and reducing + the overhead of the master, when you have many standbys. +/para +para + The cascading standby sends not only WAL records received from the + master but also those restored from the archive. So even if the replication + connection in higher level is terminated, you can continue cascade replication. +/para +para + Cascade replication is asynchronous. Note that synchronous replication + (see xref linkend=synchronous-replication) has no effect on cascade + replication. +/para +para + Promoting the cascading standby terminates all the cascade replication + connections which it uses. This is because the timeline becomes different + between standbys, and they cannot
[HACKERS] Cascade replication (WIP)
Hi, I'd like to propose cascade replication feature (i.e., allow the standby to accept replication connection from another standby) for 9.2. This feature is useful to reduce the overhead of the master since by using that we can decrease the number of standbys directly connecting to the master. I attached the WIP patch, which changes walsender so that it starts replication even during recovery. Then, the walsender attempts to send all WAL that's already been fsync'd to the standby's disk (i.e., send WAL up to the bigger location between the receive location and the replay one). When the standby is promoted, all walsenders in that standby end because they cannot continue replication any more in that case because of the timeline mismatch. The standby must not accept replication connection from that standby itself. Otherwise, since any new WAL data would not appear in that standby, replication cannot advance any more. As a safeguard against this, I introduced new ID to identify each instance. The walsender sends that ID as the fourth field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether the IDs are the same between two servers. If they are the same, which means that the standby is just connecting to that standby itself, so walreceiver emits ERROR. One remaining problem which I'll have to tackle is that: Even while walreceiver is not in progress (i.e., the startup process is retrieving WAL file from the archive), the cascading walsender should continuously send new WAL data. This means that the walsender should send the WAL file restored from the archive. The problem is that the name of such a restored WAL file is always RECOVERYXLOG. For now, walsender cannot handle the WAL file with such a name. To address the above problem, I'm thinking to make the startup process restore the WAL file with its real name instead of RECOVERYXLOG. Then, like in the master, the walsender can read and send the restored WAL file. The required WAL file can be recycled before being sent. So we might need to enable wal_keep_segments setting even in the standby. Comments? Objections? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** *** 1357,1362 The commands accepted in walsender mode are: --- 1357,1374 /listitem /varlistentry + varlistentry + term +identificationkey + /term + listitem + para +Identification key. Also useful to check that the standby is +not connecting to that standby itself. + /para + /listitem + /varlistentry + /variablelist /para /listitem *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 9551,9556 GetXLogReplayRecPtr(void) --- 9551,9572 } /* + * Get current standby flush position, ie, the last WAL position + * known to be fsync'd to disk in standby. + */ + XLogRecPtr + GetStandbyFlushRecPtr(void) + { + XLogRecPtr recvptr; + XLogRecPtr redoptr; + + recvptr = GetWalRcvWriteRecPtr(NULL); + redoptr = GetXLogReplayRecPtr(); + + return XLByteLT(recvptr, redoptr) ? redoptr : recvptr; + } + + /* * Report the last WAL replay location (same format as pg_start_backup etc) * * This is useful for determining how much of WAL is visible to read-only *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *** *** 351,357 static void processCancelRequest(Port *port, void *pkt); static int initMasks(fd_set *rmask); static void report_fork_failure_to_client(Port *port, int errnum); static CAC_state canAcceptConnections(void); - static long PostmasterRandom(void); static void RandomSalt(char *md5Salt); static void signal_child(pid_t pid, int signal); static bool SignalSomeChildren(int signal, int targets); --- 351,356 *** *** 2410,2415 reaper(SIGNAL_ARGS) --- 2409,2423 pmState = PM_RUN; /* + * Kill the cascading walsender to urge the cascaded standby to + * reread the timeline history file, adjust its timeline and + * establish replication connection again. This is required + * because the timeline of cascading standby is not consistent + * with that of cascaded one just after failover. + */ + SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND); + + /* * Crank up the background writer, if we didn't do that already * when we entered consistent recovery state. It doesn't matter * if this fails, we'll just try again later. *** *** 4369,4375 RandomSalt(char *md5Salt) /* * PostmasterRandom */ ! static long PostmasterRandom(void) { /* --- 4377,4383 /* * PostmasterRandom */ ! long PostmasterRandom(void) { /* *** a/src/backend/replication/basebackup.c ---