Hi,

Thanks for the comment!
On Tue, Jul 7, 2009 at 5:07 PM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
> pg_read_xlogfile() feels like a quite hacky way to implement that. Do we
> require the master to always have read access to the PITR archive? And
> indeed, to have a PITR archive configured to begin with. If you need to
> set up archiving just because of the standby server, how do old files
> that are no longer required by the standby get cleaned up?
>
> I feel that the master needs to explicitly know what is the oldest WAL
> file the standby might still need, and refrain from deleting files the
> standby might still need. IOW, keep enough history in pg_xlog. Then we
> have the risk of running out of disk space on pg_xlog if the connection
> to the standby is lost for a long time, so we'll need some cap on that,
> after which the master declares the standby as dead and deletes the old
> WAL anyway. Nevertheless, I think that would be much simpler to
> implement, and simpler for admins. And if the standby can read old WAL
> segments from the PITR archive, in addition to requesting them from the
> primary, it is just as safe.

I'm thinking of making pg_read_xlogfile() read the XLOG files from
pg_xlog when restore_command is not specified or returns a non-zero
exit code (i.e. failure). So, pg_read_xlogfile() with the following
settings might already cover the case you described:

- checkpoint_segments = N (big number)
- restore_command = ''

In this case, we can expect that the XLOG files required by the
standby still exist in pg_xlog because of the big checkpoint_segments,
and pg_read_xlogfile() reads them only from pg_xlog.
checkpoint_segments plays the role of the cap and determines the
maximum disk size of pg_xlog. The overflow files, which should be no
longer required by the standby, are removed safely by postgres.

OTOH, if there is not enough disk space for pg_xlog, we can specify
restore_command and decrease checkpoint_segments. This is the more
flexible approach, I think. But if the primary should never restore
an archived file, should I just get rid of the code by which
pg_read_xlogfile() restores one?

> I'd like to see a description of the proposed master/slave protocol for
> replication. If I understood correctly, you're proposing that the
> standby server connects to the master with libpq like any client,
> authenticates as usual, and then sends a message indicating that it
> wants to switch to "replication mode". In replication mode, normal FE/BE
> messages are not accepted, but there's a different set of message types
> for transferring XLOG data.

http://archives.postgresql.org/message-id/4951108a.5040...@enterprisedb.com

> I don't think we need or should
> allow running regular queries before entering "replication mode". the
> backend should become a walsender process directly after authentication.

I changed the protocol according to your suggestion. Here is the
current protocol:

On start-up, the standby calls PQstartReplication(), which is a new
libpq function. It sends the startup packet with a special code for
replication to the primary, like a cancel request. The backend which
receives this code becomes a walsender directly. Authentication is
performed as normal. Then, walsender switches the XLOG file and sends
the ReplicationStart message 'l', which includes the timeline ID and
the replication start XLOG position.

ReplicationStart (B)
    Byte1('l'): Identifies the message as a replication-start indicator.
    Int32(17): Length of message contents in bytes, including self.
    Int32: The timeline ID.
    Int32: The start log file of replication.
    Int32: The start byte offset of replication.
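To make the wire format concrete, decoding this message on the
receiving side would look roughly like the sketch below. This is only
an illustration: the struct and function names are not from the patch,
and I assume integers on the wire are in network byte order, as
elsewhere in the FE/BE protocol.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>          /* ntohl() */

typedef struct ReplicationStart
{
    uint32_t    timeline;       /* timeline ID */
    uint32_t    logid;          /* start log file of replication */
    uint32_t    recoff;         /* start byte offset of replication */
} ReplicationStart;

/*
 * Decode a ReplicationStart ('l') message from a raw buffer.
 * Returns 0 on success, -1 on a malformed message.
 */
static int
parse_replication_start(const char *buf, size_t buflen,
                        ReplicationStart *msg)
{
    uint32_t    netval;

    /* Total size: 1 (type byte) + 4 (length) + 3 * 4 (payload) = 17. */
    if (buflen < 17 || buf[0] != 'l')
        return -1;

    memcpy(&netval, buf + 1, 4);
    if (ntohl(netval) != 17)    /* Int32(17), per the format above */
        return -1;

    memcpy(&netval, buf + 5, 4);
    msg->timeline = ntohl(netval);
    memcpy(&netval, buf + 9, 4);
    msg->logid = ntohl(netval);
    memcpy(&netval, buf + 13, 4);
    msg->recoff = ntohl(netval);

    return 0;
}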
After that, walsender sends the XLogData message 'w', which includes
the XLOG records, a flag (e.g. indicating whether the records should
be fsynced or not), and the XLOG position, in real time. The standby
receives the message using PQgetXLogData(), which is a new libpq
function. OTOH, after writing or fsyncing the records, the standby
sends the XLogResponse message 'r', which includes the flag and the
position of the written/fsynced records, using PQputXLogRecPtr(),
which is also a new libpq function.

XLogData (B)
    Byte1('w'): Identifies the message as XLOG records.
    Int32: Length of message contents in bytes, including self.
    Int8: Flag bits indicating how the records should be treated.
    Int32: The log file number of the records.
    Int32: The byte offset of the records.
    Byte n: The XLOG records.

XLogResponse (F)
    Byte1('r'): Identifies the message as ACK for XLOG records.
    Int32: Length of message contents in bytes, including self.
    Int8: Flag bits indicating how the records were treated.
    Int32: The log file number of the records.
    Int32: The byte offset of the records.

On normal exit (e.g. by smart shutdown), walsender sends the
ReplicationEnd message 'z'. OTOH, on normal exit, walreceiver sends
the existing Terminate message 'X'. The above protocol is used between
walsender and walreceiver. A simplified sketch of how walreceiver
might drive this loop is at the end of this mail.

> I'd like to see a more formal description of that protocol and the new
> message types. Some examples of how they would be in different
> scenarios, like when standby server connects to the master for the first
> time and needs to catch up.

If there is a missing XLOG file which is required for recovery, the
startup process connects to the primary as a normal client and
receives the binary contents of the file using the following SQL.
This has nothing to do with the above protocol, so the transfer of a
missing file and synchronous XLOG streaming are performed
concurrently.

COPY (SELECT pg_read_xlogfile('filename', true)) TO STDOUT WITH BINARY

If no missing files are found (i.e. recovery of the standby has
reached the replication start position), the file transfer drops out
of use.

> Looking at the patch briefly, it seems to assume that there is only one
> WAL sender active at any time. What happens when a new WAL sender
> connects and one is active already?

The new request is refused because of the existing walsender.

> While supporting multiple slaves
> isn't a priority, I think we should support multiple WAL senders right
> from the start. It shouldn't be much harder, and otherwise we need to
> ensure that the switch from old WAL sender to a new one is clean, which
> seems non-trivial. Or not accept a new WAL sender while old one is still
> active,

Yeah, the current patch doesn't accept a new walsender while the old
one is still active.

> but then a dead WAL sender process (because the standby suddenly
> crashed, for example) would inhibit a new standby from connecting,
> possibly for several minutes.

Yes, a new standby cannot start a walsender until the existing
walsender detects the death of the old standby. You can shorten the
time needed to detect it by setting some timeouts
(replication_timeout and some keepalive parameters). I don't think
it's a problem that a walsender cannot start for a short time. Do you
think that a walsender must *always* be able to start?
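As promised above, here is a simplified sketch of the walreceiver main
loop. Only the function names PQgetXLogData() and PQputXLogRecPtr()
and the message flow come from the patch; the signatures, the flag
bit, and the helper functions below are illustrative guesses, not the
actual interface.

#include <libpq-fe.h>

/* Hypothetical flag bit: records must be fsynced before the ACK. */
#define XLR_FLAG_FSYNC  0x01

/* Proposed libpq calls; these signatures are guesses for this sketch. */
extern int  PQgetXLogData(PGconn *conn, char **records, char *flag,
                          int *logid, int *recoff);
extern int  PQputXLogRecPtr(PGconn *conn, char flag, int logid, int recoff);

/* Application-side helpers, assumed to exist for this sketch. */
extern void write_xlog(int logid, int recoff, const char *records, int len);
extern void fsync_xlog(int logid);

static void
walreceiver_loop(PGconn *conn)
{
    char       *records;
    char        flag;
    int         logid;
    int         recoff;
    int         len;

    /*
     * Block until one XLogData ('w') message arrives; assume the
     * return value is the length of the records, or <= 0 once a
     * ReplicationEnd ('z') arrives or the connection is lost.
     */
    while ((len = PQgetXLogData(conn, &records, &flag, &logid, &recoff)) > 0)
    {
        /* Write the records at the given XLOG position. */
        write_xlog(logid, recoff, records, len);

        /* Honor the fsync request carried in the flag bits. */
        if (flag & XLR_FLAG_FSYNC)
            fsync_xlog(logid);

        /* ACK the written/fsynced records with XLogResponse ('r'). */
        PQputXLogRecPtr(conn, flag, logid, recoff);
    }
}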
Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center