Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

I fixed some bugs.

On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs wrote:
>
> Can we change to IMMEDIATE when we need the checkpoint?

Perhaps yes, though the current patch doesn't handle it. I'm not sure
whether we really need the feature. Yes, as you say, I'd also like to
listen to everybody else.

>
> What is bkpCount for?

So far, the name of a backup history file has consisted only of the
checkpoint redo location. But in this patch, since several backups can
use the same checkpoint, a backup history file could be overwritten.
So I introduced bkpCount as an ID for backups which use the same
checkpoint.

> I think we should discuss whatever that is for
> separately. It isn't used in any if test, AFAICS.

Yes, this patch is a testbed. We need to discuss it more.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -0000	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 18:13:45 -0000
***************
*** 295,300 ****
--- 295,302 ----
      /* Protected by info_lck: */
      XLogwrtRqst LogwrtRqst;
      XLogwrtResult LogwrtResult;
+     uint32      bkpCount;       /* ID of bkp using the same ckpt */
+     bool        bkpForceCkpt;   /* reset full_page_writes since last ckpt? */
      uint32      ckptXidEpoch;   /* nextXID & epoch of latest checkpoint */
      TransactionId ckptXid;
      XLogRecPtr  asyncCommitLSN; /* LSN of newest async commit */
***************
*** 318,323 ****
--- 320,332 ----
  static XLogCtlData *XLogCtl = NULL;
  
  /*
+  * We don't allow more than MAX_BKP_COUNT backups to use the same checkpoint.
+  * If XLogCtl->bkpCount > MAX_BKP_COUNT, we force new checkpoint at pg_standby
+  * even if there are all indispensable full pages since last checkpoint.
+  */
+ #define MAX_BKP_COUNT 256
+ 
+ /*
   * We maintain an image of pg_control in shared memory.
   */
  static ControlFileData *ControlFile = NULL;
***************
*** 6025,6036 ****
      UpdateControlFile();
      LWLockRelease(ControlFileLock);
  
!     /* Update shared-memory copy of checkpoint XID/epoch */
      {
          /* use volatile pointer to prevent code rearrangement */
          volatile XLogCtlData *xlogctl = XLogCtl;
  
          SpinLockAcquire(&xlogctl->info_lck);
          xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
          xlogctl->ckptXid = checkPoint.nextXid;
          SpinLockRelease(&xlogctl->info_lck);
--- 6034,6050 ----
      UpdateControlFile();
      LWLockRelease(ControlFileLock);
  
!     /*
!      * Update shared-memory copy of checkpoint XID/epoch and reset the
!      * variables of backup ID/flag.
!      */
      {
          /* use volati
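The bkpCount explanation in the message above can be illustrated with a small standalone sketch. It is not taken from the patch: the helper names and the file-name layout are hypothetical, and only MAX_BKP_COUNT and the "force a checkpoint once the count exceeds it" rule come from the quoted diff. The point is simply that two backups sharing one checkpoint redo location get distinct history file names once a per-checkpoint ID is appended.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Limit quoted from the patch. */
#define MAX_BKP_COUNT 256

/*
 * Hypothetical helper (not in the patch): build a backup history file name
 * from the checkpoint redo location plus the per-checkpoint backup ID.
 * The name layout is illustrative only and is not PostgreSQL's real format.
 */
static int
backup_history_name(char *buf, size_t len,
                    uint32_t redo_log, uint32_t redo_off, uint32_t bkp_count)
{
    return snprintf(buf, len, "%08X.%08X.%03u.backup",
                    (unsigned) redo_log, (unsigned) redo_off,
                    (unsigned) bkp_count);
}

/*
 * Following the comment in the patch: once more than MAX_BKP_COUNT backups
 * have used the same checkpoint, a new checkpoint is forced.
 */
static int
must_force_checkpoint(uint32_t bkp_count)
{
    return bkp_count > MAX_BKP_COUNT;
}
```

With the same redo location, IDs 0 and 1 yield different names, so the second backup no longer overwrites the first one's history file.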
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote:
> Hi,
>
> On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao wrote:
> > Hi,
> >
> > On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs wrote:
> >> Yes, OK. So I think it would only work when full_page_writes = on, and
> >> has been on since last checkpoint. So two changes:
> >>
> >> * We just need a boolean that starts at true every checkpoint and gets
> >> set to false anytime someone resets full_page_writes or archive_command.
> >> If the flag is set && full_page_writes = on then we skip the checkpoint
> >> entirely and use the value from the last checkpoint.
> >
> > Sounds good.
>
> I attached the self-contained patch to skip checkpoint at pg_start_backup.

Good.

Can we change to IMMEDIATE when we need the checkpoint?

What is bkpCount for? I think we should discuss whatever that is for
separately. It isn't used in any if test, AFAICS.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao wrote:
> Hi,
>
> On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs wrote:
>> Yes, OK. So I think it would only work when full_page_writes = on, and
>> has been on since last checkpoint. So two changes:
>>
>> * We just need a boolean that starts at true every checkpoint and gets
>> set to false anytime someone resets full_page_writes or archive_command.
>> If the flag is set && full_page_writes = on then we skip the checkpoint
>> entirely and use the value from the last checkpoint.
>
> Sounds good.

I attached the self-contained patch to skip checkpoint at pg_start_backup.

>
> pg_start_backup on the standby (probably you are planning?) also needs
> this logic? If so, resetting full_page_writes or archive_command should
> generate its xlog.

For now, the patch doesn't handle this.

>
> I have another thought: should we forbid the reset of archive_command
> during online backup? Currently we can do. If we don't need to do so,
> we also don't need to track the reset of archiving for fast pg_start_backup.

For now, the patch doesn't handle this either.

Happy Holidays!

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -0000	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 14:57:27 -0000
***************
*** 295,300 ****
--- 295,302 ----
      /* Protected by info_lck: */
      XLogwrtRqst LogwrtRqst;
      XLogwrtResult LogwrtResult;
+     uint32      bkpCount;       /* ID of bkp using the same ckpt */
+     bool        bkpForceCkpt;   /* reset full_page_writes since last ckpt? */
      uint32      ckptXidEpoch;   /* nextXID & epoch of latest checkpoint */
      TransactionId ckptXid;
      XLogRecPtr  asyncCommitLSN; /* LSN of newest async commit */
***************
*** 6025,6036 ****
      UpdateControlFile();
      LWLockRelease(ControlFileLock);
  
!     /* Update shared-memory copy of checkpoint XID/epoch */
      {
          /* use volatile pointer to prevent code rearrangement */
          volatile XLogCtlData *xlogctl = XLogCtl;
  
          SpinLockAcquire(&xlogctl->info_lck);
          xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
          xlogctl->ckptXid = checkPoint.nextXid;
          SpinLockRelease(&xlogctl->info_lck);
--- 6027,6043 ----
      UpdateControlFile();
      LWLockRelease(ControlFileLock);
  
!     /*
!      * Update shared-memory copy of checkpoint XID/epoch and reset the
!      * variables of backup ID/flag.
!      */
      {
          /* use volatile pointer to prevent code rearrangement */
          volatile XLogCtlData *xlogctl = XLogCtl;
  
          SpinLockAcquire(&xlogctl->info_lck);
+         xlogctl->bkpCount = 0;
+         xlogctl->bkpForceCkpt = true;
          xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
          xlogctl->ckptXid = checkPoint.nextXid;
          SpinLockRelease(&xlogctl->info_lck);
***************
*** 6502,6507 ****
--- 6509,6535 ----
      }
  }
  
+ bool
+ assign_full_page_writes(bool newval, bool doit, GucSource source)
+ {
+     /*
+      * If full_page_writes is reset, since all indispensable full pages
+      * might not be written since last checkpoint, we force a checkpoint
+      * at pg_start_backup.
+      */
+     if (doit && fullPageWrites != newval)
+     {
+         /* use volati
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs wrote:
> Yes, OK. So I think it would only work when full_page_writes = on, and
> has been on since last checkpoint. So two changes:
>
> * We just need a boolean that starts at true every checkpoint and gets
> set to false anytime someone resets full_page_writes or archive_command.
> If the flag is set && full_page_writes = on then we skip the checkpoint
> entirely and use the value from the last checkpoint.

Sounds good.

Does pg_start_backup on the standby (which you are probably planning?)
also need this logic? If so, resetting full_page_writes or
archive_command should generate its own xlog record.

I have another thought: should we forbid the reset of archive_command
during online backup? Currently we can do so. If we don't need to do so,
we also don't need to track the reset of archiving for fast
pg_start_backup.

>
> * My "infra" patch also had a modified version of pg_start_backup() that
> allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
> a waste of time, and I want to listen to everybody else now and change
> pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
> there.
>
> Can you work on those also?

Umm.. I'm busy. Of course, I will try it if no one else raises his or
her hand. But I'd like to put coding the core of synch rep ahead of
this.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:
> > We might ask why pg_start_backup() needs to perform checkpoint though,
> > since you have remarked that is a problem also.
> >
> > The answer is that it doesn't really need to, we just need to be certain
> > that archiving has been running since whenever we choose as the start
> > time. So we could easily just use the last normal checkpoint time, as
> > long as we had some way of tracking the archiving.
> >
> > ISTM we can solve the checkpoint problem more easily and it would
> > potentially save much more time than "tuning rsync for Postgres", which
> > is what the other idea amounted to. So I do see a solution that is both
> > better and more quickly achievable for 8.4.
>
> Sounds good. I agree that pg_start_backup basically doesn't need
> checkpoint. But, for full_page_write == off, we probably cannot get
> rid of it. Even if full_page_write == on, since we cannot make out
> whether all indispensable full pages were written after last checkpoint,
> pg_start_backup must do checkpoint with "forcePageWrite = on".

Yes, OK. So I think it would only work when full_page_writes = on, and
has been on since last checkpoint. So two changes:

* We just need a boolean that starts at true every checkpoint and gets
  set to false anytime someone resets full_page_writes or archive_command.
  If the flag is set && full_page_writes = on then we skip the checkpoint
  entirely and use the value from the last checkpoint.

* My "infra" patch also had a modified version of pg_start_backup() that
  allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
  a waste of time, and I want to listen to everybody else now and change
  pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
  there.

Can you work on those also?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
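The first of the two changes proposed in the message above can be sketched as a tiny state machine. This is only an illustration of the rule as worded there; the struct, field, and function names are hypothetical and do not correspond to the patch's bkpForceCkpt variable or to any PostgreSQL API. The sketch also assumes that only an actual change of value counts as a "reset" of full_page_writes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state; names are illustrative, not the patch's. */
typedef struct BackupCkptState
{
    bool        full_page_writes;       /* current setting */
    bool        can_reuse_last_ckpt;    /* true at each checkpoint */
} BackupCkptState;

/* Every checkpoint starts the flag at true again. */
static void
on_checkpoint(BackupCkptState *s)
{
    s->can_reuse_last_ckpt = true;
}

/* Changing full_page_writes clears the flag until the next checkpoint. */
static void
on_full_page_writes_change(BackupCkptState *s, bool newval)
{
    if (s->full_page_writes != newval)
        s->can_reuse_last_ckpt = false;
    s->full_page_writes = newval;
}

/* Resetting archive_command clears the flag as well. */
static void
on_archive_command_change(BackupCkptState *s)
{
    s->can_reuse_last_ckpt = false;
}

/* "If the flag is set && full_page_writes = on then we skip the checkpoint". */
static bool
backup_can_skip_checkpoint(const BackupCkptState *s)
{
    return s->can_reuse_last_ckpt && s->full_page_writes;
}
```

Note that turning full_page_writes off and back on between two checkpoints still leaves the flag cleared, which matches the requirement that the setting "has been on since last checkpoint".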
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Mon, Dec 22, 2008 at 1:29 PM, Fujii Masao wrote:
> Not so simple.
>
> At least the primary has to additionally maintain the byte position the
> standby has already fsynced. The main difference from the current patch
> is whether the standby fsyncs the logfile when it fills even if you
> don't choose #4(fsync).
> In order to prevent from having to go back and re-open prior logfiles
> when an fsync request comes along later, we would need to ignore the
> sync mode and make the standby fsync the logfile when it fills. This
> would degrade the performance periodically. Is this acceptable?
>
> I think there are four choices. Which do you prefer?
>
> 1) Accept the above change.
> 2) Go back and re-open prior logfiles when a fsync request comes along.
> 3) Stop the sync control by the primary and leave it to the standby.
> 4) Add new option to specify whether to permit optimistic fsync, this option
>    makes the standby fsync only the current logfile when a fsync request
>    comes along (don't go back and re-open prior logfiles).
>
> 2) would cause another performance degradation. 4) would furthermore
> confuse users about setting a sync mode. So, I prefer 3) though I'm sorry
> for digging up the discussion about transaction control. Please feel free
> to comment!

5) Only allow optimistic fsync

I'm going to adopt 5) for the next patch, at least for a while.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
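The difference between choice 2) and the "optimistic fsync" of choices 4) and 5) can be made concrete with a rough sketch. Everything here is an assumption made for illustration: the 16 MB logfile size, the byte-position model, and all names are hypothetical and are not taken from the patch. The function only counts how many logfiles the standby would have to fsync for one request under each policy.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed logfile size for this sketch (16 MB). */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

typedef enum
{
    FSYNC_REOPEN_PRIOR,     /* choice 2: go back and re-open prior logfiles */
    FSYNC_OPTIMISTIC        /* choices 4/5: fsync only the current logfile */
} FsyncPolicy;

/*
 * Number of logfiles the standby fsyncs for a request up to byte position
 * "request", given that every byte before "flushed" is already durable.
 */
static uint64_t
logfiles_to_fsync(uint64_t flushed, uint64_t request, FsyncPolicy policy)
{
    uint64_t    first;
    uint64_t    last;

    if (request <= flushed)
        return 0;               /* nothing new to make durable */
    if (policy == FSYNC_OPTIMISTIC)
        return 1;               /* current logfile only */

    first = flushed / SEG_SIZE;         /* first logfile with unsynced bytes */
    last = (request - 1) / SEG_SIZE;    /* logfile holding the last byte */
    return last - first + 1;
}
```

Under the optimistic policy the cost of a request is constant, at the price of not covering bytes in earlier logfiles that were never fsynced; choice 1) avoids that gap by fsyncing each logfile when it fills.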
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Wed, Dec 24, 2008 at 2:37 AM, Simon Riggs wrote:
>
> On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:
>
>> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We should
>> rethink the question? "Why does the failed server always need a fresh
>> backup?" Though we discussed it previously and concluded that it should
>> be done next time.
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php
>
> We might ask why pg_start_backup() needs to perform checkpoint though,
> since you have remarked that is a problem also.
>
> The answer is that it doesn't really need to, we just need to be certain
> that archiving has been running since whenever we choose as the start
> time. So we could easily just use the last normal checkpoint time, as
> long as we had some way of tracking the archiving.
>
> ISTM we can solve the checkpoint problem more easily and it would
> potentially save much more time than "tuning rsync for Postgres", which
> is what the other idea amounted to. So I do see a solution that is both
> better and more quickly achievable for 8.4.

Sounds good. I agree that pg_start_backup basically doesn't need
checkpoint. But, for full_page_write == off, we probably cannot get
rid of it. Even if full_page_write == on, since we cannot make out
whether all indispensable full pages were written after last checkpoint,
pg_start_backup must do checkpoint with "forcePageWrite = on".

The problem is that online backup itself is unsafe. Even if there is no
disk failure (i.e. the normal case), we can easily produce a partial
write in an online backup. So we always need full pages when recovering
from an online backup, and pg_start_backup therefore always needs a
checkpoint with forcePageWrite = on.

I think that we probably have to track the history of full_page_write in
order to get rid of the checkpoint in pg_start_backup.

On the other hand, the data after a crash other than a media crash is
"safe". Currently, we can recover it without full page writes, as in the
simple crash recovery case. I think that we can also use it for archive
recovery, because there isn't really any distinction between the two.
I've not found a corner case yet. Have you?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:

> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We should
> rethink the question? "Why does the failed server always need a fresh
> backup?" Though we discussed it previously and concluded that it should
> be done next time.
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

We might ask why pg_start_backup() needs to perform checkpoint though,
since you have remarked that is a problem also.

The answer is that it doesn't really need to, we just need to be certain
that archiving has been running since whenever we choose as the start
time. So we could easily just use the last normal checkpoint time, as
long as we had some way of tracking the archiving.

ISTM we can solve the checkpoint problem more easily and it would
potentially save much more time than "tuning rsync for Postgres", which
is what the other idea amounted to. So I do see a solution that is both
better and more quickly achievable for 8.4.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Wed, Dec 24, 2008 at 12:38 AM, Simon Riggs wrote:
> Perhaps, but why do you say that?

Since you have often pointed out that getting a backup is not a problem
because of incremental backup (e.g. rsync), I just thought so.

> I've not blocked you from adding
> anything useful to Postgres.

Yes, I see.

> You scare me that you see failover as sufficiently frequent that you are
> worried that being without one of the servers for an extra 60 seconds
> during a failover is a problem. And then say you're not going to add the
> feature after all. I really don't understand. If it's important, add the
> feature, the whole feature that is. If not, don't.

Oh, sorry. I don't want to scare you ;) But, yes, it's important. We should
rethink the question? "Why does the failed server always need a fresh
backup?" Though we discussed it previously and concluded that it should
be done next time.
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

> My expectation is that most failovers are serious ones, that the primary
> system is down and not coming back very fast. Your worries seem to come
> from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

As you say, not *all* failovers are serious ones. I think that a user
would choose the most convenient restart method according to his or her
situation (come back immediately? need careful diagnosis?).

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
Simon Riggs wrote:
> You scare me that you see failover as sufficiently frequent that you are
> worried that being without one of the servers for an extra 60 seconds
> during a failover is a problem. And then say you're not going to add the
> feature after all. I really don't understand. If it's important, add the
> feature, the whole feature that is. If not, don't.
>
> My expectation is that most failovers are serious ones, that the primary
> system is down and not coming back very fast. Your worries seem to come
> from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

Hi Simon:

I agree with most of your criticism of the "fail over only approach" -
but I don't agree that fail over frequency should really impact
expectations for the failed system to return to service. I see "soft"
fails (*not* serious) as potentially common - somewhere on the network,
something went down or some packet was lost, and the system took a few
too many seconds to respond. My expectation is that the system can
quickly detect that the node is out of service and remove it from the
pool, and that when the situation is resolved (often automatically,
outside of my control) the node can automatically "catch up" and be put
back into the pool. Having to run some other process such as rsync seems
unreliable, as we already have a mechanism for streaming the data. All
that is missing is streaming from an earlier point in time to catch up
efficiently and reliably.

I think I'm talking more about the complete solution though, which is in
line with what you are saying? :-)

Cheers,
mark

--
Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, 2008-12-23 at 23:31 +0900, Fujii Masao wrote:
> Hi,
>
> On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs wrote:
> > I'm happy if that whole feature is added. If we do add it, it will be a
> > utility like "pg_resync". So in admin terms it will be almost identical
> > to using rsync, just a specific version that minimizes effort even more
> > than rsync does currently. The only difference as I see it would be some
> > gain in performance, but we don't need to send the whole database down
> > the wire again in either case.
>
> I think that the type of your user is different from mine.

Perhaps, but why do you say that? I've not blocked you from adding
anything useful to Postgres.

> If server fails
> by simple termination of process, I don't want to spend 1min for
> restarting other than catching up itself. For me, getting a fresh backup
> (not only copying backup data but also checkpoint by pg_start_backup)
> is expensive operation.

As I said: "I'm happy if that whole feature is added."

You scare me that you see failover as sufficiently frequent that you are
worried that being without one of the servers for an extra 60 seconds
during a failover is a problem. And then say you're not going to add the
feature after all. I really don't understand. If it's important, add the
feature, the whole feature that is. If not, don't.

My expectation is that most failovers are serious ones, that the primary
system is down and not coming back very fast. Your worries seem to come
from a scenario where the primary system is still up but Postgres
bounces/crashes, we can diagnose the cause of the crash, decide the
crashed server is safe and then wish to recommence operations on it
again as quickly as possible, where seconds count in doing so.

Are failovers going to be common? Why?

> Of course, since I'm not planning to tackle that problem in 8.4,

If you change your mind, having it in 8.4 would be good.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Tue, Dec 23, 2008 at 11:31 PM, Fujii Masao wrote:
> Of course, since I'm not planning to tackle that problem in 8.4,
> I would not add "additional" synchronization point.

On second thought: for the normal shutdown case, we should probably
force synchronous replication in CreateCheckPoint at least.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs wrote:
> I'm happy if that whole feature is added. If we do add it, it will be a
> utility like "pg_resync". So in admin terms it will be almost identical
> to using rsync, just a specific version that minimizes effort even more
> than rsync does currently. The only difference as I see it would be some
> gain in performance, but we don't need to send the whole database down
> the wire again in either case.

I think that the type of your user is different from mine. If the server
fails by simple termination of the process, I don't want to spend 1 min
on restarting, other than on the catch-up itself. For me, getting a fresh
backup (not only copying backup data but also the checkpoint by
pg_start_backup) is an expensive operation.

Of course, since I'm not planning to tackle that problem in 8.4,
I would not add an "additional" synchronization point.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, 2008-12-23 at 18:36 +0530, Pavan Deolasee wrote:

> Personally, I would like to have a
> simple setup where I can initially setup primary and standby and they
> continue to work in a single-failure mode without any additional
> administrative overhead (such as rsync). But that's just me and I
> don't know what the preferred option is in the field.

If you want a tripod, you need to turn up with all 3 legs. :-)

PostgreSQL is a working product, not a framework or a function library.
We're not going to add code that has no function at all other than as
part of a larger feature, unless we add the whole feature.

I'm happy if that whole feature is added. If we do add it, it will be a
utility like "pg_resync". So in admin terms it will be almost identical
to using rsync, just a specific version that minimizes effort even more
than rsync does currently. The only difference as I see it would be some
gain in performance, but we don't need to send the whole database down
the wire again in either case.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, Dec 23, 2008 at 5:55 PM, Simon Riggs wrote:
>
>
> We stream constantly from primary to standby. That point is not being
> debated. The issue is whether we should add additional synchronisation
> points (i.e. additional times we need to wait) into the WAL stream.
> Currently, I have said no because this has no purpose in the current
> design: definitely not performance, not robustness, not code clarity.
>
> Specifically, we're talking about slowing down WAL flushes required
> because of dirty page replacement, amongst others. That's not something
> I want to see slowed down on a server that has specifically opted for
> asynchronous replication, presumably because of a slow link. The other
> call points are also potential contention points.

So we would still be sending WAL to the standby at XLogWrite time (and I
think that's necessary). The question is whether we should wait for the
standby's ack at XLogFlush time, right?

Hmm. I think the argument for that would be what Fujii-san described for
maintaining consistency between data and WAL. I agree with you that we
should add additional synchronization points only if they give us any
real value in administering a replication setup. Personally, I would
like to have a simple setup where I can initially set up primary and
standby and they continue to work in a single-failure mode without any
additional administrative overhead (such as rsync). But that's just me
and I don't know what the preferred option is in the field.

BTW, I wouldn't be too worried about the dirty buffer case, because the
WAL synchronization at that point usually occurs much later than the WAL
is actually sent to the standby. I would imagine that most of the time
the WAL would have made it to the standby by then.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, 2008-12-23 at 16:54 +0530, Pavan Deolasee wrote: > On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao wrote: > > > > But, since I cannot obtain consensus from hackers including you, > > I would change my course, and forbid XLogFlush (called from other > > than RecordTransactionCommit) to replicate xlog synchronously > > if asynchronous replication case. > > Since synchronous/asynchronous behavior of replication is tied to a > transaction (even if there is global default) , I don't understand why > we should not ship the xlogs to the standby when xlogs are written on > primary outside of a transaction context. This is quite same as we do > with asynchronous_commit where we flush the xlog to disk at certain > points irrespective of the synchronization set. We stream constantly from primary to standby. That point is not being debated. The issue is whether we should add additional synchronisation points (i.e. additional times we need to wait) into the WAL stream. Currently, I have said no because this has no purpose in the current design: definitely not performance, not robustness, not code clarity. Specifically, we're talking about slowing down WAL flushes required because of dirty page replacement, amongst others. That's not something I want to see slowed down on a server that has specifically opted for asynchronous replication, presumably because of a slow link. The other call points are also potential contention points. > > Yes, switchover is one of case example I care. Typically, I care > > about restarting the failed server (original primary) after failover: > > > > I think this is a very important requirement because it's quite > unrealistic to expect that every time there is a failover, fresh > backup is required for the old primary to join back the replication. I personally don't expect that, because we have rsync. If that is a very important requirement then the current software needs to include all the aspects of a feature, not just some of them. 
Either we include a whole feature or we leave it out. A release will need to stand for 5+ years, so supporting extraneous features is troublesome and wasteful. Currently, Fujii-san has stated he is not planning to allow fast resynchronization in 8.4, so why would we need this? If we were to add fast resynchronisation as a feature in 8.4, then I will be happy to have *all* required changes included. People mention it enough that I would be happy to see the whole feature added in this release.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao wrote: > > > But, since I cannot obtain consensus from hackers including you, > I would change my course, and forbid XLogFlush (called from other > than RecordTransactionCommit) to replicate xlog synchronously > if asynchronous replication case. > Since synchronous/asynchronous behavior of replication is tied to a transaction (even if there is global default) , I don't understand why we should not ship the xlogs to the standby when xlogs are written on primary outside of a transaction context. This is quite same as we do with asynchronous_commit where we flush the xlog to disk at certain points irrespective of the synchronization set. > Yes, switchover is one of case example I care. Typically, I care > about restarting the failed server (original primary) after failover: > I think this is a very important requirement because it's quite unrealistic to expect that every time there is a failover, fresh backup is required for the old primary to join back the replication. > - > 1. a dirty buffer page is chosen as victim of buffer replacement > 2. flush xlog up to the buffer's LSN on only primary > 3. write out the dirty buffer page > 4. primary fails >(replication up to buffer's LSN is not performed) > > The above case produces inconsistency between data on the > original primary (failed server) and xlogs on the original standby > (new primary after failover). Isn't this right? > Yes, it would create inconsistency which I don't think can be corrected without a fresh backup. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Tue, Dec 23, 2008 at 6:28 PM, Simon Riggs wrote:
> On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote:
>> > I don't get this argument. Why would we care what happens on the
>> > failed server?
>>
>> It's because, in the future, I'd like to use the data on the failed
>> server when making it catch up with new primary. This desire might be
>> violated by the inconsistency which I described.
>
> I don't really understand why you would put something in there that has
> no use at all. Why make every server in the world do extra
> synchronisation?
>
> Whatever you build in the future can include this, if that is still a
> required point at the time you add the new feature.

Right. But since it is difficult to change a specification once it is fixed, I am thinking it through now, for the future.

However, since I cannot obtain consensus from hackers including you, I will change course and forbid XLogFlush (when called from anywhere other than RecordTransactionCommit) from replicating xlog synchronously in the asynchronous replication case.

BTW, here are the callers other than RecordTransactionCommit:

- CreateCheckPoint()
- EndPrepare()
- FlushBuffer()
- RecordTransactionAbortPrepared()
- RecordTransactionCommitPrepared()
- RelationTruncate()
- SlruPhysicalWritePage()
- WriteTruncateXlogRec()
- XLogAsyncCommitFlush()

> Are you thinking about switchover rather than failover? I'm sure a
> graceful switchover doesn't need this.

Yes, switchover is one of the cases I care about. Typically, I care about restarting the failed server (the original primary) after failover:

1. a dirty buffer page is chosen as the victim of buffer replacement
2. xlog is flushed up to the buffer's LSN on the primary only
3. the dirty buffer page is written out
4. the primary fails (replication up to the buffer's LSN has not been performed)

The above case produces an inconsistency between the data on the original primary (the failed server) and the xlogs on the original standby (the new primary after failover). Isn't this right?

5. restart the failed server and make it catch up with the new primary

We cannot reuse the existing data on the failed server because of that inconsistency. I think this restriction should be removed.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
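
The failure scenario in this message can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `Node` class, the integer LSNs and the `old_primary_is_reusable` function are hypothetical names invented for illustration and are not part of the patch or of PostgreSQL.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # LSNs of WAL records that are durably stored on this node (hypothetical model)
    wal: list = field(default_factory=list)
    # page id -> LSN of the last change that reached the data file on disk
    pages: dict = field(default_factory=dict)


def old_primary_is_reusable(ship_wal_before_page_write: bool) -> bool:
    """Replay steps 1-4 of the scenario and report whether the failed
    primary's data files are consistent with the WAL the standby holds."""
    primary, standby = Node(), Node()

    # Step 1: a dirty page "A" whose last change has LSN 1 is chosen as victim.
    page_id, page_lsn = "A", 1

    # Step 2: WAL is flushed up to the page's LSN on the primary.
    primary.wal.append(page_lsn)
    if ship_wal_before_page_write:
        # The variant argued for in the message: also replicate before step 3.
        standby.wal.append(page_lsn)

    # Step 3: the dirty page is written out on the primary.
    primary.pages[page_id] = page_lsn

    # Step 4: the primary fails; the standby is promoted with the WAL it has.
    standby_last_lsn = max(standby.wal, default=0)

    # The old data files can be reused only if no page on disk reflects a
    # change that the new primary never received.
    return all(lsn <= standby_last_lsn for lsn in primary.pages.values())


if __name__ == "__main__":
    print(old_primary_is_reusable(ship_wal_before_page_write=False))
    print(old_primary_is_reusable(ship_wal_before_page_write=True))
```

In this model the first call reports an inconsistency (the page on the old primary is ahead of the standby's WAL), while the second does not. That is the trade-off being debated: the extra wait before step 3 buys reusable data on the failed server.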
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote:
> > I don't get this argument. Why would we care what happens on the
> > failed server?
>
> It's because, in the future, I'd like to use the data on the failed
> server when making it catch up with new primary. This desire might be
> violated by the inconsistency which I described.

I don't really understand why you would put something in there that has no use at all. Why make every server in the world do extra synchronisation?

Whatever you build in the future can include this, if that is still a required point at the time you add the new feature.

Are you thinking about switchover rather than failover? I'm sure a graceful switchover doesn't need this.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, On Tue, Dec 23, 2008 at 5:22 PM, Simon Riggs wrote: > On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote: > >> > XLogFlush() flushes because of an interlock between a dirty buffer write >> > and an outstanding WAL write. Dirty buffer writes are not replicated, so >> > there is no need to have a similar interlock on WAL streaming. >> > >> > So making those call points synchronous is possible, but neither >> > necessary or IMHO desirable. >> >> Yes in upcoming 8.4, but probably no in the future. >> >> What if the primary fails after writing the dirty data buffer before sending >> the corresponding logs? This would make data on the primary and logs >> on the standby inconsistent. In 8.4, such inconsistency might not matter >> because we don't use the data on the failed primary for recovery (when >> restarting the failed server, we always need a fresh backup). But, since >> this restriction is not good for some people, in the future, the failed >> server >> should restart without a fresh backup, and the inconsistency would be >> problem. So, I think that the inconsistency should be removed even if >> asynchronous replication case, and we should enforce "WAL rule" over >> some servers. > > I don't get this argument. Why would we care what happens on the failed > server? It's because, in the future, I'd like to use the data on the failed server when making it catch up with new primary. This desire might be violated by the inconsistency which I described. > > The additional synchronizations you suggest are neither necessary, nor > IMHO desirable. Not additional. It's quite analogous to synchronous_commit. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote: > > XLogFlush() flushes because of an interlock between a dirty buffer write > > and an outstanding WAL write. Dirty buffer writes are not replicated, so > > there is no need to have a similar interlock on WAL streaming. > > > > So making those call points synchronous is possible, but neither > > necessary or IMHO desirable. > > Yes in upcoming 8.4, but probably no in the future. > > What if the primary fails after writing the dirty data buffer before sending > the corresponding logs? This would make data on the primary and logs > on the standby inconsistent. In 8.4, such inconsistency might not matter > because we don't use the data on the failed primary for recovery (when > restarting the failed server, we always need a fresh backup). But, since > this restriction is not good for some people, in the future, the failed server > should restart without a fresh backup, and the inconsistency would be > problem. So, I think that the inconsistency should be removed even if > asynchronous replication case, and we should enforce "WAL rule" over > some servers. I don't get this argument. Why would we care what happens on the failed server? The additional synchronizations you suggest are neither necessary, nor IMHO desirable. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Wed, Dec 17, 2008 at 12:07 PM, Fujii Masao wrote:
>> No, we've been through that loop already a few months back:
>> Transaction-controlled robustness.
>>
>> It should be up to the client on the primary to decide how much waiting
>> they would like to perform in order to provide a guarantee. A change of
>> setting on the standby should not be allowed to alter the performance or
>> durability on the primary.
>
> OK. I will extend synchronous_replication, make walsender send XLOG
> with synchronization mode flag and make walreceiver perform according
> to the flag.

It is not so simple. At the least, the primary has to additionally keep track of the byte position up to which the standby has already fsynced.

The main difference from the current patch is whether the standby fsyncs a logfile when it fills, even if you don't choose #4 (fsync). To avoid having to go back and re-open prior logfiles when an fsync request comes along later, we would need to ignore the sync mode and make the standby fsync the logfile whenever it fills. This would degrade performance periodically. Is this acceptable?

I think there are four choices. Which do you prefer?

1) Accept the above change.
2) Go back and re-open prior logfiles when an fsync request comes along.
3) Stop the sync control by the primary and leave it to the standby.
4) Add a new option to specify whether to permit optimistic fsync. This option makes the standby fsync only the current logfile when an fsync request comes along (it does not go back and re-open prior logfiles).

2) would cause another performance degradation. 4) would further confuse users about setting a sync mode. So, I prefer 3), though I'm sorry for digging up the discussion about transaction control.

Please feel free to comment!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
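
To make the difference between choices 1) and 2) concrete, here is a rough sketch that counts how many prior logfiles the standby would have to re-open. It models byte positions only and touches no real files; `StandbyReceiver`, `segment_size` and `fsync_on_fill` are hypothetical names chosen for illustration, not identifiers from the patch.

```python
class StandbyReceiver:
    """Toy model of a standby that receives WAL into fixed-size logfiles."""

    def __init__(self, segment_size: int, fsync_on_fill: bool):
        self.segment_size = segment_size
        self.fsync_on_fill = fsync_on_fill
        self.received = 0            # bytes written so far (maybe not fsynced)
        self.fsynced = 0             # byte position known to be on disk
        self.reopened_segments = 0   # prior logfiles re-opened just to fsync

    def receive(self, nbytes: int) -> None:
        end = self.received + nbytes
        if self.fsync_on_fill:
            # Choice 1: every logfile completed by this write is fsynced
            # before the receiver moves on to the next one.
            last_boundary = (end // self.segment_size) * self.segment_size
            self.fsynced = max(self.fsynced, last_boundary)
        self.received = end

    def fsync_request(self) -> int:
        """Handle an fsync request from the primary and return the fsynced
        byte position that the primary would then have to remember."""
        current_segment = self.received // self.segment_size
        current_start = current_segment * self.segment_size
        if self.fsynced < current_start:
            # Choice 2: older logfiles still hold unsynced data, so each of
            # them has to be re-opened and fsynced.
            first_unsynced = self.fsynced // self.segment_size
            self.reopened_segments += current_segment - first_unsynced
        self.fsynced = self.received
        return self.fsynced


if __name__ == "__main__":
    for on_fill in (True, False):
        r = StandbyReceiver(segment_size=16, fsync_on_fill=on_fill)
        r.receive(40)                      # fills two logfiles and part of a third
        position = r.fsync_request()
        print(on_fill, position, r.reopened_segments)
```

With fsync-on-fill the request only touches the current logfile. Without it, the two completed logfiles have to be re-opened, which is the extra cost attributed to choice 2).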
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Simon Riggs wrote:
> The second way can be done by taking a snapshot on the primary, with an
> associated LSN, then using that snapshot on the standby. That is
> somewhat complex, but possible. I see the requirement for getting the
> same answer on multiple nodes as a further extension of "transaction
> isolation mode" and think that not all people will want this, so we
> should allow that as an option.

I've been thinking a bit about this pretty interesting idea. It's certainly of interest for Postgres-R as well.

AFAIK a function could simply wait until the node being queried reaches a given point in the application of transactions (an LSN, in the Sync-Rep world). Calling such a waiting function just after BEGIN would ensure that the transaction sees (at least) the given snapshot. If that snapshot has already been reached or passed, the function does nothing.

What I like is that it's optimistic, in that the wait is only enforced when needed by the reader. However, unlike enforcing the wait before COMMIT, it requires changing the application to cope with this behavior of the distributed database system. And knowing when to require which snapshot sounds rather difficult from the point of view of the application developer. Also note that it might be the issuer of the transaction who wants to ensure "his" transaction got propagated to the remote nodes.

> I'm not going to worry about this at the moment. Hot standby will be
> useful without this and so I regard this as a secondary objective. Rome
> wasn't built in a single release, or something like that.

Sounds like a decent plan. Good luck.

Regards

Markus Wanner
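
The waiting function described above might look roughly like the following. This is a sketch only: `ReplayProgress`, `advance` and `wait_for_lsn` are hypothetical names, LSNs are modelled as plain integers, and a real implementation would live inside the server rather than in client code.

```python
import threading


class ReplayProgress:
    """Tracks how far a node has applied transactions (hypothetical model)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._applied_lsn = 0

    def advance(self, lsn: int) -> None:
        # Called by the apply process each time it has replayed up to `lsn`.
        with self._cond:
            if lsn > self._applied_lsn:
                self._applied_lsn = lsn
                self._cond.notify_all()

    def wait_for_lsn(self, target_lsn: int, timeout=None) -> bool:
        # Returns at once if the target has already been reached or passed;
        # otherwise blocks until it is reached or the timeout expires.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._applied_lsn >= target_lsn, timeout
            )


if __name__ == "__main__":
    progress = ReplayProgress()
    progress.advance(5)
    # A reader would call this right after BEGIN, before its first query.
    print(progress.wait_for_lsn(3))
    print(progress.wait_for_lsn(10, timeout=0.01))
```

The reader only pays for the wait when the node is actually behind the LSN it asks for, which is the optimistic property noted in the message.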
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, On Fri, Dec 19, 2008 at 5:50 PM, Simon Riggs wrote: > > On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote: > >> > Yes, please check the call points for ForceSyncCommit. >> > >> > Do I think every xlog flush should be synchronous, no, I don't. >> That's why we have a user settable parameter for it. >> >> Umm.. I focus attention on XLogFlush() called except >> RecordTransactionCommit(). >> For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These >> XLogFlush() might >> flush XLOG synchronously even if asynchronous commit case. > > XLogFlush() flushes because of an interlock between a dirty buffer write > and an outstanding WAL write. Dirty buffer writes are not replicated, so > there is no need to have a similar interlock on WAL streaming. > > So making those call points synchronous is possible, but neither > necessary or IMHO desirable. Yes in upcoming 8.4, but probably no in the future. What if the primary fails after writing the dirty data buffer before sending the corresponding logs? This would make data on the primary and logs on the standby inconsistent. In 8.4, such inconsistency might not matter because we don't use the data on the failed primary for recovery (when restarting the failed server, we always need a fresh backup). But, since this restriction is not good for some people, in the future, the failed server should restart without a fresh backup, and the inconsistency would be problem. So, I think that the inconsistency should be removed even if asynchronous replication case, and we should enforce "WAL rule" over some servers. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Mark Mielke wrote:
> Good answers, Markus. Thanks.

You are welcome.

> So it looks like there is value to both ends of the spectrum, and while
> I feel the most value would be in providing a very fast system that
> scales near linear to the number of nodes in the system, even at the
> expense of immediately visible transactions from all servers, I can
> accept that sometimes the expectations are stricter and would appreciate
> seeing an option to let me choose based upon my requirements.

I absolutely agree with that. The original Postgres-R algorithm covers the eager (or virtually synchronous) part. I'm planning to extend it with a (fully) synchronous mode and let the user choose per transaction.

Regards

Markus Wanner
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Josh Berkus wrote:
> Peter Eisentraut wrote:
>> It's the color of the bikeshed ...

Agreed. It's why I've decided to support various modes for Postgres-R. I'm glad to see that the current "Sync Rep" approach does the same.

> Hmmm. I thought this was pretty clear. There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a gap
> of < 100ms.

A synchronous standby which is really asynchronous? That's exactly the naming challenge I've been pointing to. Commonly used terms are: "virtually synchronous", "approximately synchronous", "near-real-time replication" or "eager replication", but for most users, this is not "synchronous" (enough).

(BTW: there's no such "< 100 ms" guarantee. It may be typically below 100 ms, or even below 10 ms on average. But replication is not about the typical or average case. It's much more about failures and uncommon cases. The guarantee you can get in such a system (by declaring a node as dead) is much more likely to be within the range of several seconds and more, be it network, disk or whatever other failure-timeout that applies here.)

> 2) Synchronous standby which guarantees that all committed transactions
> are on the failover node and that no data will be lost for failover, but
> the failover node is still in standby mode.

What's the difference from 1) here? I'm not following.

> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.

So, a synchronous standby is different from synchronous replication in that it's asynchronous? Sorry for bugging you about naming, but I think it is important for a shared understanding during development.

> Any of these levels would be useful and allow a certain number of our
> users to deploy PostgreSQL in an environment where it wasn't used
> before.

I absolutely agree with that statement. However, please do not confuse future users (and today's hackers), but instead use existing terms consistently and clearly. Something that lags behind, potentially by several seconds (in case of failure), is commonly considered asynchronous, no matter how close to "immediate" it is on average.

Regards

Markus Wanner
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, Mark Mielke wrote: > Robert Haas wrote: >> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane wrote: >>> We won't call it anything, because we never will or can implement that. >>> See the theory of relativity: the notion of exactly simultaneous events >> >> OK, fine. I'll be more precise. I think we need to reserve the term >> "synchronous replication" for a system where transactions that begin >> on the standby after the transactions has committed on the master see >> the effects of the committed transaction. I agree with Robert here. As far as I know this is the common understanding of "synchronous replication". Everything less - including Postgres-R - is considered to be asynchronous. > I'd like to see proof of some sort that PostgreSQL guarantees that the > instant a 'commit' returns, any transactions already open with the > appropriate transaction isolation level, or any new sessions *will* see > the results of the commit. Given within this thread, here [1]. > Two phase commit doesn't imply that the transaction is guaranteed to be > immediately visible. Just for the record: that's plain wrong. As with any other transaction, a COMMIT of a prepared transaction guarantees visibility from all subsequent snapshots (at least for Postgres and other serious RDBSen). Systems based on 2PC are the typical synchronous replication solution: works, resistant to failures, consistent across nodes (WRT visibility), but unusably slow. This is what people have in mind and expect when they hear "synchronous replication" for databases. (And which is why I'm thinking it's better for an optimized solution not to call itself "synchronous"). > Unless transactions are > locked from starting until they are able to prove that they have the > latest commit See the cited README. It already happens for (single node) Postgres systems, because the action of snapshot taking and committing are serialized. 
> (a feat which I'm going to theorize as impossible - > because the moment you wait for a commit, and you begin again, you > really have no guarantee that another commit has not occurred in the > mean time) This problem is solved by locking. Regards Markus Wanner [1]: Hints to docs and source, that COMMIT actually ensures subsequent snapshots "include" changes of the committed transaction: http://archives.postgresql.org/message-id/494c.2060...@bluegap.ch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Good answers, Markus. Thanks.

I've bought the thinking of several here that the user should have some control over what they expect (and what optimizations they are willing to accept as a good choice), but that commit should still be able to have a capped time limit.

I can think of many of my own applications where I would choose one mode vs another mode, even within the same application, depending on the operation itself. The most important requirement is that transactions are durable. It becomes convenient, though, to provide additional guarantees for some operation sequences.

I still see the requirement for seat reservation, bank account, or stock trading as synchronizing using read-write locks before starting the select, rather than enforcing latest on every select.

For my own bank, when I do an online transaction, operations don't always immediately appear in my list of transactions. They appear to sometimes be batched, sometimes in near real time, and sometimes as part of some sort of day end processing.

For seat reservation, the time the seat layout is shown on the screen is not usually locked during a transaction. Between the time the travel agent brings up the seats on the plane, and the time they select the seat, the seat could be taken. What's important is that the reservation is durable, and that conflicts are not introduced. The commit must fail if another person has already chosen the seat. The commit does not need to wait until the reservation is pushed out to all systems before completing.

The same is true of stock trading.

However, it can be very convenient for commits to be immediately visible after the commit completes. This allows for lazier models, such as a web site that reloads the view on the reservations or recent trades and expects to see recent commits no matter which server it accesses, rather than taking into account that the commit succeeded when presenting the next view.
If I look at sites like Google - they take the opposite extreme. I can post a message, and it remembers that I posted the message and makes it immediately visible, however, I might not see other new messages in a thread until a minute or more later. So it looks like there is value to both ends of the spectrum, and while I feel the most value would be in providing a very fast system that scales near linear to the number of nodes in the system, even at the expense of immediately visible transactions from all servers, I can accept that sometimes the expectations are stricter and would appreciate seeing an option to let me choose based upon my requirements. Cheers, mark Markus Wanner wrote: Hi, Mark Mielke wrote: Where does the expectation come from? I find the seat reservation, bank account or stock trading examples pretty obvious WRT user expectations. Nonetheless, I've compiled some hints from the documentation and sources: "Since in Read Committed mode each new command starts with a new snapshot that includes all transactions committed up to that instant" [1]. "This [SERIALIZABLE ISOLATION] level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently." [1]. (IMO this implies, that a transaction "sees" changes from all preceding transactions). "All changes made by the transaction become visible to others and are guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not overly clear here, when exactly the changes become visible. OTOH, there's no warning, that another session doesn't immediately see committed transactions. Not sure where you got that from). I don't recall ever reading it in the documentation, and unless the session processes are contending over the integers (using some sort of synchronization primitive) in memory that represent the "latest visible commit" on every single select, I'm wondering how it is accomplished? See the transaction system's README [3]. 
It documents the process of snapshot taking and transaction isolation pretty well. Around line 226 it says: "What we actually enforce is strict serialization of commits and rollbacks with snapshot-taking". (So the outcome of your experiment is no surprise at all). And a bit later: "This rule is stronger than necessary for consistency, but is relatively simple to enforce, and it assists with some other issues as explained below.". While this implies, that an optimization is theoretically possible, I very much doubt it would be worth it (for a single node system). In a distributed system, things are a bit different. Network latency is an order of magnitude higher than memory latency (for IPC). So a similar optimization is very well worth it. However, the application (or the load balancer or both) need to know about this potential lag between nodes. And as you've outlined elsewhere, a limit for how much a single node may lag behind needs to be established. (As a side not
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Mark Mielke wrote:
> Where does the expectation come from?

I find the seat reservation, bank account or stock trading examples pretty obvious WRT user expectations.

Nonetheless, I've compiled some hints from the documentation and sources:

"Since in Read Committed mode each new command starts with a new snapshot that includes all transactions committed up to that instant" [1].

"This [SERIALIZABLE ISOLATION] level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently." [1]. (IMO this implies, that a transaction "sees" changes from all preceding transactions).

"All changes made by the transaction become visible to others and are guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not overly clear here, when exactly the changes become visible. OTOH, there's no warning, that another session doesn't immediately see committed transactions. Not sure where you got that from).

> I don't recall ever reading it in
> the documentation, and unless the session processes are contending over
> the integers (using some sort of synchronization primitive) in memory
> that represent the "latest visible commit" on every single select, I'm
> wondering how it is accomplished?

See the transaction system's README [3]. It documents the process of snapshot taking and transaction isolation pretty well. Around line 226 it says: "What we actually enforce is strict serialization of commits and rollbacks with snapshot-taking". (So the outcome of your experiment is no surprise at all). And a bit later: "This rule is stronger than necessary for consistency, but is relatively simple to enforce, and it assists with some other issues as explained below.". While this implies, that an optimization is theoretically possible, I very much doubt it would be worth it (for a single node system).

In a distributed system, things are a bit different. Network latency is an order of magnitude higher than memory latency (for IPC). So a similar optimization is very well worth it. However, the application (or the load balancer or both) need to know about this potential lag between nodes. And as you've outlined elsewhere, a limit for how much a single node may lag behind needs to be established.

(As a side note: for a multi-master system like Postgres-R, it's beneficial to keep the lag time as low as possible, because the larger the lag, the higher the probability for a conflict between two transactions on different nodes.)

Regards

Markus Wanner

[1]: Pg 8.3 Docu: Concurrency Control:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html

[2]: Pg 8.3 Docu: COMMIT command:
http://www.postgresql.org/docs/8.3/static/sql-commit.html

[3]: README of transam (src/backend/access/transam/README):
https://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/backend/access/transam/README#L224
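
The rule quoted from the README (commits are strictly serialized with snapshot-taking) can be pictured with a single lock. This is a deliberately simplified sketch and not PostgreSQL's actual data structures: `TransactionManager`, `commit` and `take_snapshot` are hypothetical names, and a snapshot is modelled as the set of committed transaction ids.

```python
import threading


class TransactionManager:
    """Single-node toy model: one lock serializes commits and snapshots."""

    def __init__(self):
        self._lock = threading.Lock()
        self._committed = set()

    def commit(self, xid: int) -> None:
        # A commit becomes visible atomically with respect to snapshots:
        # once this returns, no later snapshot can miss the transaction.
        with self._lock:
            self._committed.add(xid)

    def take_snapshot(self) -> frozenset:
        # Taken under the same lock, so it observes every commit that
        # returned earlier and none that is still in progress.
        with self._lock:
            return frozenset(self._committed)


if __name__ == "__main__":
    tm = TransactionManager()
    tm.commit(1)
    snapshot = tm.take_snapshot()
    tm.commit(2)
    print(1 in snapshot, 2 in snapshot, 2 in tm.take_snapshot())
```

On a single node the lock is a memory operation, so the guarantee is cheap. Across nodes the same guarantee costs a network round trip, which is why the message treats relaxing it as worthwhile only in the distributed case.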
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Fri, 2008-12-19 at 11:04 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On a related but different point: We don't need an interlock between
> > dirty buffers and WAL during recovery because the WAL has already been
> > written.
>
> Assuming the WAL has also been fsync'd.

True, so this will need to change for 8.4 also.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Simon Riggs wrote:
> On a related but different point: We don't need an interlock between
> dirty buffers and WAL during recovery because the WAL has already been
> written.

Assuming the WAL has also been fsync'd.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote:
> > Yes, please check the call points for ForceSyncCommit.
> >
> > Do I think every xlog flush should be synchronous, no, I don't.
> > That's why we have a user settable parameter for it.
>
> Umm.. I focus attention on XLogFlush() called except
> RecordTransactionCommit().
> For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These
> XLogFlush() might flush XLOG synchronously even if asynchronous
> commit case.

XLogFlush() flushes because of an interlock between a dirty buffer write and an outstanding WAL write. Dirty buffer writes are not replicated, so there is no need to have a similar interlock on WAL streaming.

So making those call points synchronous is possible, but neither necessary nor IMHO desirable.

On a related but different point: We don't need an interlock between dirty buffers and WAL during recovery because the WAL has already been written.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
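
The interlock described here is the usual write-ahead rule: a dirty page may reach disk only after the WAL up to that page's LSN is durable. A minimal sketch of that rule follows; the `Wal` class and `flush_buffer` function are hypothetical stand-ins written for this illustration and do not reproduce the real XLogFlush() or FlushBuffer() code.

```python
class Wal:
    """Toy WAL: tracks what has been generated and what is durable."""

    def __init__(self):
        self.inserted_lsn = 0   # last LSN generated in memory
        self.flushed_lsn = 0    # last LSN known to be on disk
        self.flush_calls = 0    # number of flushes that did real work

    def insert(self) -> int:
        self.inserted_lsn += 1
        return self.inserted_lsn

    def flush(self, upto: int) -> None:
        # No-op when the requested position is already durable.
        if upto > self.flushed_lsn:
            self.flushed_lsn = upto
            self.flush_calls += 1


def flush_buffer(wal: Wal, disk: dict, page_id: str, page_lsn: int) -> None:
    # Interlock: make the WAL durable up to the page's LSN first ...
    wal.flush(page_lsn)
    assert wal.flushed_lsn >= page_lsn
    # ... and only then let the page itself reach disk.
    disk[page_id] = page_lsn


if __name__ == "__main__":
    wal, disk = Wal(), {}
    lsn = wal.insert()
    flush_buffer(wal, disk, "A", lsn)
    print(wal.flushed_lsn, disk, wal.flush_calls)
```

The interlock exists to protect the local data files. The open question in the thread is whether the same ordering should also be enforced against the standby's copy of the WAL, which this local sketch does not model.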
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, On Thu, Dec 18, 2008 at 6:35 PM, Simon Riggs wrote: > > On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote: > >> >> Agreed, I also think that hard code is better. But I'm nervous that "off" >> >> keeps us waiting for replication in cases other than DDL, e.g. flush >> >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off >> >> is quite similar to synchronous_commit = off. If we would hard code #4, >> >> the performance might degrade although it's asynchronous replication. >> >> So, I'd like to hard code #3. What is your opinion? >> > >> > We don't do that when we flush buffer, truncate clog or checkpoint, not >> > sure why you mention those. >> > >> > We ForceSyncCommit when we >> > * VACUUM FULL >> > * CREATE/DROP DATABASE or USER >> > * Create/Drop Tablespace >> > >> > I don't see a problem in forcing an fsync for those. I will sleep safer >> > knowing those guys are on disk even in async mode. >> >> If my understanding is correct, XLOG flush is forced up to buffer's LSN >> when flushing buffer even if asynchronous commit case. Am I missing >> something? > > Yes, please check the call points for ForceSyncCommit. > > Do I think every xlog flush should be synchronous, no, I don't. That's > why we have a user settable parameter for it. Umm.. I focus attention on XLogFlush() called except RecordTransactionCommit(). For example, FlushBuffer(), WriteTruncateXlogRec().. etc. These XLogFlush() might flush XLOG synchronously even if asynchronous commit case. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote: > >> Agreed, I also think that hard code is better. But I'm nervous that "off" > >> keeps us waiting for replication in cases other than DDL, e.g. flush > >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off > >> is quite similar to synchronous_commit = off. If we would hard code #4, > >> the performance might degrade although it's asynchronous replication. > >> So, I'd like to hard code #3. What is your opinion? > > > > We don't do that when we flush buffer, truncate clog or checkpoint, not > > sure why you mention those. > > > > We ForceSyncCommit when we > > * VACUUM FULL > > * CREATE/DROP DATABASE or USER > > * Create/Drop Tablespace > > > > I don't see a problem in forcing an fsync for those. I will sleep safer > > knowing those guys are on disk even in async mode. > > If my understanding is correct, XLOG flush is forced up to buffer's LSN > when flushing buffer even if asynchronous commit case. Am I missing > something? Yes, please check the call points for ForceSyncCommit. Do I think every xlog flush should be synchronous, no, I don't. That's why we have a user settable parameter for it. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, On Thu, Dec 18, 2008 at 11:19 AM, Simon Riggs wrote: > > On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote: >> Hi, >> >> Thanks for the helpful comments! >> >> On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs wrote: >> > >> > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: >> > >> >> OK. I will extend synchronous_replication, make walsender send XLOG >> >> with synchronization mode flag and make walreceiver perform according >> >> to the flag. >> > >> > Sounds good. >> > >> >> > My perspective is that synchronous_replication specifies how long to >> >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait >> >> > until point #3). So I think we should change this to a list of options >> >> > to allow people to more carefully select how much waiting is required. >> >> >> >> In the latest patch, "off" keeps us waiting for replication in some >> >> cases, e.g. forceSyncCommit = true. This is analogous to the way >> >> synchronous_commit works. When "off" keeps us waiting for >> >> replication, which option (#1-#6) should we choose? Should it be >> >> user-configurable (though the parameter values are doubled)? >> >> hardcode #3? "off" always should not keep us waiting for >> >> replication? >> > >> > I would hard code #4, i.e. make it fsync, so that DDL changes are >> > regarded as "high value transactions". >> > >> > A parameter sounds like overkill. We'd need to explain what >> > forceSyncCommit does to users then, which is easier to avoid. >> >> Agreed, I also think that hard code is better. But I'm nervous that "off" >> keeps us waiting for replication in cases other than DDL, e.g. flush >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off >> is quite similar to synchronous_commit = off. If we would hard code #4, >> the performance might degrade although it's asynchronous replication. >> So, I'd like to hard code #3. What is your opinion? 
>
> We don't do that when we flush buffer, truncate clog or checkpoint, not
> sure why you mention those.
>
> We ForceSyncCommit when we
> * VACUUM FULL
> * CREATE/DROP DATABASE or USER
> * Create/Drop Tablespace
>
> I don't see a problem in forcing an fsync for those. I will sleep safer
> knowing those guys are on disk even in async mode.

If my understanding is correct, an XLOG flush is forced up to the buffer's LSN when flushing a buffer, even in the asynchronous commit case. Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote: > Hi, > > Thanks for the helpful comments! > > On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs wrote: > > > > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > > > >> OK. I will extend synchronous_replication, make walsender send XLOG > >> with synchronization mode flag and make walreceiver perform according > >> to the flag. > > > > Sounds good. > > > >> > My perspective is that synchronous_replication specifies how long to > >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait > >> > until point #3). So I think we should change this to a list of options > >> > to allow people to more carefully select how much waiting is required. > >> > >> In the latest patch, "off" keeps us waiting for replication in some > >> cases, e.g. forceSyncCommit = true. This is analogous to the way > >> synchronous_commit works. When "off" keeps us waiting for > >> replication, which option (#1-#6) should we choose? Should it be > >> user-configurable (though the parameter values are doubled)? > >> hardcode #3? "off" always should not keep us waiting for > >> replication? > > > > I would hard code #4, i.e. make it fsync, so that DDL changes are > > regarded as "high value transactions". > > > > A parameter sounds like overkill. We'd need to explain what > > forceSyncCommit does to users then, which is easier to avoid. > > Agreed, I also think that hard code is better. But I'm nervous that "off" > keeps us waiting for replication in cases other than DDL, e.g. flush > buffer, truncate clog, checkpoint.. etc. synchronous_replication = off > is quite similar to synchronous_commit = off. If we would hard code #4, > the performance might degrade although it's asynchronous replication. > So, I'd like to hard code #3. What is your opinion? We don't do that when we flush buffer, truncate clog or checkpoint, not sure why you mention those. 
We ForceSyncCommit when we
* VACUUM FULL
* CREATE/DROP DATABASE or USER
* Create/Drop Tablespace

I don't see a problem in forcing an fsync for those. I will sleep safer knowing those guys are on disk even in async mode.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, Thanks for the helpful comments! On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs wrote: > > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > >> OK. I will extend synchronous_replication, make walsender send XLOG >> with synchronization mode flag and make walreceiver perform according >> to the flag. > > Sounds good. > >> > My perspective is that synchronous_replication specifies how long to >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait >> > until point #3). So I think we should change this to a list of options >> > to allow people to more carefully select how much waiting is required. >> >> In the latest patch, "off" keeps us waiting for replication in some >> cases, e.g. forceSyncCommit = true. This is analogous to the way >> synchronous_commit works. When "off" keeps us waiting for >> replication, which option (#1-#6) should we choose? Should it be >> user-configurable (though the parameter values are doubled)? >> hardcode #3? "off" always should not keep us waiting for >> replication? > > I would hard code #4, i.e. make it fsync, so that DDL changes are > regarded as "high value transactions". > > A parameter sounds like overkill. We'd need to explain what > forceSyncCommit does to users then, which is easier to avoid. Agreed, I also think that hard code is better. But I'm nervous that "off" keeps us waiting for replication in cases other than DDL, e.g. flush buffer, truncate clog, checkpoint.. etc. synchronous_replication = off is quite similar to synchronous_commit = off. If we would hard code #4, the performance might degrade although it's asynchronous replication. So, I'd like to hard code #3. What is your opinion? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote: > OK. I will extend synchronous_replication, make walsender send XLOG > with synchronization mode flag and make walreceiver perform according > to the flag. Sounds good. > > My perspective is that synchronous_replication specifies how long to > > wait. Current settings are "off" (don't wait) or "on" (meaning wait > > until point #3). So I think we should change this to a list of options > > to allow people to more carefully select how much waiting is required. > > In the latest patch, "off" keeps us waiting for replication in some > cases, e.g. forceSyncCommit = true. This is analogous to the way > synchronous_commit works. When "off" keeps us waiting for > replication, which option (#1-#6) should we choose? Should it be > user-configurable (though the parameter values are doubled)? > hardcode #3? "off" always should not keep us waiting for > replication? I would hard code #4, i.e. make it fsync, so that DDL changes are regarded as "high value transactions". A parameter sounds like overkill. We'd need to explain what forceSyncCommit does to users then, which is easier to avoid. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, On Tue, Dec 16, 2008 at 7:21 PM, Simon Riggs wrote: > > On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote: > >> > So from my previous list >> > >> > 1. We sent the message to standby (A) >> > 2. We received the message on standby >> > 3. We wrote the WAL to the WAL file (B) >> > 4. We fsync'd the WAL file (C) >> > 5. We CRC checked the WAL commit record >> > 6. We applied the WAL commit record >> > >> > Please could you also add an option #4, i.e. add the *option* to fsync >> > the WAL to disk at commit time also. That requires us to add a third >> > option to synchronous_replication parameter. >> >> The above option should be configured on the primary? or standby? >> The primary is suitable to vary it from transaction to transaction. On >> the other hand, it should be configured on the standby in order to >> choose it for every standby (in the future). >> >> I prefer the latter, and thought that it should be added into recovery.conf. >> I mean, synchronous_replication identifies only whether commit waits for >> replication (if the name is confusing, I would rename it). The above >> options (#1-#6) are chosen in recovery.conf. What is your opion? > > No, we've been through that loop already a few months back: > Transaction-controlled robustness. > > It should be up to the client on the primary to decide how much waiting > they would like to perform in order to provide a guarantee. A change of > setting on the standby should not be allowed to alter the performance or > durability on the primary. OK. I will extend synchronous_replication, make walsender send XLOG with synchronization mode flag and make walreceiver perform according to the flag. > > My perspective is that synchronous_replication specifies how long to > wait. Current settings are "off" (don't wait) or "on" (meaning wait > until point #3). So I think we should change this to a list of options > to allow people to more carefully select how much waiting is required. 
In the latest patch, "off" keeps us waiting for replication in some cases, e.g. forceSyncCommit = true. This is analogous to the way synchronous_commit works. When "off" keeps us waiting for replication, which option (#1-#6) should we choose? Should it be user-configurable (though that would double the number of parameter values)? Should we hard code #3? Or should "off" never keep us waiting for replication?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote: > > So from my previous list > > > > 1. We sent the message to standby (A) > > 2. We received the message on standby > > 3. We wrote the WAL to the WAL file (B) > > 4. We fsync'd the WAL file (C) > > 5. We CRC checked the WAL commit record > > 6. We applied the WAL commit record > > > > Please could you also add an option #4, i.e. add the *option* to fsync > > the WAL to disk at commit time also. That requires us to add a third > > option to synchronous_replication parameter. > > The above option should be configured on the primary? or standby? > The primary is suitable to vary it from transaction to transaction. On > the other hand, it should be configured on the standby in order to > choose it for every standby (in the future). > > I prefer the latter, and thought that it should be added into recovery.conf. > I mean, synchronous_replication identifies only whether commit waits for > replication (if the name is confusing, I would rename it). The above > options (#1-#6) are chosen in recovery.conf. What is your opion? No, we've been through that loop already a few months back: Transaction-controlled robustness. It should be up to the client on the primary to decide how much waiting they would like to perform in order to provide a guarantee. A change of setting on the standby should not be allowed to alter the performance or durability on the primary. My perspective is that synchronous_replication specifies how long to wait. Current settings are "off" (don't wait) or "on" (meaning wait until point #3). So I think we should change this to a list of options to allow people to more carefully select how much waiting is required. This feature is then analogous to the way synchronous_commit works. It also provides a level of application control not seen in any other RDBMS in the industry, which makes it very suitable for large and important applications that need a fine mix of robustness and performance. 
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
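The proposal to turn synchronous_replication into a list of wait points can be sketched as follows. This is a minimal Python model, not the patch: the names `SyncPoint` and `commit_may_return` are hypothetical, and it assumes the standby passes through points #1 to #6 in order, so that reaching a later point implies all earlier ones.

```python
from enum import IntEnum


class SyncPoint(IntEnum):
    """The six candidate wait points from the list in this thread."""
    SENT = 1         # message sent to standby (DRBD A)
    RECEIVED = 2     # message received on standby
    WRITTEN = 3      # WAL written to the WAL file (DRBD B)
    FSYNCED = 4      # WAL file fsync'd (DRBD C)
    CRC_CHECKED = 5  # commit record CRC checked
    APPLIED = 6      # commit record applied


def commit_may_return(wait_for, standby_progress):
    """True once the standby has reached the point this transaction asked for.

    wait_for is None for "off" (do not wait for the standby at all).
    """
    if wait_for is None:
        return True
    return standby_progress >= wait_for


# A transaction that asks for #4 keeps waiting while the standby is only at #3.
print(commit_may_return(SyncPoint.FSYNCED, SyncPoint.WRITTEN))
```

Because `wait_for` is an argument of each commit rather than a property of the standby, the choice stays with the client on the primary, which is the point argued above.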
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Sorry for this late reply. And, thanks for the hot discussion ;)

On Tue, Dec 16, 2008 at 1:24 AM, Simon Riggs wrote:
>
> Fujii-san,
>
> Just repeating this in case you lost this comment:
>
> On Mon, 2008-12-15 at 09:40 +, Simon Riggs wrote:
>
>> Fujii-san, please can we incorporate those two options, rather than just
>> one choice "synchronous_replication = on". They look like two commonly
>> requested options.
>
> I see the comment in line 230+ of walreceiver.c, so understand that you
> have implemented option #3 from the following list.
>
> So from my previous list
>
> 1. We sent the message to standby (A)
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file (B)
> 4. We fsync'd the WAL file (C)
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record
>
> Please could you also add an option #4, i.e. add the *option* to fsync
> the WAL to disk at commit time also. That requires us to add a third
> option to synchronous_replication parameter.

Should the above option be configured on the primary or on the standby? Configuring it on the primary makes it possible to vary it from transaction to transaction. On the other hand, configuring it on the standby would let us choose it separately for each standby (in the future).

I prefer the latter, and thought that it should be added to recovery.conf. I mean, synchronous_replication would identify only whether commit waits for replication (if the name is confusing, I would rename it), and the above options (#1-#6) would be chosen in recovery.conf. What is your opinion?

>> #6 is an additional synchronization step in Hot Standby. I would say
>> that people won't want that when they see how it performs (they probably
>> won't want #4 either for that same reason, but that is for robustness).
>
> We can jointly add option #6 once we have both sync rep and hot standby
> committed, or at a late stage of hot standby development. There's not
> much point looking at it before then.

Agreed.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Mon, 2008-12-15 at 13:06 -0800, Josh Berkus wrote: > Peter Eisentraut wrote: > > Simon Riggs wrote: > >> I am truly lost to understand why the *name* "synchronous replication" > >> causes so much discussion, yet nobody has discussed what they would > >> actually like the software to *do* > > > > It's the color of the bikeshed ... > > Hmmm. I thought this was pretty clear. There's three levels of synch > which are useful features: > > 1) "synchronus" standby which is really asynchronous, but only has a gap > of < 100ms. > > 2) Synchronous standby which guarentees that all committed transactions > are on the failover node and that no data will be lost for failover, but > the failover node is still in standby mode. > > 3) Synchronous replication where the standby node has identical > transactions to the master node, and is queryable read-only. > Any of these levels would be useful and allow a certain number of our > users to deploy PostgreSQL in an environment where it wasn't used > before. So if we can only do (2) for 8.4, that's still very useful for > telecoms and banks. The (2) mentioned here could be any of sync points #2-5 referred to upthread. Different people have requested different levels of robustness. Looking at DRBD and Oracle, they both subdivide (2) into at least two further levels of option. So (2) is too broad a brush to paint with. I don't believe that (2) as stated is sufficient for banks, though is reasonable for many telco applications. But #4 or #5 would be suitable for banks, i.e. we must fsync to disk for very high value transactions. The extra code to do this is minor, which is why I've asked Fujii-san to include it now within the patch. All of this is controllable by the parameter synchronous_replication, which it is important to note can be set for each individual transaction rather than just fixed for the whole server. This is identical to the way we can mix synchronous commit and asynchronous commit transactions. 
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Simon,

> I've explained this twice now on different parts of this thread. Could I
> politely direct your attention to those posts?

Chill. I was just explaining that the *goal* of sync standby was not complicated or really something to be argued about. It's pretty clear.

--Josh
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Mon, 2008-12-15 at 13:43 -0800, Josh Berkus wrote: > > Isn't the "queryable read-only" feature totally orthogonal with > > how synchronous the replication is? > > Yes. However, it introduces specific difficult issues which an > unreadable synchronous slave does not have. Don't think it's hugely difficult, but there are multiple ways of doing this. But it is irrelevant until we have the basic ability to run queries. I've explained this twice now on different parts of this thread. Could I politely direct your attention to those posts? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
> Isn't the "queryable read-only" feature totally orthogonal with
> how synchronous the replication is?

Yes. However, it introduces specific difficult issues which an unreadable synchronous slave does not have.

--Josh
Re: [HACKERS] Sync Rep: First Thoughts on Code
Josh Berkus wrote:
> Hmmm. I thought this was pretty clear. There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a gap
> of < 100ms.
>
> 2) Synchronous standby which guarantees that all committed transactions
> are on the failover node and that no data will be lost for failover, but
> the failover node is still in standby mode.
>
> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.
>
> Any of these levels would be useful

Isn't the "queryable read-only" feature totally orthogonal with how synchronous the replication is?

For one reporting system I have, where new data is continually being added every second, I'd love to have a read-only slave even if that system has the "100ms" gap you mentioned in #1. Heck, I don't care if the queries it runs even have a 100 *minute* gap; but I sure would like it to be synchronous in the sense that all the transactions survive a failure of the primary.
Re: [HACKERS] Sync Rep: First Thoughts on Code
Peter Eisentraut wrote:
> Simon Riggs wrote:
>> I am truly lost to understand why the *name* "synchronous replication"
>> causes so much discussion, yet nobody has discussed what they would
>> actually like the software to *do*
>
> It's the color of the bikeshed ...

Hmmm. I thought this was pretty clear. There are three levels of synch which are useful features:

1) "Synchronous" standby which is really asynchronous, but only has a gap of < 100ms.

2) Synchronous standby which guarantees that all committed transactions are on the failover node and that no data will be lost for failover, but the failover node is still in standby mode.

3) Synchronous replication where the standby node has identical transactions to the master node, and is queryable read-only.

Any of these levels would be useful and allow a certain number of our users to deploy PostgreSQL in an environment where it wasn't used before. So if we can only do (2) for 8.4, that's still very useful for telecoms and banks.

--Josh
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Mon, 2008-12-15 at 09:19 -0500, Robert Haas wrote: > I understand you're point, but I think there's still a use case. The > idea is that declaring the secondary dead is a rare event, and there's > some mechanism by which you're enabled to page your network staff, and > they hightail it into the office to fix the problem. It might not be > the way that you want to run your system, but I don't think it's > unreasonable for someone else to want it. > Agreed: there's an analogy to RAID here. When a disk goes out, it still allows writes, but moves to a degraded state. Hopefully your monitoring system notifies you, and you fix it. Also, let's say that the standby suffers catastrophic storage failure. Now you only have your data on one server anyway (the primary). Rejecting new transactions from committing doesn't save all the old transactions in the event of a subsequent storage failure on the primary. I'm not advocating this option in particular, other than saying that it seems like a reasonable option to me. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
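The "degraded state" behaviour discussed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Primary` class, its `timeout_s` setting and the string results are hypothetical, the standby's reply time is passed in as a number rather than measured, and the sketch assumes the primary stays degraded until an operator intervenes.

```python
class Primary:
    """Toy primary that waits for the standby, but only up to a timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.degraded = False  # True once the standby has been declared dead

    def commit(self, standby_ack_delay_s):
        """Return how the commit was acknowledged.

        standby_ack_delay_s is the simulated time the standby takes to
        reply, or None if it never replies.
        """
        if self.degraded:
            return "local-only"
        if standby_ack_delay_s is not None and standby_ack_delay_s <= self.timeout_s:
            return "replicated"
        # Like a RAID set losing a disk: keep accepting writes, flag the state.
        self.degraded = True
        return "local-only"


primary = Primary(timeout_s=5.0)
first = primary.commit(0.1)    # standby answers in time
second = primary.commit(None)  # standby is gone; commit proceeds locally
```

The `degraded` flag is what a monitoring system would watch in order to page someone, as described above; a "local-only" result gives no guarantee about the standby.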
Re: [HACKERS] Sync Rep: First Thoughts on Code
Fujii-san, Just repeating this in case you lost this comment: On Mon, 2008-12-15 at 09:40 +, Simon Riggs wrote: > Fujii-san, please can we incorporate those two options, rather than just > one choice "synchronous_replication = on". They look like two commonly > requested options. I see the comment in line 230+ of walreceiver.c, so understand that you have implemented option #3 from the following list. So from my previous list 1. We sent the message to standby (A) 2. We received the message on standby 3. We wrote the WAL to the WAL file (B) 4. We fsync'd the WAL file (C) 5. We CRC checked the WAL commit record 6. We applied the WAL commit record Please could you also add an option #4, i.e. add the *option* to fsync the WAL to disk at commit time also. That requires us to add a third option to synchronous_replication parameter. That then means we will have robustness options that map directly to DRBD algorithms A, B and C (shown in brackets in the above list). I believe these map also to Data Guard options Maximum Performance and Maximum Availability. AFAICS if we implement the additional items I've requested over the last few days, then the architecture is now at a good point for 8.4 and we can begin to look at low level implementation details. Or put another way, I'm not expecting to come up with more architecture changes. > #6 is an additional synchronization step in Hot Standby. I would say > that people won't want that when they see how it performs (they probably > won't want #4 either for that same reason, but that is for robustness). We can jointly add option #6 once we have both sync rep and hot standby committed, or at a late stage of hot standby development. There's not much point looking at it before then. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
> So you'd want all commits to wait until the transaction is safely replicated
> in the standby. But if there's a network glitch, or the standby is
> restarted, you're happy to reply to the client that it's committed if it's
> only safely committed in the primary. Essentially, you wait for the reply as
> long as the standby responds within X seconds, but if it takes more than Y
> seconds, you don't wait. I know that people do that, but it seems
> counterintuitive to me. In that case, when the primary acks the transaction
> as committed, you only know that it's safely committed in the primary; it
> doesn't give any hard guarantee about the state in the standby.

I understand your point, but I think there's still a use case. The idea is that declaring the secondary dead is a rare event, and there's some mechanism by which you're enabled to page your network staff, and they hightail it into the office to fix the problem. It might not be the way that you want to run your system, but I don't think it's unreasonable for someone else to want it.

> But when you consider the possibility to use the standby for queries, the
> synchronous mode makes sense too.
> I'm not opposed to providing all the options, but the synchronous mode where
> we can guarantee that if you query the standby, you will see the effects of
> all transactions committed in the primary, makes the synchronous mode much
> more interesting. If you don't need that property, you're most likely more
> happy with asynchronous mode anyway.

I agree that asynchronous mode will be the right solution for a very large subset of our users.

...Robert
Re: [HACKERS] Sync Rep: First Thoughts on Code
* Robert Haas [081215 07:32]:
> > In fact, waiting for reply from standby server before acknowledging a commit
> > to the client is a bit pointless otherwise. It puts you in a strange
> > situation, where you're waiting for the commits in normal operation, but if
> > there's a network glitch or the standby goes down, you're willing to go
> > ahead without it. You get a high guarantee that your data is up-to-date in
> > the standby, except when it isn't. Which isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration. At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.

This was exactly my original point - I want the transaction durably on the slave before the commit is acknowledged (to build as much local redundancy as I can), but I certainly *don't* want to lose the ability to use WAL archiving, because I ship my WAL off-site too...

The ability to have an extra local copy is good. But I'm certainly not going to want to give up my off-site backup/WAL for it...

a.

--
Aidan Van Dyk                Create like a god,
ai...@highrise.ca            command like a king,
http://www.highrise.ca/      work like a slave.
Re: [HACKERS] Sync Rep: First Thoughts on Code
Robert Haas wrote:
>> In fact, waiting for reply from standby server before acknowledging a commit
>> to the client is a bit pointless otherwise. It puts you in a strange
>> situation, where you're waiting for the commits in normal operation, but if
>> there's a network glitch or the standby goes down, you're willing to go
>> ahead without it. You get a high guarantee that your data is up-to-date in
>> the standby, except when it isn't. Which isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration. At the risk of being ghoulish, imagine that you
> are a large financial company headquartered in the world trade center.

So you'd want all commits to wait until the transaction is safely replicated in the standby. But if there's a network glitch, or the standby is restarted, you're happy to reply to the client that it's committed if it's only safely committed in the primary. Essentially, you wait for the reply as long as the standby responds within X seconds, but if it takes more than Y seconds, you don't wait. I know that people do that, but it seems counterintuitive to me. In that case, when the primary acks the transaction as committed, you only know that it's safely committed in the primary; it doesn't give any hard guarantee about the state in the standby.

But when you consider the possibility to use the standby for queries, the synchronous mode makes sense too.

I'm not opposed to providing all the options, but the synchronous mode where we can guarantee that if you query the standby, you will see the effects of all transactions committed in the primary, makes the synchronous mode much more interesting. If you don't need that property, you're most likely more happy with asynchronous mode anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Sync Rep: First Thoughts on Code
It's a real promise. The reason you're getting hand-wavy answers is that it's such a basic requirement; I'm trying to point out just how fundamental it is.

If you want to see the actual code which guarantees this, take a look around the code for procarray - in particular the code for taking a snapshot. There are comments there about what locks are needed when committing and when taking a snapshot, and why. But it's quite technical.

-- Greg

On 15 Dec 2008, at 02:03, Mark Mielke wrote:
> Greg Stark wrote:
> > When the database says the data is committed it has to mean the data is really committed. Imagine if you looked at a bank account balance after withdrawing all the money and saw a balance which didn't reflect the withdrawal and allowed you to withdraw more money again...
>
> Within the same session - sure. From different sessions? PostgreSQL MVCC lets you see an older snapshot, although it does prefer to have the latest snapshot with each command.
>
> To prevent withdrawing more money again, I would expect some sort of locking such as "SELECT ... FOR UPDATE;" to be used. This lock then forces the two transactions to become serialized, and the second will either wait for the first to complete or fail. Any banking program that assumed that it could SELECT to confirm a balance and then UPDATE to withdraw the money as separate instructions would be a bad banking program. To exploit it, I would just have to start both operations at the same time - they both SELECT, they both see I have money, they both give me the money and UPDATE, and I get double the money (although my balance would show a big negative value - but I'm already gone...). Database 101.
>
> When I asked "does PostgreSQL guarantee this?" I didn't mean hand-waving examples or hand-waving expectations. I meant a pointer into the code that has some comment that says "we want to guarantee that a commit in one session will be immediately visible to other sessions, and that a later select issued in the other sessions will ALWAYS see the commit whether 1 nanosecond later or 200 seconds later". Robert's expectation and yours seem like taking this "guarantee" for granted rather than being justified with design intent and proof thus far. :-)
>
> Given my experiment to try and force it to fail, I can see why this would be taken for granted. Is this a real promise, though? Or just an unlikely scenario that never seems to be hit?
>
> To me, the question is relevant in terms of the expectations of a multi-replica solution. We know people have the expectation. We know it can be convenient. Is the expectation valid in the first place? I've probably drawn this question out too long and should do my own research and report back... Sorry... :-)
>
> Cheers,
> mark
>
> --
> Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
> In fact, waiting for reply from standby server before acknowledging a commit to the client is a bit pointless otherwise. It puts you in a strange situation, where you're waiting for the commits in normal operation, but if there's a network glitch or the standby goes down, you're willing to go ahead without it. You get a high guarantee that your data is up-to-date in the standby, except when it isn't. Which isn't much of a guarantee.

It protects you against a catastrophic loss of the primary, which is a non-trivial consideration. At the risk of being ghoulish, imagine that you are a large financial company headquartered in the World Trade Center.

...Robert
Re: [HACKERS] Sync Rep: First Thoughts on Code
Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication" causes so much discussion, yet nobody has discussed what they would actually like the software to *do*

It's the color of the bikeshed ...

> We can make the reply to a commit message when any of the following events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

In DRBD tradition, I suggest you implement all of them, or at least factor the code so that each of them can be a one line change. (We can probably later drop one or two options.)
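One way to read "factor the code so that each of them can be a one line change" is sketched below. It is a Python illustration only, not the patch's design: the names `AckPoint`, `REPLY_AFTER` and `can_reply` are invented here, and the sketch assumes the six events are reached strictly in the listed order on the standby, which the thread does not actually establish.

```python
from enum import IntEnum


class AckPoint(IntEnum):
    """The six candidate reply points, numbered as in the list above."""
    SENT = 1
    RECEIVED = 2
    WRITTEN = 3
    FSYNCED = 4
    CRC_CHECKED = 5
    APPLIED = 6


# The "one line change": pick the point at which the commit is acknowledged.
REPLY_AFTER = AckPoint.FSYNCED


def can_reply(standby_progress: AckPoint, reply_after: AckPoint = REPLY_AFTER) -> bool:
    """True once the standby has progressed at least as far as required."""
    return standby_progress >= reply_after


received_is_enough = can_reply(AckPoint.RECEIVED)                      # needs FSYNCED
received_ok_for_level_2 = can_reply(AckPoint.RECEIVED, AckPoint.RECEIVED)
applied_is_enough = can_reply(AckPoint.APPLIED)
```

Passing `reply_after` per call rather than relying on the module-level default would also fit a setup where the level is chosen per transaction.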
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sun, 2008-12-14 at 21:41 -0500, Robert Haas wrote:
> > If this is right, #2, #3, #4, and #6 feel similar except that they're protecting against failures of different (but still all incomplete) subsets of the hardware on the slave, right?
>
> Right. Actually, the biggest difference with #6 has nothing to do with protecting against failures. It has rather to do with the ease of writing applications in the context of hot standby. You can close your connection, open a connection to a different server, and know that your transactions will be reflected there. On the other hand, I'd be surprised if it didn't come with a substantial performance penalty, so it may not be too practical in real life even if it sounds good on paper.
>
> #1, #3, and #5 don't feel that useful to me.

Yes, looks that way for me also. Good analysis Ron. I agree with Robert that #6 is there for other reasons.

#2 corresponds to DRBD algorithm B
#4 corresponds to DRBD algorithm C

Fujii-san, please can we incorporate those two options, rather than just one choice "synchronous_replication = on". They look like two commonly requested options.

#6 is an additional synchronization step in Hot Standby. I would say that people won't want that when they see how it performs (they probably won't want #4 either for that same reason, but that is for robustness).

Also, I would point out that the class of synch_rep is selected by the user on the primary and can vary from transaction to transaction. That is a very good thing, as far as I am concerned. We would need to enforce #6 for all transactions (if we implemented synchronisation in this way).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sun, 2008-12-14 at 12:57 -0500, Mark Mielke wrote:
> I'm curious about your suggestion to direct queries that need the latest snapshot to the 'primary'. I might have misunderstood it - but it seems that the expectation from some is that *all* sessions see the latest snapshot, so would this not imply that all sessions would be redirected to the 'primary'? I don't think it is reasonable myself, but I might be misunderstanding something...

I said "a snapshot taken on the primary", but the query would run on the standby.

Synchronising primary and standby so that they are identical from the perspective of a query requires some synchronisation delay. I'm pointing out that the synchronisation delay can occur

* at the time we apply WAL - which will slow down commits (i.e. #6 on my previous list of options)
* at the time we run a query that needs to see primary and standby synchronised

So the same effect can be achieved in various ways.

The first way would require *all* transactions to be applied on standby, i.e. option #6 for all transactions. That is a performance disaster and I would not force that onto everybody.

The second way can be done by taking a snapshot on the primary, with an associated LSN, then using that snapshot on the standby. That is somewhat complex, but possible. I see the requirement for getting the same answer on multiple nodes as a further extension of "transaction isolation mode" and think that not all people will want this, so we should allow that as an option.

I'm not going to worry about this at the moment. Hot standby will be useful without this and so I regard this as a secondary objective. Rome wasn't built in a single release, or something like that.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
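The second approach (take a snapshot with an associated LSN on the primary, then have the standby query wait until replay has reached that LSN) might look roughly like this. It is a Python sketch under stated assumptions: the `Standby` class, `replay_to` and `wait_for_lsn` are hypothetical names, LSNs are modelled as plain integers, and nothing here reflects actual PostgreSQL code.

```python
import threading


class Standby:
    """Toy standby that tracks how far WAL replay has progressed."""

    def __init__(self):
        self._cond = threading.Condition()
        self.replayed_lsn = 0

    def replay_to(self, lsn: int) -> None:
        """Called by the (imaginary) replay process after applying WAL."""
        with self._cond:
            self.replayed_lsn = max(self.replayed_lsn, lsn)
            self._cond.notify_all()

    def wait_for_lsn(self, lsn: int, timeout_s: float) -> bool:
        """Block a query until replay reaches lsn; False if it times out."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self.replayed_lsn >= lsn, timeout=timeout_s
            )


snapshot_lsn = 100  # LSN handed over together with the primary's snapshot

standby = Standby()
threading.Timer(0.01, standby.replay_to, args=(120,)).start()
caught_up = standby.wait_for_lsn(snapshot_lsn, timeout_s=1.0)

lagging = Standby()  # replay never advances on this one
behind = lagging.wait_for_lsn(snapshot_lsn, timeout_s=0.05)
```

The delay is paid only by queries that ask for it, rather than by every commit as in option #6.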
Re: [HACKERS] Sync Rep: First Thoughts on Code
Mark Mielke wrote:
> Where does the expectation come from? I don't recall ever reading it in the documentation, and unless the session processes are contending over the integers (using some sort of synchronization primitive) in memory that represent the "latest visible commit" on every single select, I'm wondering how it is accomplished?

The "integers" you're imagining are the ProcArray. Every backend has an entry there, and among other things it contains the current XID the backend is running. When a backend takes a new snapshot (on every single select in read committed mode), it locks the ProcArray, scans all the entries and collects all the XIDs listed there in the snapshot. Those are the set of transactions that were running when the snapshot was taken, and they are used in the visibility checks.

> If they are contending over these integers, doesn't that represent a scaling limitation, in the sense that on a 32-core machine, they're going to be fighting with each other to get the latest version of these shared integers into the CPU for processing? Maybe it's such a small penalty that we don't care? :-)

The ProcArrayLock is indeed quite busy on systems with a lot of CPUs. It's held for such short times that it's not a problem usually, but it can become a bottleneck with a machine like that with all backends running small transactions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
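The lock, scan and copy described above can be modelled in a few lines. This is a toy in Python, not the real procarray code: the `ProcArray` class, its methods and the `visible` helper are invented for illustration, a plain `threading.Lock` stands in for ProcArrayLock, and the upper-bound XID kept in the snapshot is an assumption added so that transactions started after the snapshot are not treated as visible.

```python
import threading


class ProcArray:
    """Toy stand-in for the shared array of running transaction IDs."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of ProcArrayLock
        self._running = set()
        self._next_xid = 1

    def begin(self) -> int:
        with self._lock:
            xid = self._next_xid
            self._next_xid += 1
            self._running.add(xid)
            return xid

    def end(self, xid: int) -> None:
        with self._lock:
            self._running.discard(xid)

    def snapshot(self):
        # Lock, scan, copy: (first unassigned xid, xids still running).
        with self._lock:
            return self._next_xid, frozenset(self._running)


def visible(xid: int, snapshot, committed: set) -> bool:
    """A transaction's effects are visible if it committed before the snapshot."""
    xmax, running = snapshot
    return xid < xmax and xid not in running and xid in committed


procs = ProcArray()
committed = set()
t1 = procs.begin()
t2 = procs.begin()
committed.add(t1)  # the commit is recorded first ...
procs.end(t1)      # ... then t1 leaves the array
snap = procs.snapshot()
```

Here `t1` is visible under `snap` and `t2` is not, because `t2` was still listed as running when the snapshot was copied.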
Re: [HACKERS] Sync Rep: First Thoughts on Code
Mark Mielke wrote:
> When I asked "does PostgreSQL guarantee this?" I didn't mean hand-waving examples or hand-waving expectations. I meant a pointer into the code that has some comment that says "we want to guarantee that a commit in one session will be immediately visible to other sessions, and that a later select issued in the other sessions will ALWAYS see the commit whether 1 nanosecond later or 200 seconds later". Robert's expectation and yours seem like taking this "guarantee" for granted rather than being justified with design intent and proof thus far. :-) Given my experiment to try and force it to fail, I can see why this would be taken for granted. Is this a real promise, though?

Yes. In a nutshell, commit works like this:

1. Write and flush WAL record about the commit
2. Mark the transaction as committed in clog
3. Remove the xid from the shared memory ProcArray.
4. Release locks and other resources
5. Reply to client that the transaction has been committed.

After step 3, any backend taking a snapshot will see the transaction as committed. Since we only reply to the client at step 5, it is guaranteed that a transaction beginning after step 5, as well as an already open transaction taking a new snapshot (i.e. running a new command in read committed mode) after that, will see the transaction as committed.

The relevant code is in CommitTransaction() in xact.c.

> To me, the question is relevant in terms of the expectations of a multi-replica solution. We know people have the expectation.

Yeah, I think Robert is right. We should reserve the term "synchronous replication" for the mode where that guarantee holds for the slave as well.

In fact, waiting for reply from standby server before acknowledging a commit to the client is a bit pointless otherwise. It puts you in a strange situation, where you're waiting for the commits in normal operation, but if there's a network glitch or the standby goes down, you're willing to go ahead without it. You get a high guarantee that your data is up-to-date in the standby, except when it isn't. Which isn't much of a guarantee.

But with hot standby, it makes a lot of sense. The guarantee is that if the standby is accepting queries, it's up-to-date with the primary.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
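The ordering argument in the five commit steps can be made concrete with a small model. This is a Python sketch only, not a rendering of CommitTransaction(): the `Backend` class, its attributes and the event names in `log` are all hypothetical, and the WAL, clog and ProcArray are reduced to a list, a dict and a set.

```python
class Backend:
    """Toy model of the five commit steps, recording their order."""

    def __init__(self):
        self.wal = []            # stands in for the flushed WAL
        self.clog = {}           # xid -> status
        self.proc_array = set()  # xids still shown as running
        self.log = []            # order in which the steps happened

    def begin(self, xid: int) -> None:
        self.proc_array.add(xid)

    def commit(self, xid: int) -> None:
        self.wal.append(("commit", xid))      # 1. write and flush WAL
        self.log.append("wal_flushed")
        self.clog[xid] = "committed"          # 2. mark committed in clog
        self.log.append("clog_marked")
        self.proc_array.discard(xid)          # 3. remove xid from ProcArray
        self.log.append("removed_from_procarray")
        self.log.append("locks_released")     # 4. release locks (no-op here)
        self.log.append("client_notified")    # 5. reply to the client

    def sees_as_committed(self, xid: int) -> bool:
        """What a snapshot taken right now would conclude about xid."""
        return self.clog.get(xid) == "committed" and xid not in self.proc_array


backend = Backend()
backend.begin(7)
before = backend.sees_as_committed(7)
backend.commit(7)
after = backend.sees_as_committed(7)
```

Because "removed_from_procarray" always precedes "client_notified" in `log`, any snapshot taken after the client hears about the commit already treats the transaction as committed.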
Re: [HACKERS] Sync Rep: First Thoughts on Code
Greg Stark wrote:
> When the database says the data is committed it has to mean the data is really committed. Imagine if you looked at a bank account balance after withdrawing all the money and saw a balance which didn't reflect the withdrawal and allowed you to withdraw more money again...

Within the same session - sure. From different sessions? PostgreSQL MVCC lets you see an older snapshot, although it does prefer to have the latest snapshot with each command.

To prevent withdrawing more money again, I would expect some sort of locking such as "SELECT ... FOR UPDATE;" to be used. This lock then forces the two transactions to become serialized, and the second will either wait for the first to complete or fail. Any banking program that assumed that it could SELECT to confirm a balance and then UPDATE to withdraw the money as separate instructions would be a bad banking program. To exploit it, I would just have to start both operations at the same time - they both SELECT, they both see I have money, they both give me the money and UPDATE, and I get double the money (although my balance would show a big negative value - but I'm already gone...). Database 101.

When I asked "does PostgreSQL guarantee this?" I didn't mean hand-waving examples or hand-waving expectations. I meant a pointer into the code that has some comment that says "we want to guarantee that a commit in one session will be immediately visible to other sessions, and that a later select issued in the other sessions will ALWAYS see the commit whether 1 nanosecond later or 200 seconds later". Robert's expectation and yours seem like taking this "guarantee" for granted rather than being justified with design intent and proof thus far. :-)

Given my experiment to try and force it to fail, I can see why this would be taken for granted. Is this a real promise, though? Or just an unlikely scenario that never seems to be hit?

To me, the question is relevant in terms of the expectations of a multi-replica solution. We know people have the expectation. We know it can be convenient. Is the expectation valid in the first place? I've probably drawn this question out too long and should do my own research and report back... Sorry... :-)

Cheers,
mark

--
Mark Mielke
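The double-withdrawal race and the effect of taking a lock before the balance check can be shown with a deterministic toy. This is Python, not SQL, and only an analogy: the `Account` class and both functions are invented for the example, and a `threading.Lock` held across the check and the update stands in for the row lock that "SELECT ... FOR UPDATE" would take.

```python
import threading


class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self.row_lock = threading.Lock()  # analogue of the FOR UPDATE row lock


def unsafe_withdrawals(account: Account, amount: int) -> int:
    """Two sessions both SELECT first, then both UPDATE (interleaving fixed)."""
    seen_by_a = account.balance  # session A checks the balance
    seen_by_b = account.balance  # session B checks before A has updated
    paid_out = 0
    for seen in (seen_by_a, seen_by_b):
        if seen >= amount:       # each decides on its own stale read
            account.balance -= amount
            paid_out += amount
    return paid_out


def locked_withdraw(account: Account, amount: int) -> int:
    """Check and update under one lock, so withdrawals are serialized."""
    with account.row_lock:
        if account.balance >= amount:
            account.balance -= amount
            return amount
        return 0


racy = Account(100)
racy_paid = unsafe_withdrawals(racy, 100)

safe = Account(100)
safe_paid = locked_withdraw(safe, 100) + locked_withdraw(safe, 100)
```

The unsafe version pays out twice and leaves a negative balance; the locked version refuses the second withdrawal. Neither outcome depends on how quickly a commit becomes visible, which is the distinction being drawn in this message.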
Re: [HACKERS] Sync Rep: First Thoughts on Code
> If this is right, #2, #3, #4, and #6 feel similar except that they're protecting against failures of different (but still all incomplete) subsets of the hardware on the slave, right?

Right. Actually, the biggest difference with #6 has nothing to do with protecting against failures. It has rather to do with the ease of writing applications in the context of hot standby. You can close your connection, open a connection to a different server, and know that your transactions will be reflected there. On the other hand, I'd be surprised if it didn't come with a substantial performance penalty, so it may not be too practical in real life even if it sounds good on paper.

#1, #3, and #5 don't feel that useful to me. In the case of #1, sending your WAL over the network and then not checking that it got there is sort of silly: packet loss on the network has got to be several orders of magnitude more likely than a failure on the master. #3 and #5 just don't seem to provide any real benefits over their immediate predecessors.

Honestly, I think the most useful thing is probably going to be asynchronous replication: in other words, when a commit is requested on the master, we write WAL and return success. In the background, we stream the WAL to a secondary, which writes it and applies it. This will give us a secondary which is mostly up to date (and can run queries, with hot standby) without killing performance. The other options are going to be for environments where losing a transaction is really, really bad, or (in the case of #6) read-mostly environments where it's useful to spread the query load out across several servers, but the overhead associated with waiting for the rare write transactions to apply everywhere is tolerable.

...Robert
Re: [HACKERS] Sync Rep: First Thoughts on Code
When the database says the data is committed it has to mean the data is really committed. Imagine if you looked at a bank account balance after withdrawing all the money and saw a balance which didn't reflect the withdrawal and allowed you to withdraw more money again...

-- Greg

On 14 Dec 2008, at 14:44, Mark Mielke wrote:
> Mark Mielke wrote:
> > Forget replication - even for the exact same server - I don't expect that if I commit from one session, I will be able to see the change immediately from my other session or a new session that I just opened. Perhaps it is often safe to rely on this, and it is useful for the database server to minimize the window during which the commit becomes visible to others, but I think it's a false expectation from the start that it absolutely will be immediately visible to another session. I'm thinking of situations where some part of the table is in cache. The only way the commit can communicate that the new transaction is available is by communication between the processes or threads, or between the multiple CPUs on the machine. Do I want every commit to force each session to become fully in alignment before my commit completes? Does PostgreSQL make this guarantee today? I bet it doesn't if you look far enough into the guts. It might be very fast - I don't think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on my dual-core CPU seem to be able to see the latest commits before they begin executing. Am I wrong about this? Does PostgreSQL provide an intentional guarantee that a commit from one session that completes immediately followed by a query from another session will always find the commit effect visible (provided the transaction isolation level doesn't get in the way)? Or is the machine and algorithms just fast enough that by the time it executes the query (up to 1 ms later) the commit is always visible in practice?
>
> Cheers,
> mark
>
> --
> Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
Heikki Linnakangas wrote:
> Mark Mielke wrote:
> > FYI: I haven't been able to prove this. Multiple sessions running on my dual-core CPU seem to be able to see the latest commits before they begin executing. Am I wrong about this? Does PostgreSQL provide an intentional guarantee that a commit from one session that completes immediately followed by a query from another session will always find the commit effect visible (provided the transaction isolation level doesn't get in the way)?
>
> Yes. PostgreSQL does guarantee that, and I would expect any other DBMS to do the same.

Where does the expectation come from? I don't recall ever reading it in the documentation, and unless the session processes are contending over the integers (using some sort of synchronization primitive) in memory that represent the "latest visible commit" on every single select, I'm wondering how it is accomplished? If they are contending over these integers, doesn't that represent a scaling limitation, in the sense that on a 32-core machine, they're going to be fighting with each other to get the latest version of these shared integers into the CPU for processing? Maybe it's such a small penalty that we don't care? :-)

I was never instilled with the logic that 'commit in one session guarantees visibility of the effects in another session'. But, as I say above, I wasn't able to make PostgreSQL "fail" in this regard. So maybe I have no clue what I am talking about? :-)

If you happen to know where the code or documentation makes this promise, feel free to point it out. I'd like to review the code. If you don't know - don't worry about it, I'll find it later...

Cheers,
mark

--
Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
Robert Haas wrote:
> We can make the reply to a commit message when any of the following events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Perhaps it'd be useful if the failure modes these are trying to protect against were described too. If I understand right:

1. Protects all the transactions from the failure of the master; so long as neither the network nor the slave machine die soon?
2. Protects all the transactions from the failure of the master and the network between the slave and master, so long as the slave doesn't die soon?
3. Same as #2?
4. Protects against the failure of the master, the network, and parts of the slave; so long as the slave's disk survives the failure?
5. Protects against all of the above, and bit-errors in the memories of the slave machine (except the slave's disk controller?)? Or are we reading-back the CRC from the slave's disk and comparing to the CRC computed on the master where it might protect from even more?
6. Same as 4?

If this is right, #2, #3, #4, and #6 feel similar except that they're protecting against failures of different (but still all incomplete) subsets of the hardware on the slave, right?
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On 14 Dec 2008, at 16:48, Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication" causes so much discussion, yet nobody has discussed what they would actually like the software to *do* (this being a software discussion list...). AFAICS we can make the software behave like *any* of the definitions discussed so far.

It seems that the easy parts are the ones most people will participate in. Maybe it's that simple.

> We can make the reply to a commit message when any of the following events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Ok, so let's talk about this easy part: my understanding of "synchronous replication" is that it gives its users the strong guarantee that at commit time the transaction is secured on the slave(s). That means you get the D of ACID on more than one server.

Why synchronous? Because you know the durability is ensured exactly when you receive the COMMIT ack.

So I'm with Simon on this: the term Synchronous Replication does accurately describe what's being implemented here, and on the other hand, as so many of us are saying, it's true that it tells very little about it. Those 6 options are all in the scope of the infamous naming, just different guarantee levels, from almost strong to very strong, with some "almost, but not quite, entirely unlike the strong I want". Pick your naming here too.

At least, that's how I'm understanding this. The bottom line of why I care to send this email is that maybe it'll help some people to recover from sleep deprivation ;)

My 2¢,
--
dim
Re: [HACKERS] Sync Rep: First Thoughts on Code
Mark Mielke wrote:
> Mark Mielke wrote:
> > Forget replication - even for the exact same server - I don't expect that if I commit from one session, I will be able to see the change immediately from my other session or a new session that I just opened. Perhaps it is often safe to rely on this, and it is useful for the database server to minimize the window during which the commit becomes visible to others, but I think it's a false expectation from the start that it absolutely will be immediately visible to another session. I'm thinking of situations where some part of the table is in cache. The only way the commit can communicate that the new transaction is available is by communication between the processes or threads, or between the multiple CPUs on the machine. Do I want every commit to force each session to become fully in alignment before my commit completes? Does PostgreSQL make this guarantee today? I bet it doesn't if you look far enough into the guts. It might be very fast - I don't think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on my dual-core CPU seem to be able to see the latest commits before they begin executing. Am I wrong about this? Does PostgreSQL provide an intentional guarantee that a commit from one session that completes immediately followed by a query from another session will always find the commit effect visible (provided the transaction isolation level doesn't get in the way)?

Yes. PostgreSQL does guarantee that, and I would expect any other DBMS to do the same.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Sync Rep: First Thoughts on Code
Mark Mielke wrote:
> Forget replication - even for the exact same server - I don't expect that if I commit from one session, I will be able to see the change immediately from my other session or a new session that I just opened. Perhaps it is often safe to rely on this, and it is useful for the database server to minimize the window during which the commit becomes visible to others, but I think it's a false expectation from the start that it absolutely will be immediately visible to another session. I'm thinking of situations where some part of the table is in cache. The only way the commit can communicate that the new transaction is available is by communication between the processes or threads, or between the multiple CPUs on the machine. Do I want every commit to force each session to become fully in alignment before my commit completes? Does PostgreSQL make this guarantee today? I bet it doesn't if you look far enough into the guts. It might be very fast - I don't think it is infinitely fast.

FYI: I haven't been able to prove this. Multiple sessions running on my dual-core CPU seem to be able to see the latest commits before they begin executing. Am I wrong about this? Does PostgreSQL provide an intentional guarantee that a commit from one session that completes immediately followed by a query from another session will always find the commit effect visible (provided the transaction isolation level doesn't get in the way)? Or is the machine and algorithms just fast enough that by the time it executes the query (up to 1 ms later) the commit is always visible in practice?

Cheers,
mark

--
Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
> We can make the reply to a commit message when any of the following events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Also

0. The same time we would have done so if replication had not been configured at all.

I think the basic problem here is that we can talk about "asynchronous replication" and "synchronous replication" but there are n>2 possible/useful behaviors (I would guess principally 0, 2, 4, and 6, but YMMV). So we're going to need some way to clarify what we mean.

BTW, in case my previous emails on this topic might have given someone the contrary impression, I'm not really that worked up about this either. Interesting? Yes. Have opinions? Yes. Lie awake nights worrying about it? Nope. :-)

...Robert
Re: [HACKERS] Sync Rep: First Thoughts on Code
Simon Riggs wrote: I am truly lost to understand why the *name* "synchronous replication" causes so much discussion, yet nobody has discussed what they would actually like the software to *do* (this being a software discussion list...). AFAICS we can make the software behave like *any* of the definitions discussed so far. I think people have talked about 'like' in the context of user expectations. That is, there seems to exist a set of people (probably those who've never worked with a multi-replica solution before) who expect that once commit completes on one server, they can query any other master or slave and be guaranteed visibility of the transaction they just committed. These people may theoretically change their decision to not use Postgres-R, or at least change their approach to how they work with Postgres-R, if the name was in some way more intuitive to them in terms of what is actually being provided. "Synchronous replication" itself says only details about replication, it does not say anything about visibility, so to some degree, people are focusing on the wrong term as the problem. Even if it says "asynchronous replication" - not sure that I care either way - this doesn't improve the understanding for the casual user of what is happening behind the scenes. Neither synchronous nor asynchronous guarantees that the change will be immediately visible from other nodes after I type 'commit;'. Asynchronous might err on the side of not immediately visible, where synchronous might (incorrectly) imply immediate visibility, but it's not an accurate guarantee to provide. Synchronous does not guarantee visibility immediately after. Some indefinite but usually short time must normally pass from when my 'commit;' completes until when the shared memory visible to my process "sees" the transaction. 
Multiple replicas with network latency or reliability issues increases the theoretical minimum size of this window to something that would be normally encountered as opposed to something that is normally not encountered. The only way to guarantee visibility is to ensure that the new transaction is guaranteed to be visible from a shared memory perspective on every machine in the pool, and every active backend process. If my 'commit;' is going to wait for this to occur, first, I think this forces every commit to have numerous network round trips to each machine in the pool, it forces each machine in the pool to be network accessible and responsive, it forces all commits to be serialized in the sense of "the slowest machine in the pool determines the time for my commit to complete", and I think it implies some sort of inter-process signalling, or at the very least CPU level signalling about shared memory (in the case of multiple CPUs). People such as myself think that a visibility guarantee is unreasonable and certain to cause scalability or reliability problems. So, my 'like' is an efficient multi-master solution where if I put 10 machines in the pool, I expect my normal query/commit loads to approach 10X as fast. My like prefers scalability over guarantees that may be difficult to provide, and probably are not provided today even in a single server scenario. It is certainly far too early to say what the final exact behaviour will be and there is no reason at all to pre-suppose that it need only be a single behaviour. I'm in favour of options, generally, but I would say that the distinction between some of these options is mostly very fine and strongly doubt whether people would use them if they existed. *But* I think we can add them at a later stage of development if requirements genuinely exist once all the benefits *and* costs are understood. 
The above 'commit;' behaviour difference - whether it completes when the commit is permanent (it will definitely be applied to all replicas - it just may take time to apply to all replicas), or when the commit has actually taken effect (two-phase commit on all replicas - and both phases have completed on all replicas - what happens if second phase commit fails on one or more servers?), or when the commit is guaranteed to be visible from all existing and new sessions (two-phase commit plus additional signalling required?) - might be such an option. I'm doubtful, though - as the difference in implementation between the first and second is pretty significant.

I'm curious about your suggestion to direct queries that need the latest snapshot to the 'primary'. I might have misunderstood it - but it seems that the expectation from some is that *all* sessions see the latest snapshot, so would this not imply that all sessions would be redirected to the 'primary'? I don't think it is reasonable myself, but I might be misunderstanding something...

Cheers,
mark

--
Mark Mielke

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postg
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sun, 2008-12-14 at 13:31 +0900, Tatsuo Ishii wrote: > > The point here is that synchronous replication, at least to some > > people, is going to imply that the user-visible states of the two > > copies are consistent. To other people, it is going to imply that > > committed transactions will never be lost even in the event of a > > catastropic loss of the primary 1 picosecond after the commit is > > acknowledged. We need to choose some word that implies that we are > > guaranteeing the latter of these two things but not the former. > > Otherwise, we will have confused users, and terminological confusion > > when and if we ever implement the former as well. > > Right. Before watching this thread, I had thought that the log > shipping sync replication behaves former (and I had told so to people > in Japan who are interested in 8.4 development. Of course this is my > fault, though). > > Now I understand the log shipping sync replication does not behave > same as other "sync replications" such as pgpool and PGCluster (there > maybe more, but I don't know) GENERAL COMMENTS, not to anybody in particular: 'Tis but thy name that is my enemy. ... What's in a name? That which we call a rose By any other name would smell as sweet. ... Juliet, from "Romeo and Juliet" I am truly lost to understand why the *name* "synchronous replication" causes so much discussion, yet nobody has discussed what they would actually like the software to *do* (this being a software discussion list...). AFAICS we can make the software behave like *any* of the definitions discussed so far. It is certainly far too early to say what the final exact behaviour will be and there is no reason at all to pre-suppose that it need only be a single behaviour. I'm in favour of options, generally, but I would say that the distinction between some of these options is mostly very fine and strongly doubt whether people would use them if they existed. 
*But* I think we can add them at a later stage of development if requirements genuinely exist once all the benefits *and* costs are understood.

I would also point out that the distinction made between various meanings of synchronous is *only* important if Hot Standby is included as well. And that is closely linked to the replication feature, which we really need to complete first. We have much to do yet. So let's please end the name debate there and think about software.

...

We can make the reply to a commit message when any of the following events have occurred:

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Now you might think from what people have said that having synchronised contents on both primary and standby is the only way to achieve exactly the same results to queries on both nodes. Another way is to utilise a snapshot taken on the primary and simply wait until the standby catches up with that snapshot's LSN. So there is more than one way of achieving a particular result and it is not dependent upon the exact synchronisation we employ at commit time.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
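As a rough illustration of the two ideas in the message above (a configurable reply point, and waiting for the standby to reach a snapshot's LSN), here is a minimal Python sketch. All names (AckPoint, may_reply_to_commit, standby_has_caught_up) are hypothetical and are not part of the patch or of PostgreSQL; LSNs are modelled as plain integers, and the six events are assumed to occur in the order listed.

```python
from enum import IntEnum


class AckPoint(IntEnum):
    """The six candidate reply points, numbered as in the message above."""
    SENT_TO_STANDBY = 1
    RECEIVED_ON_STANDBY = 2
    WAL_WRITTEN = 3
    WAL_FSYNCED = 4
    COMMIT_RECORD_CRC_CHECKED = 5
    COMMIT_RECORD_APPLIED = 6


def may_reply_to_commit(reached: AckPoint, required: AckPoint) -> bool:
    # Assumption: events happen in list order, so reaching a later event
    # implies that every earlier one has already happened.
    return reached >= required


def standby_has_caught_up(snapshot_lsn: int, standby_replayed_lsn: int) -> bool:
    # A query using a snapshot taken on the primary can run on the standby
    # once the standby has replayed WAL at least up to that snapshot's LSN.
    # LSNs are plain integers here (an assumption made for illustration).
    return standby_replayed_lsn >= snapshot_lsn
```

The point of keeping the two functions separate is the one made in the message: the reply point chosen at commit time and the way a standby query gets a consistent answer are independent decisions.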
Re: [HACKERS] Sync Rep: First Thoughts on Code
Robert Haas wrote: The term of art for making sure that transactions committed on the primary are visible on the secondary seems to be "one-copy serializability" (see, for example, a Google Books search on that term). Not exactly. 1-copy-serializability which is the standard for multi-master solutions, guarantees that transactions are executed in the same serializable order at each replica (which means that transactions can be executed in different order and committed at different times on different replica as long as a consistent serializable view is presented to the client). There are a number of optimizations in that area but in a multi-master case, replicas rarely commit at the same time. There are interesting papers on the subject (like Tashkent & Tashkent+ based on Postgres) for those who want to understand these problems more thoroughly. Hope this helps, manu -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development & Consulting -- Web: http://www.frogthinker.org email: m...@frogthinker.org Skype: emmanuel_cecchet -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Robert Haas wrote: On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane wrote: We won't call it anything, because we never will or can implement that. See the theory of relativity: the notion of exactly simultaneous events OK, fine. I'll be more precise. I think we need to reserve the term "synchronous replication" for a system where transactions that begin on the standby after the transactions has committed on the master see the effects of the committed transaction. Wouldn't this be serialized transactions? I'd like to see proof of some sort that PostgreSQL guarantees that the instant a 'commit' returns, any transactions already open with the appropriate transaction isolation level, or any new sessions *will* see the results of the commit. I know that most of the time this happens - but what process synchronization steps occur to *guarantee* that this happens? I just googled "synchronous replication" and read through the first page of hits. Most of them do not address the question of whether synchronous replication can be said to have be completed when WAL has been received by the standby not but yet applied. One of the ones that does is: http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign ...which refers to what we're proposing to call "Synchronous Replication" as "Semi-Synchronous Replication" (or 2-safe replication) specifically to distinguish it. The other is: http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf ...which doesn't specifically examine the issue but seems to take the opposite position, namely that the server on which the transaction is executed needs to wait only for one server to apply the changes to the database (the others need only to know that they need to commit it; they don't actually need to have done it). 
However, that same paper refers to two-phase commit as a synchronous replication algorithm, and Wikipedia's discussion of two-phase commit: http://en.wikipedia.org/wiki/Two-phase_commit_protocol ...clearly implies that the transaction must be applied everywhere before it can be said to have committed. The second page of Google results is mostly a further discussion of the MySQL solution, which is mostly described as "semi-synchronous replication". Simon Riggs said upthread that Oracle called this "synchronous redo transport". That is obviously much closer to what we are doing than "synchronous replication". Two phase commit doesn't imply that the transaction is guaranteed to be immediately visible. See my previous paragraph. Unless transactions are locked from starting until they are able to prove that they have the latest commit (a feat which I'm going to theorize as impossible - because the moment you wait for a commit, and you begin again, you really have no guarantee that another commit has not occurred in the mean time), I think it's clear that two phase commit guarantees that the commit has taken place, but does *not* guarantee anything about visibility. It might be a good bet - but guarantee? There is no such guarantee. Cheers, mark -- Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
> The point here is that synchronous replication, at least to some
> people, is going to imply that the user-visible states of the two
> copies are consistent. To other people, it is going to imply that
> committed transactions will never be lost even in the event of a
> catastrophic loss of the primary 1 picosecond after the commit is
> acknowledged. We need to choose some word that implies that we are
> guaranteeing the latter of these two things but not the former.
> Otherwise, we will have confused users, and terminological confusion
> when and if we ever implement the former as well.

With apologies for replying to my own post:

It's also important to understand that these two invariants are completely separate and it is possible to guarantee either without the other. If you want (1), the standby needs to apply the WAL before sending an acknowledgment to the primary but does not necessarily need to write it to disk (of course, it will have to be written to disk before the modified buffers are written to disk, but that's a separate issue). If you want (2), the standby needs to write the WAL to disk before sending the acknowledgment but does not necessarily need to apply it. If you want both, then, you need to wait for both (and it's worth noting that your performance will probably be nothing to write home about).

I also did some research on terminology that has been used in the literature. As Jim Gray describes it:

1-safe replication. Transaction is committed when it has been locally WAL-logged to durable storage.

Group-safe replication. Transaction is committed when WAL has been received by all remote servers, but not necessarily written to durable storage.

Group-safe & 1-safe replication. Transaction is committed when it has been locally WAL-logged to durable storage and WAL has been received by all remote servers.

2-safe replication. Transaction is committed when it has been written to durable storage on both local and remote servers.

Very safe replication. As 2-safe, but fails any read-write transaction if the secondary is down.

(Actually, it appears that "Transaction Processing" by Jim Gray and Andreas Reuter, 1993, uses 2-safe to refer to either 2-safe or group-safe; the distinction between the two is a subsequent development. See e.g. Advances in Database Technology-EDBT 2004 by Elisa Bertino)

The term of art for making sure that transactions committed on the primary are visible on the secondary seems to be "one-copy serializability" (see, for example, a Google Books search on that term).

...Robert

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
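The taxonomy in the message above can be read as a set of predicates over how far a commit has progressed. The following Python sketch is only one such reading: the class and function names are hypothetical, and the behaviour of 2-safe while the secondary is down (falling back to local durability) is an assumption inferred from the contrast with "very safe", not something the message states.

```python
from dataclasses import dataclass


@dataclass
class CommitProgress:
    local_durable: bool      # WAL logged to durable storage locally
    remote_received: bool    # WAL received by all remote servers
    remote_durable: bool     # WAL on durable storage on the remote servers
    secondary_up: bool = True


def is_committed(level: str, p: CommitProgress) -> bool:
    """Return True if the transaction counts as committed at this level."""
    if level == "1-safe":
        return p.local_durable
    if level == "group-safe":
        return p.remote_received
    if level == "group-safe & 1-safe":
        return p.local_durable and p.remote_received
    if level == "2-safe":
        if not p.secondary_up:
            # Assumption: with the secondary down, 2-safe carries on
            # with local durability only.
            return p.local_durable
        return p.local_durable and p.remote_durable
    if level == "very safe":
        # Same as 2-safe, but the transaction fails if the secondary is down.
        return p.secondary_up and p.local_durable and p.remote_durable
    raise ValueError(f"unknown level: {level}")
```

Note that none of these predicates mentions whether the WAL has been applied on the standby, which is why the visibility question has to be discussed separately.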
Re: [HACKERS] Sync Rep: First Thoughts on Code
> The point here is that synchronous replication, at least to some
> people, is going to imply that the user-visible states of the two
> copies are consistent. To other people, it is going to imply that
> committed transactions will never be lost even in the event of a
> catastrophic loss of the primary 1 picosecond after the commit is
> acknowledged. We need to choose some word that implies that we are
> guaranteeing the latter of these two things but not the former.
> Otherwise, we will have confused users, and terminological confusion
> when and if we ever implement the former as well.

Right. Before watching this thread, I had thought that the log shipping sync replication behaves like the former (and I had said so to people in Japan who are interested in 8.4 development. Of course this is my fault, though).

Now I understand that the log shipping sync replication does not behave the same as other "sync replications" such as pgpool and PGCluster (there may be more, but I don't know).

--
Tatsuo Ishii
SRA OSS, Inc. Japan

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 22:23 -0500, Robert Haas wrote: > > If it's guaranteed to be visible on the standby after it's committed on > > the master, and you don't have any way to make it actually simultaneous, > > then that implies that it's visible on the slave for some brief period > > of time before it's committed on the master. > > > > That situation is still asymmetric, so why is that a better use of the > > term "synchronous"? > > Because that happens anyway. If I request a commit on a single, > unreplicated server, the server makes the commit visible to new > transactions and then sends me a message informing me that the commit > has completed. Since the message takes some finite time to reach me, > there is a window of time after the commit has completed and before I > know that the commit has been completed. > Oh, I see the distinction now. Thanks for the detailed reply. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
> Might it not be true that anybody unfamiliar would be confused and that this > is a bit of a straw man? [...] > If my application assumes that it can commit to one server, and then read > back the commit from another server, and my application breaks as a result, > it's because I didn't understand the problem. Even if PostgreSQL didn't use > the word "synchronous replication", I could still be confused. I need to > understand the problem no matter what words are used. That is certainly true. But there is value in choosing words which elucidate the situation as much as possible. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
> If it's guaranteed to be visible on the standby after it's committed on
> the master, and you don't have any way to make it actually simultaneous,
> then that implies that it's visible on the slave for some brief period
> of time before it's committed on the master.
>
> That situation is still asymmetric, so why is that a better use of the
> term "synchronous"?

Because that happens anyway. If I request a commit on a single, unreplicated server, the server makes the commit visible to new transactions and then sends me a message informing me that the commit has completed. Since the message takes some finite time to reach me, there is a window of time after the commit has completed and before I know that the commit has been completed.

Suppose for the sake of argument that the single, unreplicated server did these two tasks in the opposite order - namely, first, it sent a message to the process requesting the commit stating that the commit had completed, and only then made the transaction visible. This would create a race condition: the process requesting the commit might receive the commit acknowledgment and begin a new transaction before the previous transaction had been made visible, and would therefore not be able to see the results of its own previous actions. I think it's fair to say that this behavior would be judged totally intolerable.

Therefore, there can't possibly be any applications out there which are depending on the fact that commits don't become visible until they are acknowledged, but there very well could be some applications which depend on the fact that once commits are acknowledged, they are visible. If replication is synchronous in this sense, then I can open a connection to the master, write some data, close the connection, open a new connection to the master or the slave (not caring which), and read back the data that I just wrote (assuming no one else has modified it in the mean time). If it isn't, then I can't. Some people will not care about this, but some will.

The point here is that synchronous replication, at least to some people, is going to imply that the user-visible states of the two copies are consistent. To other people, it is going to imply that committed transactions will never be lost even in the event of a catastrophic loss of the primary 1 picosecond after the commit is acknowledged. We need to choose some word that implies that we are guaranteeing the latter of these two things but not the former. Otherwise, we will have confused users, and terminological confusion when and if we ever implement the former as well.

...Robert

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
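The ordering argument in the message above can be made concrete with a toy model. The Python sketch below is not how any server is implemented; read_after_ack and its single key "x" are hypothetical, and it simply plays out the worst-case interleaving in which the client starts its next read the instant it is told the commit completed.

```python
def read_after_ack(ack_before_visible: bool):
    """Return what a client reads immediately after its commit is acknowledged."""
    visible = {}  # rows that new transactions are able to see

    def make_visible():
        visible["x"] = 1

    def client_reads_on_ack():
        # The client receives the acknowledgment and at once begins a
        # new transaction that reads back what it just wrote.
        return visible.get("x")

    if ack_before_visible:
        # Hypothetical ordering: acknowledge first, publish afterwards.
        observed = client_reads_on_ack()
        make_visible()
    else:
        # Ordering described for a single server: publish, then acknowledge.
        make_visible()
        observed = client_reads_on_ack()
    return observed
```

With ack_before_visible=False the client reads back 1; with ack_before_visible=True it reads None, i.e. it cannot see its own committed write. A standby that acknowledges once WAL is written but applies it later is in the second situation with respect to readers on that standby.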
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 21:35 -0500, Robert Haas wrote: > On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane wrote: > > "Robert Haas" writes: > >> I think we need to reserve the term "synchronous replication" for a > >> system where transactions that begin at the same time on the primary > >> and standby see the same tuples. Clearly that is "more" synchronous > > > > We won't call it anything, because we never will or can implement that. > > See the theory of relativity: the notion of exactly simultaneous events > > OK, fine. I'll be more precise. I think we need to reserve the term > "synchronous replication" for a system where transactions that begin > on the standby after the transactions has committed on the master see > the effects of the committed transaction. > If it's guaranteed to be visible on the standby after it's committed on the master, and you don't have any way to make it actually simultaneous, then that implies that it's visible on the slave for some brief period of time before it's committed on the master. That situation is still asymmetric, so why is that a better use of the term "synchronous"? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane wrote:
> "Robert Haas" writes:
>> I think we need to reserve the term "synchronous replication" for a
>> system where transactions that begin at the same time on the primary
>> and standby see the same tuples. Clearly that is "more" synchronous
>
> We won't call it anything, because we never will or can implement that.
> See the theory of relativity: the notion of exactly simultaneous events

OK, fine. I'll be more precise. I think we need to reserve the term "synchronous replication" for a system where transactions that begin on the standby after the transaction has committed on the master see the effects of the committed transaction.

> at distinct locations isn't even well-defined, because observers at yet
> other locations will disagree about what is "simultaneous". And I'm
> not just making a joke here --- speed-of-light delays in a WAN are
> meaningful compared to current computer speeds. In practice, the
> slave and the master will never commit at exactly the same time.
>
> I agree with the point made upthread that we should use the term
> "synchronous replication" the way it's commonly used in the industry.
> Inventing our own terminology might be fun but it's not really going
> to result in less confusion.

I just googled "synchronous replication" and read through the first page of hits. Most of them do not address the question of whether synchronous replication can be said to have been completed when WAL has been received by the standby but not yet applied. One of the ones that does is:

http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign

...which refers to what we're proposing to call "Synchronous Replication" as "Semi-Synchronous Replication" (or 2-safe replication) specifically to distinguish it.
The other is: http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf ...which doesn't specifically examine the issue but seems to take the opposite position, namely that the server on which the transaction is executed needs to wait only for one server to apply the changes to the database (the others need only to know that they need to commit it; they don't actually need to have done it). However, that same paper refers to two-phase commit as a synchronous replication algorithm, and Wikipedia's discussion of two-phase commit: http://en.wikipedia.org/wiki/Two-phase_commit_protocol ...clearly implies that the transaction must be applied everywhere before it can be said to have committed. The second page of Google results is mostly a further discussion of the MySQL solution, which is mostly described as "semi-synchronous replication". Simon Riggs said upthread that Oracle called this "synchronous redo transport". That is obviously much closer to what we are doing than "synchronous replication". ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Markus Wanner wrote: I don't think synchronous replication guarantees that it will be immediately visible. Even if it did push the change to the other machine, and the other machine had committed it, that doesn't guarantee that any reader sees it any more than if I commit to the same machine (no replication), I am guaranteed to see the change from another session. AFAIK every snapshot taken after a transaction has acknowledged its commit is guaranteed to see changes from that transaction. Isn't that a pretty frequent and obvious user expectation? Yes - but that's only really true while the session continues. From another session? I've never assumed that I could reconnect and be guaranteed to get the latest snapshot that includes absolutely everything that has been committed. Any system that guaranteed this even when involving multiple machines would be guaranteed to be inefficient and difficult to scale in my opinion. How could any system promise to have reasonable commit times while also guaranteeing that once a commit completes, any session to any other server will be able to see the commit? I think this forces some sort of serialization between multiple machines and defeats the purpose of having multiple machines. Where before it was indeterminate to know when the commit would take effect at each replica, it's not indeterminate when my commit will succeed. That is, my commit cannot succeed until every single server acknowledge that it is has fully received and committed my transaction. What happens if there are network problems, or what happens if I am replicating over a slower link? What if I am committing to 100 servers? Is it reasonable to expect 100 server negotiations to complete in full before my own commit will return? Synchronous replication only means that I can be assured that my change has been saved permanently by the time my commit completes. 
It doesn't mean anybody else can see my change or is guaranteed to see my change if they query from another session. So you wouldn't be surprised if a transaction from two hours ago isn't visible on another node, just because that node happens to be rather busy with lots of other readers and maintenance tasks? Any system that is two hours behind should fall out of the pool used to satisfy reads. So, if there was a surprise, it would be this. I don't believe ACID requires that a commit on one server is immediately visible on another server. Any work I do on the "behind" server would still be safe from a transaction and referential integrity perspective. However, if I executed 'commit' on this "behind" server, I would expect the commit to wait until it catches up, or in the case of being 2 hours behind, I would expect the commit to fail. Look at the alternative - all commits to any server in the pool would be locked up waiting for this one machine to catch up on 2 hours of transactions. This emphasizes that the problem is that a server two hours out of date is still in the pool, rather than the problem being keeping things up-to-date. If my application assumes that it can commit to one server, and then read back the commit from another server, and my application breaks as a result, it's because I didn't understand the problem. Well, yeah, depends on user expectations. I'm surprised to hear that you have that understanding of synchronous replication. I've seen people face it in the past. Most recently we had a presentation from the developer of digg.com, and he described how he had this problem with MySQL and that he had to work around it. On a smaller scale and slightly unrelated, I had this problem frequently between memcache and PostgreSQL. That is, memcache would always be latest, but PostgreSQL might not be latest, because the commit had not occurred. It seems like a standard enough problem to me. I don't expect Postgres-R to do the impossible.
As with my previous paragraph, I don't expect Postgres-R to wait 2-hours to commit just because one server is falling behind. Even if PostgreSQL didn't use the word "synchronous replication", I could still be confused. I need to understand the problem no matter what words are used. As said, it depends on what the common understanding of "synchronous replication" is. I've so far been under the impression, that these potential lags are unexpected and confusing. Several people pointed me at that problem and I've thus "relabeled" Postgres-R as not being synchronous. I'm at least surprised to suddenly get pushed into the other direction. :-) However, I absolutely agree that it's not that important how we name it. What is important, is that users and developers understand the difference I agree they are unexpected and confusing. I don't agree that they are unexpected or confusing to those knowledgeable in the domain. So, the question becomes - whose expectation is wrong? Should the user learn more? Or should we push for a c
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, Mark Mielke wrote: > Might it not be true that anybody unfamiliar would be confused and that > this is a bit of a straw man? Might be. I've neglected the issue myself for a while. > I don't think synchronous replication guarantees that it will be > immediately visible. Even if it did push the change to the other > machine, and the other machine had committed it, that doesn't guarantee > that any reader sees it any more than if I commit to the same machine > (no replication), I am guaranteed to see the change from another > session. AFAIK every snapshot taken after a transaction has acknowledged its commit is guaranteed to see changes from that transaction. Isn't that a pretty frequent and obvious user expectation? > Synchronous replication only means that I can be assured that > my change has been saved permanently by the time my commit completes. It > doesn't mean anybody else can see my change or is guaranteed to see my > change if the query from another session. So you wouldn't be surprised if a transaction from two hours ago isn't visible on another node, just because that node happens to be rather busy with lots of other readers and maintenance tasks? > If my application assumes that it can commit to one server, and then > read back the commit from another server, and my application breaks as a > result, it's because I didn't understand the problem. Well, yeah, depends on user expectations. I'm surprised to hear that you have that understanding of synchronous replication. > Even if PostgreSQL > didn't use the word "synchronous replication", I could still be > confused. I need to understand the problem no matter what words are used. As said, it depends on what the common understanding of "synchronous replication" is. I've so far been under the impression, that these potential lags are unexpected and confusing. Several people pointed me at that problem and I've thus "relabeled" Postgres-R as not being synchronous. 
I'm at least surprised to suddenly get pushed into the other direction. :-) However, I absolutely agree that it's not that important how we name it. What is important, is that users and developers understand the difference. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Aidan Van Dyk wrote:
> Well, I think the PG MVCC (which wal-streaming just ships across
> somewhere else) will save that. So with hot-standby
> another client could see the result *after* the COMMIT has been
> requested, but *before* the COMMIT returns... But we have this
> situation in a single current PG instance anyways, so it's nothing
> new

AFAIU the proposed algorithm only waits until WAL is written on the slave before acknowledging COMMIT. Application of the changes may be deferred, so it's not necessarily immediately visible on the slave.

> But with hot-standby, I could also see that it could be done such that
> the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but
> because of a current running query, application of it is delayed... But
> this is hot-standby's problem of describing itself, not sync-rep.

I'm thinking of the overall system and don't care much if it's hot-standby's or sync-rep's problem. But it's certainly the master which needs to await certain acknowledgments of the slaves. That has so far been discussed within this sync-rep thread.

> IMHO, sync-rep is about getting the change "durably to a slave" before
> acknowledging the COMMIT. That slave could be any number of things:
> - A "WAL archive" type system having the ability to be used for
>   recovery
> - A PG with special "recovery mode" that reads the stream and applies it
> - A full hot-standby recovery
>
> I could see any and all of those (and probably others) being useful and
> used.

I certainly agree to that.

Regards

Markus Wanner

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
Markus Wanner wrote: Tom Lane wrote: We won't call it anything, because we never will or can implement that. See the theory of relativity: the notion of exactly simultaneous events at distinct locations isn't even well-defined That has never been the point of the discussion. It's rather about the question if changes from transactions are guaranteed to be visible on remote nodes immediately after commit acknowledgment. Whether or not this is guaranteed, in both cases the term "synchronous replication" is commonly used, which is causing confusion. Might it not be true that anybody unfamiliar would be confused and that this is a bit of a straw man? I don't think synchronous replication guarantees that it will be immediately visible. Even if it did push the change to the other machine, and the other machine had committed it, that doesn't guarantee that any reader sees it any more than if I commit to the same machine (no replication), I am guaranteed to see the change from another session. Synchronous replication only means that I can be assured that my change has been saved permanently by the time my commit completes. It doesn't mean anybody else can see my change or is guaranteed to see my change if the query from another session. If my application assumes that it can commit to one server, and then read back the commit from another server, and my application breaks as a result, it's because I didn't understand the problem. Even if PostgreSQL didn't use the word "synchronous replication", I could still be confused. I need to understand the problem no matter what words are used. Cheers, mark -- Mark Mielke
Re: [HACKERS] Sync Rep: First Thoughts on Code
* Markus Wanner [081213 16:33]:
> Hi,
>
> Hannu Krosing wrote:
> > You can have a variant of sync rep + hot standby where the master does
> > not return committed before the slave has both synced the data and
> > replayed the transaction so that it is visible on slave but in that case
> > you may have a usecase, where it is actually visible on slave _before_
> > it is visible on master.
>
> As long as it's not visible *before* the client requests a COMMIT, that
> certainly doesn't matter (because the application cannot check that).

Well, I think the PG MVCC (which wal-streaming just ships across somewhere else) will save that. So with hot-standby, another client could see the result *after* the COMMIT has been requested, but *before* the COMMIT returns... But we have this situation in a single current PG instance anyway, so it's nothing new.

But with hot-standby, I could also see that it could be done such that the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but because of a currently running query, application of it is delayed... But this is hot-standby's problem of describing itself, not sync-rep.

IMHO, sync-rep is about getting the change "durably to a slave" before acknowledging the COMMIT. That slave could be any number of things:
 - A "WAL archive" type system having the ability to be used for recovery
 - A PG with special "recovery mode" that reads the stream and applies it
 - A full hot-standby recovery

I could see any and all of those (and probably others) being useful and used.

But the current patch focuses on the streaming (sending), and a receiver "recovery" mode that can accept/apply them, again, without worrying about actually running queries (yet)...

a.

--
Aidan Van Dyk                Create like a god,
ai...@highrise.ca            command like a king,
http://www.highrise.ca/      work like a slave.
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, Hannu Krosing wrote: > You can have a variantof sync rep + hot standby where the master does > not return committed before the slave has both synced the data and > replied the transaction so that it is visible on slave but in that case > you may have a usecase, where it is actually visible on slave _before_ > it is visible on master. As long as it's not visible *before* the client requests a COMMIT, that certainly doesn't matter (because the application cannot check that). What matters is, that an application might expect a node to show the changes of a transaction which has previously (seen from the application itself) been committed and acknowledged by another node. AFAICT the common understanding of synchronous replication is, that all nodes confirm to have committed the changes of a transaction *before* acknowledging COMMIT to the application (and obviously only *after* the application requested to COMMIT the transaction, so the guarantee is that all nodes commit *sometime* within that time frame, which is certainly possible to guarantee, see 2PC approaches). This guarantee is not provided by the Postgres-R algorithm, nor by the approach presented. Both only guarantee, that the transaction *will* get committed (and thus get visible) on all nodes *sometime* *after* the application requested to commit it (even in case of various failures, that is) [1]. As cited before, that has been enough of a reason for Jan Wieck to call Postgres-R asynchronous, and I certainly see his point. Note that the amount of time that passes between the commit acknowledgment and the actual commit on remote nodes may theoretically be infinitely long. And in practice certainly long enough for an application to notice the difference. However, it still is a practical optimization, because most applications should cope with it just fine. But not all... Do you consider the proposed log shipping approach to be synchronous? How about the Postgres-R algorithm? 
Regards Markus Wanner [1]: of course these approaches also guarantee that the transaction is committed on the local node *before* acknowledging commit, so that subsequent (seen from the application) queries are guaranteed to see the changes. But that guarantee only holds true for the local node.
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Simon Riggs wrote:
>> Hot Standby (although the latter
>> seems to have stalled a bit...)
>
> It's just being worked on asynchronously. ;-)

LOL, thanks for bringing humor into this discussion :-)

Regards

Markus Wanner
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Tom Lane wrote:
> We won't call it anything, because we never will or can implement that.
> See the theory of relativity: the notion of exactly simultaneous events
> at distinct locations isn't even well-defined

That has never been the point of the discussion. It's rather about the question if changes from transactions are guaranteed to be visible on remote nodes immediately after commit acknowledgment. Whether or not this is guaranteed, in both cases the term "synchronous replication" is commonly used, which is causing confusion.

Regards

Markus Wanner
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 21:35 +0200, Hannu Krosing wrote:
> We still could call Sync Rep as a feature "synchronous replication" on
> basis that "WAL Streaming - Synchronous Write" is the highest security
> level achievable using the feature.
>
> And maybe have Sync Hot Standby as a feature on top of that which
> provides "WAL Streaming - Synchronous Apply"

Or maybe better call it Serializable Hot Standby, as the actual guarantee that can be achieved is that when one client does something on master and, after committing on master, starts another transaction on slave, then the effects of the query on master are visible on slave.

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote:
> > I certainly agree to using such terms. Unfortunately, in my experience,
> > synchronous replication is commonly used to mean that transactions are
> > guaranteed to be immediately visible on remote nodes after the client
> > got commit acknowledgment. That's the cause for confusion I'm envisioning.
>
> I think that's a very important point. It's very possible that 8.4
> may support both this feature and Hot Standby (although the latter
> seems to have stalled a bit...). That makes me think "oh, great, I
> can offload any subset of my read-only queries to the standby". Not
> so fast.
>
> I think we need to reserve the term "synchronous replication" for a
> system where transactions that begin at the same time on the primary
> and standby see the same tuples.

Define "same time".

You can have a variant of sync rep + hot standby where the master does not return committed before the slave has both synced the data and replayed the transaction so that it is visible on the slave, but in that case you may have a use case where it is actually visible on the slave _before_ it is visible on the master.

Actually, you can't have that "same time" guarantee even on a single system: if you start two transactions on separate connections "at the same time", you still can't be sure there is no third transaction which has committed between those two and which makes the visible data on those two different.

> Clearly that is "more" synchronous
> than what is being proposed here; if we call this "synchronous
> replication", what will we call that? "Really Synchronous, Honest, No
> Kidding"? Admittedly, we may never implement that feature, but that
> seems irrelevant.
>
> It would be useful to have names for all the different possibilities.
> Random ideas:
>
> Log Shipping. After each log switch, the previous WAL log is copied
> to the standby in its entirety.
>
> WAL Streaming - Asynchronous. The WAL log is streamed from master to
> standby as it is written, but transactions on the master never wait.
>
> WAL Streaming - Synchronous Receive. The WAL log is streamed from
> master to standby as it is written, and transactions on the master
> wait until the standby acknowledges receipt of the WAL.
>
> WAL Streaming - Synchronous Write. The WAL log is streamed from
> master to standby as it is written, and transactions on the master
> wait until the standby acknowledges that the WAL has been written to
> disk.
>
> WAL Streaming - Synchronous Apply. The WAL log is streamed from
> master to standby as it is written, and transactions on the master
> wait until the standby acknowledges that WAL has been written to disk
> and applied.

We still could call Sync Rep as a feature "synchronous replication" on the basis that "WAL Streaming - Synchronous Write" is the highest security level achievable using the feature.

And maybe have Sync Hot Standby as a feature on top of that which provides "WAL Streaming - Synchronous Apply".

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote:
> Hot Standby (although the latter
> seems to have stalled a bit...)

It's just being worked on asynchronously. ;-)

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Synchronous replication ("sync rep") is *not* interested in the "slave's visibility of the commit", because PostgreSQL doesn't "serve" requests when in recovery (wal receiving) mode *now*.

This sync rep patch/proposal/discussion is *strictly* (at this point; hot standby may eventually, or hopefully soon, change that) the means to get the data "safely in 2 separate places" before the COMMIT returns, by means of wal streaming. That "safely in 2 places" can have various implementation options (like received, on disk, or applied), and Fujii-san explained some of the options as to what to consider "safe" and their trade-offs in his presentation last year.

Once both sync-rep (the wal-streaming that gets changes in two places) and hot-standby (run queries while WAL is being applied) are available in PostgreSQL, at that point we might need to start considering "other client visibility", but even then, we still don't need to worry about multi-master options...

a.

* Markus Wanner [081213 12:17]:
> Hi,
>
> Simon Riggs wrote:
> > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:
> >> Speaking of a "synchronous commit"
> >> is utterly misleading, because the commit itself is exactly the thing
> >> that's *not* synchronous.
> >
> > Not really sure where you're going here.
>
> I'm pointing to a potential misunderstanding, trying to help to prevent
> you from running into the same issues and discussions as I did.
>
> I've learned the hard way, that the Postgres-R algorithm is not fully
> synchronous (in the strict sense). This caused confusion for people who
> take the word "synchronous" by its original meaning. The algorithm
> proposed here seems similar enough to potentially cause the same confusion.
>
> As I see it now, I think it's well worth to point out the difference,
> from both, the technical as well as from the marketing perspective. The
> former for better understanding, the latter to prevent users from
> thinking it must be slow per definition.
Re: [HACKERS] Sync Rep: First Thoughts on Code
"Robert Haas" writes: > I think we need to reserve the term "synchronous replication" for a > system where transactions that begin at the same time on the primary > and standby see the same tuples. Clearly that is "more" synchronous > than what is being proposed here; if we call this "synchronous > replication", what will we call that? "Really Synchronous, Honest, No > Kidding"? Admittedly, we may never implement that feature, but that > seems irrelevant. We won't call it anything, because we never will or can implement that. See the theory of relativity: the notion of exactly simultaneous events at distinct locations isn't even well-defined, because observers at yet other locations will disagree about what is "simultaneous". And I'm not just making a joke here --- speed-of-light delays in a WAN are meaningful compared to current computer speeds. In practice, the slave and the master will never commit at exactly the same time. I agree with the point made upthread that we should use the term "synchronous replication" the way it's commonly used in the industry. Inventing our own terminology might be fun but it's not really going to result in less confusion. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
> I certainly agree to using such terms. Unfortunately, in my experience, > synchronous replication is commonly used to mean that transactions are > guaranteed to be immediately visible on remote nodes after the client > got commit acknowledgment. That's the cause for confusion I'm envisioning. I think that's a very important point. It's very possible that 8.4 may support both this feature and Hot Standby (although the latter seems to have stalled a bit...). That makes me think "oh, great, I can offload any subset of my read-only queries to the standby". Not so fast. I think we need to reserve the term "synchronous replication" for a system where transactions that begin at the same time on the primary and standby see the same tuples. Clearly that is "more" synchronous than what is being proposed here; if we call this "synchronous replication", what will we call that? "Really Synchronous, Honest, No Kidding"? Admittedly, we may never implement that feature, but that seems irrelevant. It would be useful to have names for all the different possibilities. Random ideas: Log Shipping. After each log switch, the previous WAL log is copied to the standby in its entirety. WAL Streaming - Asynchronous. The WAL log is streamed from master to standby as it is written, but transactions on the master never wait. WAL Streaming - Synchronous Receive. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges receipt of the WAL. WAL Streaming - Synchronous Write. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges that the WAL has been written to disk. WAL Streaming - Synchronous Apply. The WAL log is streamed from master to standby as it is written, and transactions on the master wait until the standby acknowledges that WAL has been written to disk and applied. 
...Robert
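As a rough illustration of the taxonomy in this message (a sketch only: the Python names below are made up for this example and do not correspond to any PostgreSQL setting or to the patch under discussion), the modes differ only in how far the standby must have progressed before the master may acknowledge COMMIT:

```python
from enum import IntEnum


class StandbyProgress(IntEnum):
    """How far a WAL record has progressed on the standby (hypothetical names)."""
    NOTHING = 0    # standby has not confirmed anything
    RECEIVED = 1   # standby has received the WAL
    WRITTEN = 2    # standby has written the WAL to disk
    APPLIED = 3    # standby has written and replayed the WAL


# Progress the master waits for before acknowledging COMMIT, per mode.
# NOTHING means transactions on the master never wait for the standby.
REQUIRED_PROGRESS = {
    "log shipping": StandbyProgress.NOTHING,
    "streaming, asynchronous": StandbyProgress.NOTHING,
    "streaming, synchronous receive": StandbyProgress.RECEIVED,
    "streaming, synchronous write": StandbyProgress.WRITTEN,
    "streaming, synchronous apply": StandbyProgress.APPLIED,
}


def commit_may_return(mode: str, standby_progress: StandbyProgress) -> bool:
    """Return True if the master may acknowledge COMMIT to the client."""
    return standby_progress >= REQUIRED_PROGRESS[mode]
```

For instance, `commit_may_return("streaming, synchronous write", StandbyProgress.WRITTEN)` is true even though nothing has been applied on the standby yet, which is exactly the visibility gap this thread is arguing about.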
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, Simon Riggs wrote: > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote: >> Speaking of a "synchronous commit" >> is utterly misleading, because the commit itself is exactly the thing >> that's *not* synchronous. > > Not really sure where you're going here. I'm pointing to a potential misunderstanding, trying to help to prevent you from running into the same issues and discussions as I did. I've learned the hard way, that the Postgres-R algorithm is not fully synchronous (in the strict sense). This caused confusion for people who take the word "synchronous" by its original meaning. The algorithm proposed here seems similar enough to potentially cause the same confusion. As I see it now, I think it's well worth to point out the difference, from both, the technical as well as from the marketing perspective. The former for better understanding, the later to prevent users from thinking it must be slow per definition. Arguing that your approach is not fully synchronous definitely helps defending that concern. However, I'm just now realizing, that the difference is only relevant as soon as you begin to allow read-only access on the slave. AFAIK that's among the goals of this effort, no? > "synchronous replication" is > used exactly as described in the Wikipedia entry here: > http://en.wikipedia.org/wiki/Database_replication That article describes pretty much all variants of replication, what exactly are you referring to? Under "Database Replication > Multi-Master replication" it describes eager vs lazy variants, which is IMO a more appropriate and useful distinction than sync vs async. (But that's admittedly a sentence I've contributed myself, IIRC). Under "Storage Replication > Synchronous Replication" one can read: "Write is not considered complete until acknowledgement by both local and remote storage." For the proposed approach this might hold true for WAL writing. 
However, the user certainly doesn't care how synchronous the log is shipped nor written, is as long as she doesn't see the changes on the slave. That's the difference between fully synchronous and eager (or virtually or approximately synchronous) algorithms. You seem to refer to both as "synchronous". Phrases like "synchronous commit" or "synchronous data transfer" do not help me to understand what exactly you are talking about. Explaining that the slave commits (and therefore makes the transactions visible) asynchronously would help. And it would prevent disappointment for users who expect changes to be immediately visible on the slave. > No two word phrase is going to accurately sum up the complexity and > potential for data loss in these situations. DRBD saw that too and just > called them A, B and C and then describe them more accurately. Agreed. I've chosen lazy, eager and sync, so far. I'm open for better terms, and I leave it up to you to call your variants whatever you like. But to understand what you are talking about, I'd prefer to get to know these distinctions crisp and clear. > But I don't think we should say "PostgreSQL just implemented algorithm > B" which is just unhelpful. I don't think its "marketing" to refer to it > by the phrase most commonly used for the technology we are building. I certainly agree to using such terms. Unfortunately, in my experience, synchronous replication is commonly used to mean that transactions are guaranteed to be immediately visible on remote nodes after the client got commit acknowledgment. That's the cause for confusion I'm envisioning. I'm hoping to be somewhat helpful to this effort of getting a log shipping replication variant into Postgres. It can only be beneficial for Postgres-R in that we gain field experience with ..uhm.. this special kind of replication, however we name it. I'm already on xmas vacation, so I won't bother you any further on this issue. 
Have fun coding and make sure to enjoy this time of the year. All the best. Markus Wanner
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:
> Speaking of a "synchronous commit"
> is utterly misleading, because the commit itself is exactly the thing
> that's *not* synchronous.

Not really sure where you're going here. "synchronous replication" is used exactly as described in the Wikipedia entry here: http://en.wikipedia.org/wiki/Database_replication

No two-word phrase is going to accurately sum up the complexity and potential for data loss in these situations. DRBD saw that too and just called them A, B and C and then described them more accurately. But I don't think we should say "PostgreSQL just implemented algorithm B", which is just unhelpful. I don't think it's "marketing" to refer to it by the phrase most commonly used for the technology we are building. Nobody suggested we call it "wizrep" or suchlike...

The docs can contain the exact description of data loss and timing windows.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
On 2008-12-13, at 13:07, Markus Wanner wrote:
> However, that is a marketing decision [1], which should not be mixed
> with the technical discussion here. Speaking of a "synchronous commit"
> is utterly misleading, because the commit itself is exactly the thing
> that's *not* synchronous.
>
> [1]: Some people like the term "virtually synchronous" for marketing
> purposes. That's at least half-ways technically correct.

Marketing people are virtually trustworthy, from my life experience. If you ask me, this is just preposterous.
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, Simon Riggs wrote: > You're right that neither the data transfer nor data availability is > entirely synchronous, but data transfer is synchronous at time of > *commit*: it is recorded on multiple nodes at the same time. I'm unsure what you mean by a "data transfer being synchronous". To what other process or state should the data transfer be synchronous to? > The term "synchronous replication" is already well used in the industry > to mean synchronous commit, so I don't think we should change the name > now. The project here is also known to everybody as "synch rep". I understand very well, that you don't want to change the name. I've been hesitant to "relabel" Postgres-R from synchronous to asynchronous to eager. However, that is a marketing decision [1], which should not be mixed with the technical discussion here. Speaking of a "synchronous commit" is utterly misleading, because the commit itself is exactly the thing that's *not* synchronous. It *is* an optimization to fully synchronous replication to defer commit on the "slave" and only make sure that the transaction *can* be applied at some time in the future. However, this *does* have the drawback of transactions not being immediately visible on the slave. Often enough, this is acceptable. But it certainly matters to some applications developers. > What is confusing is that "replication" itself is a much abused term and > is used to describe technologies for HA, DR and data movement. I absolutely agree to that. And I'm thus recommending to at least be consistent and honest with the term "synchronous" and point out that WAL writing is synchronous for the log shipping approach here (AFAIK). But that the commit is asynchronous for performance reasons. In other words: this approach is certainly (and hopefully, for performance reasons) different from a fully synchronous approach. Even for marketing reasons, it might make sense to point out that difference (.. "no, we are faster than fully sync rep."). 
Regards Markus Wanner [1]: Some people like the term "virtually synchronous" for marketing purposes. That's at least half-ways technically correct.
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Sat, 2008-12-13 at 00:00 +0100, Markus Wanner wrote:
> Hi,
>
> Fujii Masao wrote:
> > I'd like to define the meaning of "synch rep" again. "synch rep" means:
> >
> > (1) Transaction commit waits for WAL records to be replicated to the standby
> > before the command returns a "success" indication to the client.
> >
> > (2) The standby has (can read) all WAL files indispensable for recovery.
>
> Let me point out that - very much like the original Postgres-R algorithm
> - this guarantees committed transactions to be durable and consistent
> (no late aborts of conflicting transactions), but it does not guarantee
> that a transaction committed on one node is immediately visible on the
> other node. In that sense, it is not synchronous as commonly understood,
> because it does not "operate with all their parts in synchrony" [1], as
> implied by the term "synchronous". This might (and often has in the
> past) lead to confusion.

You're right that neither the data transfer nor data availability is entirely synchronous, but data transfer is synchronous at time of *commit*: it is recorded on multiple nodes at the same time.

The term "synchronous replication" is already well used in the industry to mean synchronous commit, so I don't think we should change the name now. The project here is also known to everybody as "synch rep".

* Oracle Data Guard calls it "synchronous redo transport"
* MS Exchange calls it "synchronous replication"
* MS SQL Server has "Database Mirroring", "Log Shipping" and "Replication". "Database Mirroring" provides a synchronous mechanism, with "Replication" meaning data transfer to other databases, publish&subscribe.
* DB2 HADR provides "synchronous replication"
* MySQL calls it "synchronous replication"

What is confusing is that "replication" itself is a much abused term and is used to describe technologies for HA, DR and data movement.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

Fujii Masao wrote:
> I'd like to define the meaning of "synch rep" again. "synch rep" means:
>
> (1) Transaction commit waits for WAL records to be replicated to the standby
> before the command returns a "success" indication to the client.
>
> (2) The standby has (can read) all WAL files indispensable for recovery.

Let me point out that - very much like the original Postgres-R algorithm - this guarantees committed transactions to be durable and consistent (no late aborts of conflicting transactions), but it does not guarantee that a transaction committed on one node is immediately visible on the other node. In that sense, it is not synchronous as commonly understood, because it does not "operate with all their parts in synchrony" [1], as implied by the term "synchronous". This might (and often has in the past) lead to confusion.

It's certainly enough of a reason for me to rather use the term "eager replication". See [2] for a more in-depth explanation. I might also point out that Jan Wieck called this very same approach "an asynchronous replication system by all means" [3].

Regards

Markus Wanner

[1]: Wikipedia on Synchronization
http://en.wikipedia.org/wiki/Synchronization

[2]: Postgres-R general mailing list, by Markus Wanner, subject: terms for database replication: synchronous vs eager
http://lists.pgfoundry.org/pipermail/postgres-r-general/2008-September/14.html

[3]: Postgres General mailing list, by Jan Wieck, subject: terms for database replication: synchronous vs eager
http://archives.postgresql.org/pgsql-hackers/2007-09/msg00631.php
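To make the distinction in this message concrete, here is a toy, single-process model (an assumption-laden sketch, not the patch's code; the `Master` and `Standby` classes and all their methods are invented for illustration). The master's commit waits until the standby has written the WAL record, but the standby replays it later, so a read on the standby right after COMMIT can miss the change:

```python
class Standby:
    def __init__(self):
        self.wal = []       # WAL records written on the standby
        self.applied = 0    # number of WAL records replayed so far
        self.data = {}      # state visible to read-only queries

    def receive_and_write(self, record):
        """Store a WAL record (stands in for write + fsync) and acknowledge."""
        self.wal.append(record)
        return len(self.wal)    # acknowledgment: position written so far

    def apply_pending(self):
        """Replay all WAL records that have been written but not yet applied."""
        for key, value in self.wal[self.applied:]:
            self.data[key] = value
        self.applied = len(self.wal)


class Master:
    def __init__(self, standby):
        self.standby = standby
        self.wal = []
        self.data = {}

    def commit(self, key, value):
        """Commit locally only after the standby confirmed writing the WAL."""
        record = (key, value)
        self.wal.append(record)
        written = self.standby.receive_and_write(record)   # commit waits here
        if written < len(self.wal):
            raise RuntimeError("standby has not written all WAL yet")
        self.data[key] = value    # change becomes visible on the master
        return "COMMIT"
```

After `master.commit("x", 1)` returns, the record is durable on both nodes (guarantees (1) and (2) above), yet `standby.data` does not contain `"x"` until `standby.apply_pending()` has run.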
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Fri, 2008-12-12 at 14:23 -0500, Aidan Van Dyk wrote:
> So when would I have to call that function? Before begin, after begin,
> before commit, or all, to guarantee that I know that my application is
> supposed to "delay" calling commit until sync-mode is actually
> synchronous? And then afterwards, I have to call it again to make sure
> it didn't fall "out of" mode between my previous call and the commit
> actually working?

I'm not suggesting that applications call the function. It's a way for a monitoring system to know that you're in a degraded state and notify you.

I'm not sure I entirely understand the use case you're advocating:

Let's say the standby has a major failure. Now you have a single point of failure (the primary), so _all_ of your transactions are in jeopardy anyway -- at least until you get back into sync rep. Rejecting new transactions won't save your old ones.

The only time it helps is when the failure is temporary, i.e. you didn't really lose the storage on the standby. But you would need to rely on some guarantee that the storage is still intact on the standby system even though the standby is unresponsive. Is that the use case?

Regards,
Jeff Davis
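The monitoring use described in this message could look roughly like the following sketch. Everything here is hypothetical: `is_sync_rep_active` stands in for whatever status function the server would expose (its name and form were not settled in this thread), and `notify` is whatever alerting hook the monitoring system provides.

```python
class SyncRepMonitor:
    """Polls a status callback and reports transitions into or out of degraded mode."""

    def __init__(self, is_sync_rep_active, notify):
        self.is_sync_rep_active = is_sync_rep_active  # hypothetical status check
        self.notify = notify                          # hypothetical alert hook
        self.degraded = False

    def poll(self):
        """Check once; notify only when the state changes. Returns True if degraded."""
        active = self.is_sync_rep_active()
        if not active and not self.degraded:
            self.notify("sync rep degraded: primary is running without a synchronous standby")
        elif active and self.degraded:
            self.notify("sync rep restored")
        self.degraded = not active
        return self.degraded
```

The application itself never calls the status function here; only the monitor does, on its own schedule, which matches the intent stated above.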