Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:

  We might ask why pg_start_backup() needs to perform checkpoint though,
  since you have remarked that is a problem also.
 
  The answer is that it doesn't really need to, we just need to be certain
  that archiving has been running since whenever we choose as the start
  time. So we could easily just use the last normal checkpoint time, as
  long as we had some way of tracking the archiving.
 
  ISTM we can solve the checkpoint problem more easily and it would
  potentially save much more time than tuning rsync for Postgres, which
  is what the other idea amounted to. So I do see a solution that is both
  better and more quickly achievable for 8.4.
 
 Sounds good. I agree that pg_start_backup basically doesn't need a
 checkpoint. But with full_page_writes = off, we probably cannot get
 rid of it. Even with full_page_writes = on, since we cannot tell
 whether all necessary full pages were written after the last checkpoint,
 pg_start_backup must do a checkpoint with forcePageWrites = on.

Yes, OK. So I think it would only work when full_page_writes = on, and
has been on since last checkpoint. So two changes:

* We just need a boolean that starts at true every checkpoint and gets
set to false anytime someone resets full_page_writes or archive_command.
If the flag is set && full_page_writes = on then we skip the checkpoint
entirely and use the value from the last checkpoint.

* My infra patch also had a modified version of pg_start_backup() that
allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
a waste of time, and I want to listen to everybody else now and change
pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
there.

Can you work on those also?
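
A minimal C sketch of the flag described in the first bullet above. The struct and function names are hypothetical and do not come from any patch in this thread; the sketch only shows the intended state transitions: the flag becomes true at every checkpoint, is cleared when full_page_writes or archive_command is reset, and the checkpoint is skipped only if the flag is still set and full_page_writes is on.

```c
#include <stdbool.h>

/* Hypothetical state; names are illustrative, not taken from xlog.c. */
typedef struct BackupCkptState
{
	bool	skip_ok;			/* true after a checkpoint, cleared on a reset */
	bool	full_page_writes;	/* current value of the full_page_writes setting */
} BackupCkptState;

/* At the end of every checkpoint the flag starts at true again. */
void
on_checkpoint_done(BackupCkptState *st)
{
	st->skip_ok = true;
}

/* Any change of full_page_writes invalidates the last checkpoint for reuse. */
void
on_full_page_writes_change(BackupCkptState *st, bool newval)
{
	if (st->full_page_writes != newval)
		st->skip_ok = false;
	st->full_page_writes = newval;
}

/* Resetting archive_command also clears the flag. */
void
on_archive_command_change(BackupCkptState *st)
{
	st->skip_ok = false;
}

/* pg_start_backup() could reuse the last checkpoint only if both hold. */
bool
can_skip_checkpoint(const BackupCkptState *st)
{
	return st->skip_ok && st->full_page_writes;
}
```

In this sketch, turning full_page_writes off and back on between two checkpoints still forces a checkpoint at the next backup start, because the flag stays false until the next checkpoint completes.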

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Yes, OK. So I think it would only work when full_page_writes = on, and
 has been on since last checkpoint. So two changes:

 * We just need a boolean that starts at true every checkpoint and gets
 set to false anytime someone resets full_page_writes or archive_command.
 If the flag is set && full_page_writes = on then we skip the checkpoint
 entirely and use the value from the last checkpoint.

Sounds good.

Does pg_start_backup on the standby (which you are probably planning?) also
need this logic? If so, resetting full_page_writes or archive_command should
generate an xlog record.

I have another thought: should we forbid resetting archive_command
during an online backup? Currently it is allowed. If we don't need to allow it,
we also don't need to track the reset of archiving for fast pg_start_backup.


 * My infra patch also had a modified version of pg_start_backup() that
 allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
 a waste of time, and I want to listen to everybody else now and change
 pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
 there.

 Can you work on those also?

Umm.. I'm busy. Of course, I will try it if no one else volunteers.
But I'd like to put coding the core of sync rep ahead of this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Yes, OK. So I think it would only work when full_page_writes = on, and
 has been on since last checkpoint. So two changes:

 * We just need a boolean that starts at true every checkpoint and gets
 set to false anytime someone resets full_page_writes or archive_command.
 If the flag is set && full_page_writes = on then we skip the checkpoint
 entirely and use the value from the last checkpoint.

 Sounds good.

I attached a self-contained patch to skip the checkpoint at pg_start_backup.


 Does pg_start_backup on the standby (which you are probably planning?) also
 need this logic? If so, resetting full_page_writes or archive_command should
 generate an xlog record.

For now, the patch doesn't handle this.


 I have another thought: should we forbid resetting archive_command
 during an online backup? Currently it is allowed. If we don't need to allow it,
 we also don't need to track the reset of archiving for fast pg_start_backup.

For now, the patch doesn't handle this either.

Happy Holidays!

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 14:57:27 -
***************
*** 295,300 ****
--- 295,302 ----
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***************
*** 6025,6036 ****
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
--- 6027,6043 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* 
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
+ 		xlogctl->bkpCount = 0;
+ 		xlogctl->bkpForceCkpt = true;
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
***************
*** 6502,6507 ****
--- 6509,6535 ----
  	}
  }
  
+ bool
+ assign_full_page_writes(bool newval, bool doit, GucSource source)
+ {
+ 	/*
+ 	 * If full_page_writes is reset, since all indispensable full pages
+ 	 * might not be written since last checkpoint, we force a checkpoint
+ 	 * at pg_start_backup.
+ 	 */
+ 	if (doit && fullPageWrites != newval)
+ 	{
+ 		/* use volatile 

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Simon Riggs

On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote:
 Hi,
 
 On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote:
  Hi,
 
  On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Yes, OK. So I think it would only work when full_page_writes = on, and
  has been on since last checkpoint. So two changes:
 
  * We just need a boolean that starts at true every checkpoint and gets
  set to false anytime someone resets full_page_writes or archive_command.
  If the flag is set && full_page_writes = on then we skip the checkpoint
  entirely and use the value from the last checkpoint.
 
  Sounds good.
 
 I attached the self-contained patch to skip checkpoint at pg_start_backup.

Good.

Can we change to IMMEDIATE when we need the checkpoint?

What is bkpCount for? I think we should discuss whatever that is for
separately. It isn't used in any if test, AFAICS.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao
Hi,

I fixed some bugs.

On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Can we change to IMMEDIATE when we need the checkpoint?

Perhaps yes, though the current patch doesn't handle it.
I'm not sure if we really need the feature. Yes, as you say,
I'd also like to listen to everybody else.


 What is bkpCount for?

So far, the name of a backup history file consists only of the
checkpoint redo location. But with this patch, since several
backups can use the same checkpoint, a backup history file
could be overwritten. So I introduced bkpCount as an ID for
backups that use the same checkpoint.
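
A rough C sketch of why a per-checkpoint backup ID avoids the overwrite. The file name layout below is hypothetical and is not PostgreSQL's real backup history file naming; the helper names are also invented. Only the counter that restarts at each checkpoint and the limit of 256 (MAX_BKP_COUNT) come from the attached patch, and the exact boundary comparison is an assumption.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_BKP_COUNT 256		/* limit used in the attached patch */

/* Hypothetical counter of backups that share the latest checkpoint. */
typedef struct BackupIdState
{
	uint32_t	bkp_count;
} BackupIdState;

/* Called when a checkpoint completes: IDs start from zero again. */
void
backup_id_reset(BackupIdState *st)
{
	st->bkp_count = 0;
}

/*
 * Hand out the next ID for the current checkpoint. Returns false once the
 * limit is reached, meaning the caller should force a new checkpoint.
 */
bool
backup_id_next(BackupIdState *st, uint32_t *id)
{
	if (st->bkp_count >= MAX_BKP_COUNT)
		return false;
	*id = st->bkp_count++;
	return true;
}

/* Hypothetical name layout: redo location followed by the backup ID. */
int
backup_history_name(char *buf, size_t len,
					uint32_t redo_hi, uint32_t redo_lo, uint32_t id)
{
	return snprintf(buf, len, "%08" PRIX32 ".%08" PRIX32 ".%03" PRIu32 ".backup",
					redo_hi, redo_lo, id);
}
```

Two backups started against the same redo location get IDs 0 and 1 here, so their history file names differ even though the checkpoint is shared.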

 I think we should discuss whatever that is for
 separately. It isn't used in any if test, AFAICS.

Yes, this patch is a testbed. We need to discuss it more.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 18:13:45 -
***************
*** 295,300 ****
--- 295,302 ----
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***************
*** 318,323 ****
--- 320,332 ----
  static XLogCtlData *XLogCtl = NULL;
  
  /*
+  * We don't allow more than MAX_BKP_COUNT backups to use the same checkpoint.
+  * If XLogCtl->bkpCount > MAX_BKP_COUNT, we force new checkpoint at pg_standby
+  * even if there are all indispensable full pages since last checkpoint.
+  */
+ #define MAX_BKP_COUNT 256
+ 
+ /*
    * We maintain an image of pg_control in shared memory.
    */
  static ControlFileData *ControlFile = NULL;
***************
*** 6025,6036 ****
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
--- 6034,6050 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* 
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use 

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs
On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote:

  XLogFlush() flushes because of an interlock between a dirty buffer write
  and an outstanding WAL write. Dirty buffer writes are not replicated, so
  there is no need to have a similar interlock on WAL streaming.
 
  So making those call points synchronous is possible, but neither
  necessary nor IMHO desirable.
 
 Yes for the upcoming 8.4, but probably not in the future.
 
 What if the primary fails after writing a dirty data buffer but before sending
 the corresponding logs? This would make the data on the primary and the logs
 on the standby inconsistent. In 8.4, such inconsistency might not matter
 because we don't use the data on the failed primary for recovery (when
 restarting the failed server, we always need a fresh backup). But since
 this restriction is not good for some people, in the future the failed server
 should restart without a fresh backup, and then the inconsistency would be a
 problem. So I think the inconsistency should be removed even in the
 asynchronous replication case, and we should enforce the WAL rule across
 multiple servers.

I don't get this argument. Why would we care what happens on the failed server?

The additional synchronizations you suggest are neither necessary, nor
IMHO desirable.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Tue, Dec 23, 2008 at 5:22 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote:

  XLogFlush() flushes because of an interlock between a dirty buffer write
  and an outstanding WAL write. Dirty buffer writes are not replicated, so
  there is no need to have a similar interlock on WAL streaming.
 
  So making those call points synchronous is possible, but neither
  necessary nor IMHO desirable.

 Yes for the upcoming 8.4, but probably not in the future.

 What if the primary fails after writing a dirty data buffer but before sending
 the corresponding logs? This would make the data on the primary and the logs
 on the standby inconsistent. In 8.4, such inconsistency might not matter
 because we don't use the data on the failed primary for recovery (when
 restarting the failed server, we always need a fresh backup). But since
 this restriction is not good for some people, in the future the failed server
 should restart without a fresh backup, and then the inconsistency would be a
 problem. So I think the inconsistency should be removed even in the
 asynchronous replication case, and we should enforce the WAL rule across
 multiple servers.

 I don't get this argument. Why would we care what happens on the failed 
 server?

It's because, in the future, I'd like to use the data on the failed server when
making it catch up with new primary. This desire might be violated by the
inconsistency which I described.


 The additional synchronizations you suggest are neither necessary, nor
 IMHO desirable.

Not additional. It's quite analogous to synchronous_commit.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote:
  I don't get this argument. Why would we care what happens on the
 failed server?
 
 It's because, in the future, I'd like to use the data on the failed
 server when making it catch up with new primary. This desire might be
 violated by the inconsistency which I described.

I don't really understand why you would put something in there that has
no use at all. Why make every server in the world do extra
synchronisation? 

Whatever you build in the future can include this, if that is still a
required point at the time you add the new feature.

Are you thinking about switchover rather than failover? I'm sure a
graceful switchover doesn't need this.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Tue, Dec 23, 2008 at 6:28 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote:
  I don't get this argument. Why would we care what happens on the
 failed server?

 It's because, in the future, I'd like to use the data on the failed
 server when making it catch up with new primary. This desire might be
 violated by the inconsistency which I described.

 I don't really understand why you would put something in there that has
 no use at all. Why make every server in the world do extra
 synchronisation?

 Whatever you build in the future can include this, if that is still a
 required point at the time you add the new feature.

Right. But since it's difficult to change a specification once it is fixed,
I am thinking it over now for the future.

But since I cannot obtain consensus from hackers including you,
I will change course and forbid XLogFlush (when called from anywhere other
than RecordTransactionCommit) from replicating xlog synchronously
in the asynchronous replication case.

BTW, here are the callers other than RecordTransactionCommit:
- CreateCheckPoint()
- EndPrepare()
- FlushBuffer()
- RecordTransactionAbortPrepared()
- RecordTransactionCommitPrepared()
- RelationTruncate()
- SlruPhysicalWritePage()
- WriteTruncateXlogRec()
- XLogAsyncCommitFlush()
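
The two positions in this exchange can be contrasted in a small C sketch. Everything here is hypothetical (the enum names, the function, and the assumption that every XLogFlush caller waits when replication is synchronous); it only illustrates the decision being debated: whether XLogFlush callers other than RecordTransactionCommit should wait for the standby in the asynchronous case.

```c
#include <stdbool.h>

/* A few of the XLogFlush call sites listed above, as illustrative tags. */
typedef enum FlushCaller
{
	CALLER_RECORD_TRANSACTION_COMMIT,
	CALLER_CREATE_CHECKPOINT,
	CALLER_FLUSH_BUFFER,
	CALLER_OTHER
} FlushCaller;

/* The two policies discussed in the thread (names are invented). */
typedef enum WaitPolicy
{
	WAIT_ENFORCE_WAL_RULE,		/* non-commit flushes wait even when async */
	WAIT_COMMIT_ONLY			/* non-commit flushes never wait when async */
} WaitPolicy;

/* Decide whether a flush must wait for the standby to receive the WAL. */
bool
flush_waits_for_standby(WaitPolicy policy, bool sync_replication,
						FlushCaller caller)
{
	/* Assumption: with synchronous replication every caller waits. */
	if (sync_replication)
		return true;

	/* Asynchronous replication: a commit does not wait for the standby. */
	if (caller == CALLER_RECORD_TRANSACTION_COMMIT)
		return false;

	/* The remaining callers wait only if the WAL rule spans both servers. */
	return policy == WAIT_ENFORCE_WAL_RULE;
}
```

Under WAIT_COMMIT_ONLY, which corresponds to the changed course described above, a dirty-buffer flush on an asynchronously replicated primary returns without waiting for the standby.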


 Are you thinking about switchover rather than failover? I'm sure a
 graceful switchover doesn't need this.

Yes, switchover is one of the cases I care about. Typically, I care
about restarting the failed server (original primary) after failover:

-
1. a dirty buffer page is chosen as victim of buffer replacement
2. flush xlog up to the buffer's LSN on only primary
3. write out the dirty buffer page
4. primary fails
(replication up to buffer's LSN is not performed)

The above case produces inconsistency between data on the
original primary (failed server) and xlogs on the original standby
(new primary after failover). Isn't this right?

5. restart the failed server and make it catch up with new primary

We cannot recycle the existing data on the failed server because
of that inconsistency. I think this restriction should be removed.
-
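
The scenario in steps 1 to 5 can be modelled with a few LSN comparisons. This is a simplified C sketch under stated assumptions: LSNs are plain 64-bit integers rather than PostgreSQL's XLogRecPtr, and the struct and function names are hypothetical. It shows that a page written under the local WAL rule can be newer than anything the standby has received.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* simplified stand-in for a WAL location */

/* Hypothetical view of the old primary at the moment it fails. */
typedef struct PrimarySim
{
	Lsn		flushed_lsn;			/* WAL flushed to the primary's own disk */
	Lsn		max_page_lsn_on_disk;	/* newest page LSN written to data files */
} PrimarySim;

/*
 * Steps 2 and 3: flush local WAL up to the page's LSN, then write the page.
 * No wait for the standby happens here.
 */
void
evict_dirty_page(PrimarySim *p, Lsn page_lsn)
{
	if (p->flushed_lsn < page_lsn)
		p->flushed_lsn = page_lsn;	/* WAL rule, enforced locally only */
	if (p->max_page_lsn_on_disk < page_lsn)
		p->max_page_lsn_on_disk = page_lsn;
}

/*
 * Step 5: the old data files can be reused only if no page on disk reflects
 * WAL that the standby (now the new primary) never received.
 */
bool
old_primary_is_consistent(const PrimarySim *p, Lsn standby_received_lsn)
{
	return p->max_page_lsn_on_disk <= standby_received_lsn;
}
```

For example, if the standby had received WAL up to 120 when the primary wrote a page with LSN 150 and then failed, old_primary_is_consistent() returns false, which is the inconsistency described above.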

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Pavan Deolasee
On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao masao.fu...@gmail.com wrote:


 But since I cannot obtain consensus from hackers including you,
 I will change course and forbid XLogFlush (when called from anywhere other
 than RecordTransactionCommit) from replicating xlog synchronously
 in the asynchronous replication case.


Since synchronous/asynchronous behavior of replication is tied to a
transaction (even if there is global default) , I don't understand why
we should not ship the xlogs to the standby when xlogs are written on
primary outside of a transaction context.  This is quite same as we do
with asynchronous_commit where we flush the xlog to disk at certain
points irrespective of the synchronization set.


 Yes, switchover is one of the cases I care about. Typically, I care
 about restarting the failed server (original primary) after failover:


I think this is a very important requirement because it's quite
unrealistic to expect that every time there is a failover, fresh
backup is required for the old primary to join back the replication.

 -
 1. a dirty buffer page is chosen as victim of buffer replacement
 2. flush xlog up to the buffer's LSN on only primary
 3. write out the dirty buffer page
 4. primary fails
(replication up to buffer's LSN is not performed)

 The above case produces inconsistency between data on the
 original primary (failed server) and xlogs on the original standby
 (new primary after failover). Isn't this right?


Yes, it would create inconsistency which I don't think can be
corrected without a fresh backup.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 16:54 +0530, Pavan Deolasee wrote:
 On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao masao.fu...@gmail.com wrote:
 
  But since I cannot obtain consensus from hackers including you,
  I will change course and forbid XLogFlush (when called from anywhere other
  than RecordTransactionCommit) from replicating xlog synchronously
  in the asynchronous replication case.
 
 Since synchronous/asynchronous behavior of replication is tied to a
 transaction (even if there is global default) , I don't understand why
 we should not ship the xlogs to the standby when xlogs are written on
 primary outside of a transaction context.  This is quite same as we do
 with asynchronous_commit where we flush the xlog to disk at certain
 points irrespective of the synchronization set.

We stream constantly from primary to standby. That point is not being
debated. The issue is whether we should add additional synchronisation
points (i.e. additional times we need to wait) into the WAL stream.
Currently, I have said no because this has no purpose in the current
design: definitely not performance, not robustness, not code clarity.

Specifically, we're talking about slowing down WAL flushes required
because of dirty page replacement, amongst others. That's not something
I want to see slowed down on a server that has specifically opted for
asynchronous replication, presumably because of a slow link. The other
call points are also potential contention points.

  Yes, switchover is one of the cases I care about. Typically, I care
  about restarting the failed server (original primary) after failover:
 
 
 I think this is a very important requirement because it's quite
 unrealistic to expect that every time there is a failover, fresh
 backup is required for the old primary to join back the replication.

I personally don't expect that, because we have rsync.

If that is a very important requirement then the current software needs
to include all the aspects of a feature, not just some of them. Either
we include a whole feature or we leave it out. A release will need to
stand for 5+ years, so supporting extraneous features is troublesome and
wasteful.

Currently, Fujii-san has stated he is not planning to allow fast
resynchronization in 8.4, so why would we need this?

If we were to add fast resynchronisation as a feature in 8.4, then I
will be happy to have *all* required changes included. People mention it
enough that I would be happy to see the whole feature added in this
release.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Pavan Deolasee
On Tue, Dec 23, 2008 at 5:55 PM, Simon Riggs si...@2ndquadrant.com wrote:


 We stream constantly from primary to standby. That point is not being
 debated. The issue is whether we should add additional synchronisation
 points (i.e. additional times we need to wait) into the WAL stream.
 Currently, I have said no because this has no purpose in the current
 design: definitely not performance, not robustness, not code clarity.

 Specifically, we're talking about slowing down WAL flushes required
 because of dirty page replacement, amongst others. That's not something
 I want to see slowed down on a server that has specifically opted for
 asynchronous replication, presumably because of a slow link. The other
 call points are also potential contention points.

So we would still be sending WAL to standby at XLogWrite time (and I
think that's necessary). The question is whether we should wait for
standby ack at XLogFlush time, right ? Hmm. I think the argument for
that would be what Fujii-san described for maintaining consistency
between data and WAL. I agree with you that we should add additional
synchronization points only if they give us any real value in
administrating replication setup. Personally, I would like to have a
simple setup where I can initially setup primary and standby and they
continue to work in a single-failure mode without any additional
administrative overhead (such as rsync). But that's just me and I
don't know what the preferred option in the field is.

BTW, I wouldn't worry too much about the dirty buffer case because the
WAL synchronization at that point usually occurs much later than the
WAL is actually sent to the standby. I would imagine that most of the
time the WAL would have made it to the standby by that time.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 18:36 +0530, Pavan Deolasee wrote:

 Personally, I would like to have a
 simple setup where I can initially setup primary and standby and they
 continue to work in a single-failure mode without any additional
 administrative overhead (such as rsync). But that's just me and I
 don't know what the preferred option in the field is.

If you want a tripod, you need to turn up with all 3 legs. :-) 

PostgreSQL is a working product, not a framework or a function library.
We're not going to add code that has no function at all other than as
part of a larger feature, unless we add the whole feature.

I'm happy if that whole feature is added. If we do add it, it will be a
utility like pg_resync. So in admin terms it will be almost identical
to using rsync, just a specific version that minimizes effort even more
than rsync does currently. The only difference as I see it would be some
gain in performance, but we don't need to send the whole database down
the wire again in either case.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I'm happy if that whole feature is added. If we do add it, it will be a
 utility like pg_resync. So in admin terms it will be almost identical
 to using rsync, just a specific version that minimizes effort even more
 than rsync does currently. The only difference as I see it would be some
 gain in performance, but we don't need to send the whole database down
 the wire again in either case.

I think that the type of your user is different from mine. If a server fails
by simple termination of the process, I don't want to spend 1 min on
restarting beyond the catch-up itself. For me, getting a fresh backup
(not only copying the backup data but also the checkpoint by pg_start_backup)
is an expensive operation.

Of course, since I'm not planning to tackle that problem in 8.4,
I would not add additional synchronization point.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Tue, Dec 23, 2008 at 11:31 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Of course, since I'm not planning to tackle that problem in 8.4,
 I would not add additional synchronization point.

Second thought:
For normal shutdown case, we probably should force synchronous
replication in CreateCheckPoint at least.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 23:31 +0900, Fujii Masao wrote:
 Hi,
 
 On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
  I'm happy if that whole feature is added. If we do add it, it will be a
  utility like pg_resync. So in admin terms it will be almost identical
  to using rsync, just a specific version that minimizes effort even more
  than rsync does currently. The only difference as I see it would be some
  gain in performance, but we don't need to send the whole database down
  the wire again in either case.
 
 I think your type of user is different from mine.

Perhaps, but why do you say that? I've not blocked you from adding
anything useful to Postgres.

 If the server fails
 by simple termination of a process, I don't want to spend 1 min on
 restarting, beyond the catch-up itself. For me, getting a fresh backup
 (not only copying the backup data but also the checkpoint performed by
 pg_start_backup) is an expensive operation.

As I said: I'm happy if that whole feature is added.

You scare me that you see failover as sufficiently frequent that you are
worried that being without one of the servers for an extra 60 seconds
during a failover is a problem. And then say you're not going to add the
feature after all. I really don't understand. If it's important, add the
feature, the whole feature that is. If not, don't.

My expectation is that most failovers are serious ones, that the primary
system is down and not coming back very fast. Your worries seem to come
from a scenario where the primary system is still up but Postgres
bounces/crashes, we can diagnose the cause of the crash, decide the
crashed server is safe and then wish to recommence operations on it
again as quickly as possible, where seconds count in doing so.

Are failovers going to be common? Why?

 Of course, since I'm not planning to tackle that problem in 8.4,

If you change your mind, having it in 8.4 would be good. 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Mark Mielke

Simon Riggs wrote:

You scare me that you see failover as sufficiently frequent that you are
worried that being without one of the servers for an extra 60 seconds
during a failover is a problem. And then say you're not going to add the
feature after all. I really don't understand. If it's important, add the
feature, the whole feature that is. If not, don't.

My expectation is that most failovers are serious ones, that the primary
system is down and not coming back very fast. Your worries seem to come
from a scenario where the primary system is still up but Postgres
bounces/crashes, we can diagnose the cause of the crash, decide the
crashed server is safe and then wish to recommence operations on it
again as quickly as possible, where seconds count in doing so.

Are failovers going to be common? Why?
  


Hi Simon:

I agree with most of your criticism of the failover-only approach, but
I don't agree that failover frequency should really affect expectations
for the failed system to return to service. I see soft fails (*not*
serious) as potentially common: somewhere on the network, something
went down or some packet was lost, and the system took a few too many
seconds to respond. My expectation is that the system can quickly
detect that the node is out of service and remove it from the pool, and
that when the situation is resolved (often automatically, outside of my
control) the node automatically catches up and is put back into the
pool. Having to run some other process such as rsync seems unreliable,
as we already have a mechanism for streaming the data. All that is
missing is streaming from an earlier point in time to catch up
efficiently and reliably.


I think I'm talking more about the complete solution though which is in 
line with what you are saying? :-)


Cheers,
mark

--
Mark Mielke m...@mielke.cc




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 12:38 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Perhaps, but why do you say that?

Since you have often pointed out that getting a backup is not a problem
because of incremental backup (e.g. rsync), I just thought so.

 I've not blocked you from adding
 anything useful to Postgres.

Yes, I see.

 You scare me that you see failover as sufficiently frequent that you are
 worried that being without one of the servers for an extra 60 seconds
 during a failover is a problem. And then say you're not going to add the
 feature after all. I really don't understand. If it's important, add the
 feature, the whole feature that is. If not, don't.

Oh, sorry. I don't want to scare you ;) But, yes, it's important. Should
we rethink the question of why the failed server always needs a fresh
backup? We did discuss it previously, though, and concluded that it
should be done next time.
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

 My expectation is that most failovers are serious ones, that the primary
 system is down and not coming back very fast. Your worries seem to come
 from a scenario where the primary system is still up but Postgres
 bounces/crashes, we can diagnose the cause of the crash, decide the
 crashed server is safe and then wish to recommence operations on it
 again as quickly as possible, where seconds count in doing so.

 Are failovers going to be common? Why?

As you say, not *all* failovers are serious ones. I think that a user
would choose the most convenient restart method according to his
or her situation (come back immediately? need careful diagnosis?).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:

 Oh, sorry. I don't want to scare you ;) But, yes, it's important. Should
 we rethink the question of why the failed server always needs a fresh
 backup? We did discuss it previously, though, and concluded that it
 should be done next time.
 http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

We might ask why pg_start_backup() needs to perform checkpoint though,
since you have remarked that is a problem also.

The answer is that it doesn't really need to, we just need to be certain
that archiving has been running since whenever we choose as the start
time. So we could easily just use the last normal checkpoint time, as
long as we had some way of tracking the archiving.

ISTM we can solve the checkpoint problem more easily and it would
potentially save much more time than tuning rsync for Postgres, which
is what the other idea amounted to. So I do see a solution that is both
better and more quickly achievable for 8.4.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 2:37 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:

 Oh, sorry. I don't want to scare you ;) But, yes, it's important. Should
 we rethink the question of why the failed server always needs a fresh
 backup? We did discuss it previously, though, and concluded that it
 should be done next time.
 http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

 We might ask why pg_start_backup() needs to perform checkpoint though,
 since you have remarked that is a problem also.

 The answer is that it doesn't really need to, we just need to be certain
 that archiving has been running since whenever we choose as the start
 time. So we could easily just use the last normal checkpoint time, as
 long as we had some way of tracking the archiving.

 ISTM we can solve the checkpoint problem more easily and it would
 potentially save much more time than tuning rsync for Postgres, which
 is what the other idea amounted to. So I do see a solution that is both
 better and more quickly achievable for 8.4.

Sounds good. I agree that pg_start_backup basically doesn't need a
checkpoint. But with full_page_writes = off, we probably cannot get
rid of it. Even with full_page_writes = on, since we cannot tell
whether all the indispensable full pages have been written since the
last checkpoint, pg_start_backup must do a checkpoint with
forcePageWrites = on.

The problem is that online backup itself is unsafe. Even if there is no
disk failure (i.e. the normal case), we can easily produce a partial
write in an online backup. So we always need full pages when recovering
from an online backup, and therefore pg_start_backup always needs a
checkpoint with forcePageWrites = on.

I think that we probably have to track the history of full_page_writes
in order to get rid of the checkpoint in pg_start_backup.
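
To make the condition under discussion concrete, here is a minimal
sketch in Python (not PostgreSQL code). It assumes some bookkeeping that
does not exist in the tree: the names LastCheckpoint, backup_start_lsn
and the two "kept on" flags are hypothetical, standing in for "some way
of tracking" archiving and full_page_writes since the last normal
checkpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LastCheckpoint:
    """Hypothetical bookkeeping kept alongside the last normal checkpoint."""
    redo_lsn: int
    # Assumed flags: True only if the setting has stayed on, without
    # interruption, ever since this checkpoint was taken.
    full_page_writes_kept_on: bool
    archiving_kept_on: bool


def backup_start_lsn(last: LastCheckpoint,
                     forced_checkpoint: Callable[[], int]) -> int:
    """Return the LSN a base backup could start from.

    Reuse the last checkpoint only when both histories are clean;
    otherwise fall back to forcing a new checkpoint, as today.
    """
    if last.full_page_writes_kept_on and last.archiving_kept_on:
        return last.redo_lsn
    return forced_checkpoint()
```

The point of the sketch is only that the expensive path becomes the
fallback rather than the rule; how the two flags would be maintained is
exactly the open question here.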

On the other hand, the data left after a crash (other than a media
crash) is safe. Currently we can recover it without full page writes,
as in the simple crash recovery case. I think we can also use it for
archive recovery, because there isn't really any distinction between
the two. I've not found a corner case yet. Have you?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao
Hi,

On Mon, Dec 22, 2008 at 1:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Not so simple.

 At least the primary has to additionally maintain the byte position the
 standby has already fsynced. The main difference from the current patch is
 whether the standby fsyncs the logfile when it fills even if you don't
 choose #4(fsync). To avoid having to go back and re-open prior logfiles
 when an fsync request comes along later, we would need to ignore the sync
 mode and make the standby fsync the logfile when it fills. This would
 degrade performance periodically. Is this acceptable?

 I think there are four choices. Which do you prefer?

 1) Accept the above change.
 2) Go back and re-open prior logfiles when an fsync request comes along.
 3) Stop the sync control by the primary and leave it to the standby.
 4) Add a new option to specify whether to permit optimistic fsync. This
 option makes the standby fsync only the current logfile when an fsync
 request comes along (don't go back and re-open prior logfiles).

 2) would cause another performance degradation. 4) would furthermore
 confuse users about setting a sync mode. So, I prefer 3) though I'm sorry
 for digging up the discussion about transaction control. Please feel free
 to comment!

5) Only allow optimistic fsync

I'm going to adopt 5) for the next patch, at least for a while.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-21 Thread Markus Wanner
Hi,

Simon Riggs wrote:
 The second way can be done by taking a snapshot on the primary, with an
 associated LSN, then using that snapshot on the standby. That is
 somewhat complex, but possible. I see the requirement for getting the
 same answer on multiple nodes as a further extension of transaction
 isolation mode and think that not all people will want this, so we
 should allow that as an option.

I've been thinking a bit about this pretty interesting idea. It's
certainly of interest for Postgres-R as well.

AFAIK a function could simply wait until the node which is being
queried reaches a given point in the application of transactions (an
LSN, in the Sync-Rep world). Calling such a waiting function just after
BEGIN would ensure that the transaction sees (at least) the given
snapshot. If that snapshot has already been reached or passed, the
function does nothing.
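
A minimal sketch of such a waiting function, in Python rather than in
backend code. Everything named here (StandbyApplyState, advance,
wait_for_lsn) is hypothetical and only illustrates the idea; LSNs are
modelled as plain integers and the apply process as another thread.

```python
import threading


class StandbyApplyState:
    """Tracks how far a (hypothetical) standby has applied the log."""

    def __init__(self):
        self._applied_lsn = 0
        self._cond = threading.Condition()

    def advance(self, lsn):
        # Called by the apply process after replaying records up to lsn.
        with self._cond:
            if lsn > self._applied_lsn:
                self._applied_lsn = lsn
                self._cond.notify_all()

    def wait_for_lsn(self, target_lsn, timeout=None):
        # Returns True once the applied LSN has reached target_lsn,
        # False if the timeout expires first. If the target has already
        # been reached or passed, this returns immediately.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._applied_lsn >= target_lsn, timeout)
```

A reader would call wait_for_lsn() right after BEGIN with the LSN it
needs to see; the timeout is an addition of this sketch, so that a
lagging node cannot block the reader forever.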

What I like is that it's optimistic, in that the wait is only enforced
when needed by the reader. However, unlike enforcing the wait before
COMMIT, it requires changing the application to cope with this behavior
of the distributed database system. And knowing when to require which
snapshot sounds rather difficult from the point of view of the
application developer.

Also note that it might be the issuer of the transaction who wants to
ensure his transaction got propagated to the remote nodes.

 I'm not going to worry about this at the moment. Hot standby will be
 useful without this and so I regard this as a secondary objective. Rome
 wasn't built in a single release, or something like that.

Sounds like a decent plan. Good luck.

Regards

Markus Wanner




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-21 Thread Fujii Masao
Hi,

On Wed, Dec 17, 2008 at 12:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 No, we've been through that loop already a few months back:
 Transaction-controlled robustness.

 It should be up to the client on the primary to decide how much waiting
 they would like to perform in order to provide a guarantee. A change of
 setting on the standby should not be allowed to alter the performance or
 durability on the primary.

 OK. I will extend synchronous_replication, make walsender send XLOG
 with synchronization mode flag and make walreceiver perform according
 to the flag.

Not so simple.

At least the primary has to additionally maintain the byte position the standby
has already fsynced. The main difference from the current patch is whether
the standby fsyncs the logfile when it fills even if you don't choose #4(fsync).
To avoid having to go back and re-open prior logfiles when an fsync request
comes along later, we would need to ignore the sync mode and make the
standby fsync the logfile when it fills. This would degrade performance
periodically. Is this acceptable?

I think there are four choices. Which do you prefer?

1) Accept the above change.
2) Go back and re-open prior logfiles when an fsync request comes along.
3) Stop the sync control by the primary and leave it to the standby.
4) Add a new option to specify whether to permit optimistic fsync. This
option makes the standby fsync only the current logfile when an fsync
request comes along (don't go back and re-open prior logfiles).

2) would cause another performance degradation. 4) would furthermore
confuse users about setting a sync mode. So, I prefer 3) though I'm sorry
for digging up the discussion about transaction control. Please feel free
to comment!
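
To illustrate choice 1), here is a small Python simulation (not the
patch itself). StandbyLog, receive and the tiny SEGMENT_SIZE are
made-up names and values for this sketch; no real files are touched,
an "fsync" just advances a counter.

```python
SEGMENT_SIZE = 16  # bytes; tiny on purpose, a real logfile is far larger


class StandbyLog:
    """Simulated standby-side log writer (hypothetical, for illustration)."""

    def __init__(self):
        self.write_pos = 0    # bytes received and written so far
        self.fsync_pos = 0    # bytes known durable; reported to the primary
        self.fsync_calls = 0

    def _fsync_current(self):
        # Stands in for fsync() on the currently open logfile.
        self.fsync_pos = self.write_pos
        self.fsync_calls += 1

    def receive(self, nbytes, fsync_requested):
        while nbytes > 0:
            room = SEGMENT_SIZE - (self.write_pos % SEGMENT_SIZE)
            chunk = min(room, nbytes)
            self.write_pos += chunk
            nbytes -= chunk
            if self.write_pos % SEGMENT_SIZE == 0:
                # The logfile just filled: fsync it whatever the sync
                # mode, so that it never has to be re-opened later.
                self._fsync_current()
        if fsync_requested and self.fsync_pos < self.write_pos:
            self._fsync_current()
        return self.fsync_pos
```

The invariant is that everything before the current logfile is always
durable, so a later fsync request only ever touches the file that is
already open. The cost is the periodic extra fsync mentioned above, and
fsync_pos is the byte position the primary would have to track.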

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner
Hi,

Mark Mielke wrote:
 Where does the expectation come from?

I find the seat reservation, bank account or stock trading examples
pretty obvious WRT user expectations.

Nonetheless, I've compiled some hints from the documentation and sources:

"Since in Read Committed mode each new command starts with a new
snapshot that includes all transactions committed up to that instant" [1].

"This [SERIALIZABLE ISOLATION] level emulates serial transaction
execution, as if transactions had been executed one after another,
serially, rather than concurrently." [1]. (IMO this implies that a
transaction sees changes from all preceding transactions.)

"All changes made by the transaction become visible to others and are
guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not
overly clear here when exactly the changes become visible. OTOH,
there's no warning that another session doesn't immediately see
committed transactions. Not sure where you got that from.)

 I don't recall ever reading it in
 the documentation, and unless the session processes are contending over
 the integers (using some sort of synchronization primitive) in memory
 that represent the latest visible commit on every single select, I'm
 wondering how it is accomplished?

See the transaction system's README [3]. It documents the process of
snapshot taking and transaction isolation pretty well. Around line 226
it says: "What we actually enforce is strict serialization of commits
and rollbacks with snapshot-taking". (So the outcome of your experiment
is no surprise at all.)

And a bit later: "This rule is stronger than necessary for consistency,
but is relatively simple to enforce, and it assists with some other
issues as explained below." While this implies that an optimization is
theoretically possible, I very much doubt it would be worth it (for a
single node system).
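
A toy model of that rule, in Python. This is not how the backend is
written; ProcArraySketch and visible() are hypothetical names, aborted
transactions are ignored, and a single lock stands in for whatever
serializes commits with snapshot-taking.

```python
import threading


class ProcArraySketch:
    """Toy set of running transactions, guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = set()
        self._next_xid = 1

    def begin(self):
        with self._lock:
            xid = self._next_xid
            self._next_xid += 1
            self._running.add(xid)
            return xid

    def commit(self, xid):
        # Leaving the running set happens under the same lock that
        # snapshot() takes, so the two are strictly serialized.
        with self._lock:
            self._running.discard(xid)

    def snapshot(self):
        with self._lock:
            return (self._next_xid, frozenset(self._running))


def visible(xid, snap):
    # Simplification: a transaction is visible if it started before the
    # snapshot and was no longer running when the snapshot was taken.
    xmax, running = snap
    return xid < xmax and xid not in running
```

Because commit() and snapshot() take the same lock, any snapshot taken
after commit() has returned necessarily sees the committed transaction,
while a snapshot taken earlier keeps its old view.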

In a distributed system, things are a bit different. Network latency is
an order of magnitude higher than memory latency (for IPC). So a similar
optimization is very well worth it. However, the application (or the
load balancer or both) needs to know about this potential lag between
nodes. And as you've outlined elsewhere, a limit for how much a single
node may lag behind needs to be established.

(As a side note: for a multi-master system like Postgres-R, it's
beneficial to keep the lag time as low as possible, because the larger
the lag, the higher the probability for a conflict between two
transactions on different nodes.)

Regards

Markus Wanner


[1]: Pg 8.3 Docu: Concurrency Control:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html

[2]: Pg 8.3 Docu: COMMIT command:
http://www.postgresql.org/docs/8.3/static/sql-commit.html

[3]: README of transam (src/backend/access/transam/README):
https://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/backend/access/transam/README#L224



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Mark Mielke

Good answers, Markus. Thanks.

I've bought the thinking of several here that the user should have some 
control over what they expect (and what optimizations they are willing 
to accept as a good choice), but that commit should still be able to 
have a capped time limit.


I can think of many of my own applications where I would choose one mode 
vs another mode, even within the same application, depending on the 
operation itself. The most important requirement is that transactions 
are durable. It becomes convenient, though, to provide additional 
guarantees for some operation sequences.


I still see the requirement for seat reservation, bank account, or stock 
trading, as synchronizing using read-write locks before starting the 
select, rather than enforcing latest on every select.


For my own bank, when I do an online transaction, operations don't 
always immediately appear in my list of transactions. They appear to 
sometimes be batched, sometimes in near real time, and sometimes as part 
of some sort of day end processing.


For seat reservation, the time the seat layout is shown on the screen is 
not usually locked during a transaction. Between the time the travel 
agent brings up the seats on the plane, and the time they select the 
seat, the seat could be taken. What's important is that the reservation 
is durable, and that conflicts are not introduced. The commit must fail 
if another person has chosen the seat already. The commit does 
not need to wait until the reservation is pushed out to all systems 
before completing. The same is true of stock trading.


However, it can be very convenient for commits to be immediately visible 
after the commit completes. This allows for lazier models, such as a web 
site that reloads the view on the reservations or recent trades and 
expects to see recent commits no matter which server it accesses, rather 
than taking into account that the commit succeeded when presenting the 
next view.


If I look at sites like Google - they take the opposite extreme. I can 
post a message, and it remembers that I posted the message and makes it 
immediately visible, however, I might not see other new messages in a 
thread until a minute or more later.


So it looks like there is value to both ends of the spectrum, and while 
I feel the most value would be in providing a very fast system that 
scales near linear to the number of nodes in the system, even at the 
expense of immediately visible transactions from all servers, I can 
accept that sometimes the expectations are stricter and would appreciate 
seeing an option to let me choose based upon my requirements.


Cheers,
mark



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner
Hi,

Mark Mielke wrote:
 Robert Haas wrote:
 On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 We won't call it anything, because we never will or can implement that.
 See the theory of relativity: the notion of exactly simultaneous events

 OK, fine.  I'll be more precise.  I think we need to reserve the term
 "synchronous replication" for a system where transactions that begin
 on the standby after the transaction has committed on the master see
 the effects of the committed transaction.

I agree with Robert here. As far as I know this is the common
understanding of synchronous replication. Everything less - including
Postgres-R - is considered to be asynchronous.

 I'd like to see proof of some sort that PostgreSQL guarantees that the
 instant a 'commit' returns, any transactions already open with the
 appropriate transaction isolation level, or any new sessions *will* see
 the results of the commit.

Given within this thread, here [1].

 Two phase commit doesn't imply that the transaction is guaranteed to be
 immediately visible.

Just for the record: that's plain wrong. As with any other transaction,
a COMMIT of a prepared transaction guarantees visibility from all
subsequent snapshots (at least for Postgres and other serious RDBMSes).

Systems based on 2PC are the typical synchronous replication solution:
works, resistant to failures, consistent across nodes (WRT visibility),
but unusably slow. This is what people have in mind and expect when they
hear synchronous replication for databases. (And which is why I'm
thinking it's better for an optimized solution not to call itself
synchronous).

 Unless transactions are
 locked from starting until they are able to prove that they have the
 latest commit

See the cited README. It already happens for (single node) Postgres
systems, because the action of snapshot taking and committing are
serialized.

 (a feat which I'm going to theorize as impossible -
 because the moment you wait for a commit, and you begin again, you
 really have no guarantee that another commit has not occurred in the
 mean time)

This problem is solved by locking.

Regards

Markus Wanner


[1]: Hints to docs and source, that COMMIT actually ensures subsequent
snapshots include changes of the committed transaction:
http://archives.postgresql.org/message-id/494c.2060...@bluegap.ch



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner
Hi,

Josh Berkus wrote:
 Peter Eisentraut wrote:
 It's the color of the bikeshed ...

Agreed. It's why I've decided to support various modes for Postgres-R.
I'm glad to see that the current Sync Rep approach does the same.

 Hmmm.  I thought this was pretty clear.  There's three levels of synch
 which are useful features:
 
 1) synchronous standby which is really asynchronous, but only has a gap
 of < 100ms.

A synchronous standby which is really asynchronous? That's exactly the
naming challenge I've been pointing to.

Commonly used terms are: virtually synchronous, approximately
synchronous, near-real-time replication or eager replication, but
for most users, this is not synchronous (enough).

(BTW: there's no such < 100 ms guarantee. It may be typically below
100 ms, or even below 10 ms on average. But replication is not about the
typical or average case. It's much more about failures and uncommon
cases. The guarantee you can get in such a system (by declaring a node
as dead) is much more likely to be within the range of several seconds
and more, be it network, disk or whatever other failure-timeout that
applies here.)

 2) Synchronous standby which guarantees that all committed transactions
 are on the failover node and that no data will be lost for failover, but
 the failover node is still in standby mode.

What's the difference from 1) here? I'm not following.

 3) Synchronous replication where the standby node has identical
 transactions to the master node, and is queryable read-only.

So, a synchronous standby is different from synchronous replication in
that it's asynchronous?

Sorry for bugging with naming, but I think it is important for an
understanding during development.

 Any of these levels would be useful and allow a certain number of our
 users to deploy PostgreSQL in an environment where it wasn't used
 before.

I absolutely agree to that statement.

However, please do not confuse future users (and today's hackers), but
instead use existing terms consistently and clearly. Something that lags
behind, potentially by several seconds (in case of failure) is commonly
considered asynchronous, no matter how close to immediate it is on
average.

Regards

Markus Wanner



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner
Hi,

Mark Mielke wrote:
 Good answers, Markus. Thanks.

You are welcome.

 So it looks like there is value to both ends of the spectrum, and while
 I feel the most value would be in providing a very fast system that
 scales near linear to the number of nodes in the system, even at the
 expense of immediately visible transactions from all servers, I can
 accept that sometimes the expectations are stricter and would appreciate
 seeing an option to let me choose based upon my requirements.

I absolutely agree to that. The original Postgres-R algorithm covers the
eager (or virtually synchronous) part. I'm planning to extend it with a
(fully) synchronous mode and let the user choose per transaction.

Regards

Markus Wanner




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Fujii Masao
Hi,

On Fri, Dec 19, 2008 at 5:50 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote:

  Yes, please check the call points for ForceSyncCommit.
 
  Do I think every xlog flush should be synchronous, no, I don't.
 That's why we have a user settable parameter for it.

 Umm.. I'm focusing on the XLogFlush() calls other than the one in
 RecordTransactionCommit(), for example in FlushBuffer(),
 WriteTruncateXlogRec(), etc. These XLogFlush() calls might flush XLOG
 synchronously even in the asynchronous commit case.

 XLogFlush() flushes because of an interlock between a dirty buffer write
 and an outstanding WAL write. Dirty buffer writes are not replicated, so
 there is no need to have a similar interlock on WAL streaming.

 So making those call points synchronous is possible, but neither
 necessary nor IMHO desirable.

Yes in the upcoming 8.4, but probably no in the future.

What if the primary fails after writing a dirty data buffer but before
sending the corresponding logs? This would make the data on the primary
and the logs on the standby inconsistent. In 8.4, such inconsistency
might not matter because we don't use the data on the failed primary for
recovery (when restarting the failed server, we always need a fresh
backup). But since this restriction is not good for some people, in the
future the failed server should be able to restart without a fresh
backup, and then the inconsistency would be a problem. So I think the
inconsistency should be removed even in the asynchronous replication
case, and we should enforce the WAL rule across servers.
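
A minimal sketch of what "the WAL rule across servers" could mean, in
Python. WalState, can_write_buffer and the replicate_interlock switch
are hypothetical names for this illustration; LSNs are plain integers.

```python
from dataclasses import dataclass


@dataclass
class WalState:
    local_flushed_lsn: int = 0      # WAL durable on the primary's disk
    standby_received_lsn: int = 0   # WAL the standby is known to have


def can_write_buffer(page_lsn, wal, replicate_interlock):
    """May a dirty page whose last change is at page_lsn be written out?"""
    # Ordinary WAL rule: the log describing the change must be on local
    # disk before the data page is.
    if page_lsn > wal.local_flushed_lsn:
        return False
    # Assumed extension: the same log must also have reached the
    # standby, so the primary's data never runs ahead of the standby's
    # log.
    if replicate_interlock and page_lsn > wal.standby_received_lsn:
        return False
    return True
```

With replicate_interlock off this is today's behaviour, where dirty
buffer writes need no interlock with WAL streaming; with it on, a
buffer write can stall on the network, which is the cost being weighed.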

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-19 Thread Simon Riggs

On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote:

  Yes, please check the call points for ForceSyncCommit.
 
  Do I think every xlog flush should be synchronous, no, I don't.
 That's why we have a user settable parameter for it.
 
 Umm.. I am focusing on the XLogFlush() calls made outside
 RecordTransactionCommit(), for example in FlushBuffer() and
 WriteTruncateXlogRec(). Those XLogFlush() calls might flush XLOG
 synchronously even in the asynchronous commit case.

XLogFlush() flushes because of an interlock between a dirty buffer write
and an outstanding WAL write. Dirty buffer writes are not replicated, so
there is no need to have a similar interlock on WAL streaming.

So making those call points synchronous is possible, but neither
necessary nor IMHO desirable.

On a related but different point: We don't need an interlock between
dirty buffers and WAL during recovery because the WAL has already been
written.
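
A minimal sketch of the interlock described above, written in Python
rather than PostgreSQL's C. The names Wal, BufferManager and
flush_buffer are invented for illustration; this is a toy model of the
rule that WAL up to a page's LSN must be flushed before the page itself
is written, not the actual XLogFlush() or FlushBuffer() code.

```python
class Wal:
    """Toy write-ahead log: each record gets the next LSN."""

    def __init__(self):
        self.next_lsn = 1
        self.flushed_lsn = 0  # highest LSN known to be on disk

    def insert(self):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush(self, upto_lsn):
        # Only ever moves forward; flushing an already-flushed LSN is a no-op.
        if upto_lsn > self.flushed_lsn:
            self.flushed_lsn = upto_lsn


class BufferManager:
    """Toy buffer pool that remembers the LSN of each page's last change."""

    def __init__(self, wal):
        self.wal = wal
        self.dirty = {}    # page id -> LSN of the last change to that page
        self.on_disk = {}  # page id -> LSN of the version written out

    def modify(self, page):
        self.dirty[page] = self.wal.insert()

    def flush_buffer(self, page):
        lsn = self.dirty.pop(page)
        # The interlock: WAL must reach disk before the data page does.
        self.wal.flush(lsn)
        self.on_disk[page] = lsn
```

Nothing in this model involves a standby, which is the point being
made: the flush is forced by the local dirty-page write, independently
of whether the commit itself was asynchronous.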

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-19 Thread Heikki Linnakangas

Simon Riggs wrote:

On a related but different point: We don't need an interlock between
dirty buffers and WAL during recovery because the WAL has already been
written.


Assuming the WAL has also been fsync'd.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-19 Thread Simon Riggs

On Fri, 2008-12-19 at 11:04 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On a related but different point: We don't need an interlock between
  dirty buffers and WAL during recovery because the WAL has already been
  written.
 
 Assuming the WAL has also been fsync'd.

True, so this will need to change for 8.4 also

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-18 Thread Simon Riggs

On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote:

  Agreed, I also think that hard code is better. But I'm nervous that off
  keeps us waiting for replication in cases other than DDL, e.g. flush
  buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
  is quite similar to synchronous_commit = off. If we would hard code #4,
  the performance might degrade although it's asynchronous replication.
  So, I'd like to hard code #3. What is your opinion?
 
  We don't do that when we flush buffer, truncate clog or checkpoint, not
  sure why you mention those.
 
  We ForceSyncCommit when we
  * VACUUM FULL
  * CREATE/DROP DATABASE or USER
  * Create/Drop Tablespace
 
  I don't see a problem in forcing an fsync for those. I will sleep safer
  knowing those guys are on disk even in async mode.
 
 If my understanding is correct, XLOG flush is forced up to buffer's LSN
 when flushing buffer even if asynchronous commit case. Am I missing
 something?

Yes, please check the call points for ForceSyncCommit.

Do I think every xlog flush should be synchronous? No, I don't. That's
why we have a user-settable parameter for it.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-18 Thread Fujii Masao
Hi,

On Thu, Dec 18, 2008 at 6:35 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote:

  Agreed, I also think that hard code is better. But I'm nervous that off
  keeps us waiting for replication in cases other than DDL, e.g. flush
  buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
  is quite similar to synchronous_commit = off. If we would hard code #4,
  the performance might degrade although it's asynchronous replication.
  So, I'd like to hard code #3. What is your opinion?
 
  We don't do that when we flush buffer, truncate clog or checkpoint, not
  sure why you mention those.
 
  We ForceSyncCommit when we
  * VACUUM FULL
  * CREATE/DROP DATABASE or USER
  * Create/Drop Tablespace
 
  I don't see a problem in forcing an fsync for those. I will sleep safer
  knowing those guys are on disk even in async mode.

 If my understanding is correct, XLOG flush is forced up to buffer's LSN
 when flushing buffer even if asynchronous commit case. Am I missing
 something?

 Yes, please check the call points for ForceSyncCommit.

 Do I think every xlog flush should be synchronous, no, I don't. That's
 why we have a user settable parameter for it.

Umm.. I am focusing on the XLogFlush() calls made outside
RecordTransactionCommit(), for example in FlushBuffer() and
WriteTruncateXlogRec(). Those XLogFlush() calls might flush XLOG
synchronously even in the asynchronous commit case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Simon Riggs

On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:

 OK. I will extend synchronous_replication, make walsender send XLOG
 with synchronization mode flag and make walreceiver perform according
 to the flag.

Sounds good.

  My perspective is that synchronous_replication specifies how long to
  wait. Current settings are off (don't wait) or on (meaning wait
  until point #3). So I think we should change this to a list of options
  to allow people to more carefully select how much waiting is required.
 
 In the latest patch, off keeps us waiting for replication in some
 cases, e.g. forceSyncCommit = true. This is analogous to the way
 synchronous_commit works. When off keeps us waiting for
 replication, which option (#1-#6) should we choose? Should it be
 user-configurable (though the parameter values are doubled)?
 hardcode #3? off always should not keep us waiting for
 replication?

I would hard code #4, i.e. make it fsync, so that DDL changes are
regarded as high value transactions.

A parameter sounds like overkill. We'd need to explain what
forceSyncCommit does to users then, which is easier to avoid.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Fujii Masao
Hi,

Thanks for the helpful comments!

On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:

 OK. I will extend synchronous_replication, make walsender send XLOG
 with synchronization mode flag and make walreceiver perform according
 to the flag.

 Sounds good.

  My perspective is that synchronous_replication specifies how long to
  wait. Current settings are off (don't wait) or on (meaning wait
  until point #3). So I think we should change this to a list of options
  to allow people to more carefully select how much waiting is required.

 In the latest patch, off keeps us waiting for replication in some
 cases, e.g. forceSyncCommit = true. This is analogous to the way
 synchronous_commit works. When off keeps us waiting for
 replication, which option (#1-#6) should we choose? Should it be
 user-configurable (though the parameter values are doubled)?
 hardcode #3? off always should not keep us waiting for
 replication?

 I would hard code #4, i.e. make it fsync, so that DDL changes are
 regarded as high value transactions.

 A parameter sounds like overkill. We'd need to explain what
 forceSyncCommit does to users then, which is easier to avoid.

Agreed, I also think that hard-coding is better. But I'm nervous that off
would keep us waiting for replication in cases other than DDL, e.g. buffer
flush, clog truncation, checkpoint, etc. synchronous_replication = off
is quite similar to synchronous_commit = off. If we hard-code #4,
performance might degrade even though it's asynchronous replication.
So, I'd like to hard-code #3. What is your opinion?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Simon Riggs

On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote:
 Hi,
 
 Thanks for the helpful comments!
 
 On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
  On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:
 
  OK. I will extend synchronous_replication, make walsender send XLOG
  with synchronization mode flag and make walreceiver perform according
  to the flag.
 
  Sounds good.
 
   My perspective is that synchronous_replication specifies how long to
   wait. Current settings are off (don't wait) or on (meaning wait
   until point #3). So I think we should change this to a list of options
   to allow people to more carefully select how much waiting is required.
 
  In the latest patch, off keeps us waiting for replication in some
  cases, e.g. forceSyncCommit = true. This is analogous to the way
  synchronous_commit works. When off keeps us waiting for
  replication, which option (#1-#6) should we choose? Should it be
  user-configurable (though the parameter values are doubled)?
  hardcode #3? off always should not keep us waiting for
  replication?
 
  I would hard code #4, i.e. make it fsync, so that DDL changes are
  regarded as high value transactions.
 
  A parameter sounds like overkill. We'd need to explain what
  forceSyncCommit does to users then, which is easier to avoid.
 
 Agreed, I also think that hard code is better. But I'm nervous that off
 keeps us waiting for replication in cases other than DDL, e.g. flush
 buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
 is quite similar to synchronous_commit = off. If we would hard code #4,
 the performance might degrade although it's asynchronous replication.
 So, I'd like to hard code #3. What is your opinion?

We don't do that when we flush buffer, truncate clog or checkpoint, not
sure why you mention those.

We ForceSyncCommit when we
* VACUUM FULL
* CREATE/DROP DATABASE or USER
* Create/Drop Tablespace

I don't see a problem in forcing an fsync for those. I will sleep safer
knowing those guys are on disk even in async mode.
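
A rough sketch of that decision, in Python rather than the backend's C.
The command list is the one given above; the names
FORCED_SYNC_COMMANDS and commit_waits_for_flush are made up for this
example and are not PostgreSQL identifiers.

```python
# Commands listed above as forcing a synchronous commit.
FORCED_SYNC_COMMANDS = frozenset({
    "VACUUM FULL",
    "CREATE DATABASE", "DROP DATABASE",
    "CREATE USER", "DROP USER",
    "CREATE TABLESPACE", "DROP TABLESPACE",
})


def commit_waits_for_flush(command: str, synchronous_commit: bool) -> bool:
    """Return True if the commit should wait for the WAL flush.

    Hypothetical helper: a forced command waits even when the user has
    set synchronous_commit = off.
    """
    force_sync_commit = command in FORCED_SYNC_COMMANDS
    return synchronous_commit or force_sync_commit
```

Ordinary transactions still follow the user's setting; only the short
list of forced commands overrides it.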

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Fujii Masao
Hi,

On Thu, Dec 18, 2008 at 11:19 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote:
 Hi,

 Thanks for the helpful comments!

 On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
  On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:
 
  OK. I will extend synchronous_replication, make walsender send XLOG
  with synchronization mode flag and make walreceiver perform according
  to the flag.
 
  Sounds good.
 
   My perspective is that synchronous_replication specifies how long to
   wait. Current settings are off (don't wait) or on (meaning wait
   until point #3). So I think we should change this to a list of options
   to allow people to more carefully select how much waiting is required.
 
  In the latest patch, off keeps us waiting for replication in some
  cases, e.g. forceSyncCommit = true. This is analogous to the way
  synchronous_commit works. When off keeps us waiting for
  replication, which option (#1-#6) should we choose? Should it be
  user-configurable (though the parameter values are doubled)?
  hardcode #3? off always should not keep us waiting for
  replication?
 
  I would hard code #4, i.e. make it fsync, so that DDL changes are
  regarded as high value transactions.
 
  A parameter sounds like overkill. We'd need to explain what
  forceSyncCommit does to users then, which is easier to avoid.

 Agreed, I also think that hard code is better. But I'm nervous that off
 keeps us waiting for replication in cases other than DDL, e.g. flush
 buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
 is quite similar to synchronous_commit = off. If we would hard code #4,
 the performance might degrade although it's asynchronous replication.
 So, I'd like to hard code #3. What is your opinion?

 We don't do that when we flush buffer, truncate clog or checkpoint, not
 sure why you mention those.

 We ForceSyncCommit when we
 * VACUUM FULL
 * CREATE/DROP DATABASE or USER
 * Create/Drop Tablespace

 I don't see a problem in forcing an fsync for those. I will sleep safer
 knowing those guys are on disk even in async mode.

If my understanding is correct, an XLOG flush is forced up to the buffer's
LSN when flushing a buffer, even in the asynchronous commit case. Am I
missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-16 Thread Simon Riggs

On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:

  So from my previous list
 
  1. We sent the message to standby (A)
  2. We received the message on standby
  3. We wrote the WAL to the WAL file (B)
  4. We fsync'd the WAL file (C)
  5. We CRC checked the WAL commit record
  6. We applied the WAL commit record
 
  Please could you also add an option #4, i.e. add the *option* to fsync
  the WAL to disk at commit time also. That requires us to add a third
  option to synchronous_replication parameter.
 
 The above option should be configured on the primary? or standby?
 The primary is suitable to vary it from transaction to transaction. On
 the other hand, it should be configured on the standby in order to
 choose it for every standby (in the future).
 
 I prefer the latter, and thought that it should be added into recovery.conf.
 I mean, synchronous_replication identifies only whether commit waits for
 replication (if the name is confusing, I would rename it). The above
 options (#1-#6) are chosen in recovery.conf. What is your opinion?

No, we've been through that loop already a few months back:
Transaction-controlled robustness.

It should be up to the client on the primary to decide how much waiting
they would like to perform in order to provide a guarantee. A change of
setting on the standby should not be allowed to alter the performance or
durability on the primary.

My perspective is that synchronous_replication specifies how long to
wait. Current settings are off (don't wait) or on (meaning wait
until point #3). So I think we should change this to a list of options
to allow people to more carefully select how much waiting is required.

This feature is then analogous to the way synchronous_commit works. It
also provides a level of application control not seen in any other RDBMS
in the industry, which makes it very suitable for large and important
applications that need a fine mix of robustness and performance.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-16 Thread Fujii Masao
Hi,

On Tue, Dec 16, 2008 at 7:21 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:

  So from my previous list
 
  1. We sent the message to standby (A)
  2. We received the message on standby
  3. We wrote the WAL to the WAL file (B)
  4. We fsync'd the WAL file (C)
  5. We CRC checked the WAL commit record
  6. We applied the WAL commit record
 
  Please could you also add an option #4, i.e. add the *option* to fsync
  the WAL to disk at commit time also. That requires us to add a third
  option to synchronous_replication parameter.

 The above option should be configured on the primary? or standby?
 The primary is suitable to vary it from transaction to transaction. On
 the other hand, it should be configured on the standby in order to
 choose it for every standby (in the future).

 I prefer the latter, and thought that it should be added into recovery.conf.
 I mean, synchronous_replication identifies only whether commit waits for
 replication (if the name is confusing, I would rename it). The above
 options (#1-#6) are chosen in recovery.conf. What is your opinion?

 No, we've been through that loop already a few months back:
 Transaction-controlled robustness.

 It should be up to the client on the primary to decide how much waiting
 they would like to perform in order to provide a guarantee. A change of
 setting on the standby should not be allowed to alter the performance or
 durability on the primary.

OK. I will extend synchronous_replication, make walsender send XLOG
with synchronization mode flag and make walreceiver perform according
to the flag.
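
As a rough illustration of that plan (Python, not the patch's C code):
each message carries a mode chosen on the primary, and the receiver
acknowledges at the point the mode asks for. WalMessage,
ToyWalReceiver and the mode strings "off", "write" and "fsync" are all
assumptions made for this sketch; the real message format is not shown
in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class WalMessage:
    lsn: int
    data: bytes
    mode: str  # hypothetical values: "off", "write" or "fsync"


@dataclass
class ToyWalReceiver:
    written: list = field(default_factory=list)  # data written to the WAL file
    fsynced_upto: int = 0                        # highest LSN "fsync'd"
    acks: list = field(default_factory=list)     # LSNs acknowledged to the primary

    def handle(self, msg: WalMessage) -> None:
        self.written.append(msg.data)      # point #3: write to the WAL file
        if msg.mode == "write":
            self.acks.append(msg.lsn)      # reply once written
        elif msg.mode == "fsync":
            self.fsynced_upto = msg.lsn    # point #4: stands in for fsync()
            self.acks.append(msg.lsn)      # reply once fsync'd
        # mode "off": the primary is not waiting, so no reply is sent
```

Because the mode travels with the message, nothing configured on the
standby changes how long the primary waits.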


 My perspective is that synchronous_replication specifies how long to
 wait. Current settings are off (don't wait) or on (meaning wait
 until point #3). So I think we should change this to a list of options
 to allow people to more carefully select how much waiting is required.

In the latest patch, off keeps us waiting for replication in some
cases, e.g. forceSyncCommit = true. This is analogous to the way
synchronous_commit works. When off keeps us waiting for
replication, which option (#1-#6) should we choose? Should it be
user-configurable (although that doubles the number of parameter
values)? Should we hard-code #3? Or should off never keep us waiting
for replication?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Heikki Linnakangas

Mark Mielke wrote:
Where does the expectation come from? I don't recall ever reading it in 
the documentation, and unless the session processes are contending over 
the integers (using some sort of synchronization primitive) in memory 
that represent the latest visible commit on every single select, I'm 
wondering how it is accomplished? 


The integers you're imagining are the ProcArray. Every backend has an 
entry there, and among other things it contains the current XID the 
backend is running. When a backend takes a new snapshot (on every single 
select in read committed mode), it locks the ProcArray, scans all the 
entries and collects all the XIDs listed there in the snapshot. Those 
are the set of transactions that were running when the snapshot was 
taken, and is used in the visibility checks.
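
A toy model of that snapshot step, in Python rather than the backend's
C and much simplified: one slot per backend, one lock, and a snapshot
that is just the set of XIDs found while holding the lock. The class
and method names (ToyProcArray, begin, end, take_snapshot) are invented
for illustration and do not match the real ProcArray code.

```python
import threading


class ToyProcArray:
    """One slot per backend holding its current XID, or None if idle."""

    def __init__(self, nbackends: int):
        self._lock = threading.Lock()  # plays the role of ProcArrayLock
        self._xids = [None] * nbackends

    def begin(self, backend: int, xid: int) -> None:
        with self._lock:
            self._xids[backend] = xid

    def end(self, backend: int) -> None:
        with self._lock:
            self._xids[backend] = None

    def take_snapshot(self) -> frozenset:
        # Hold the lock only long enough to scan all entries.
        with self._lock:
            return frozenset(x for x in self._xids if x is not None)
```

A snapshot taken earlier keeps listing a transaction as running even
after it ends, which is what lets a query see a consistent view.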


 If they are contending over these
 integers, doesn't that represent a scaling limitation, in the sense that
 on a 32-core machine, they're going to be fighting with each other to
 get the latest version of these shared integers into the CPU for
 processing? Maybe it's such a small penalty that we don't care? :-)

The ProcArrayLock is indeed quite busy on systems with a lot of CPUs. 
It's held for such short times that it's not a problem usually, but it 
can become a bottleneck with a machine like that with all backends 
running small transactions.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Aidan Van Dyk
* Robert Haas robertmh...@gmail.com [081215 07:32]:
  In fact, waiting for reply from standby server before acknowledging a commit
  to the client is a bit pointless otherwise. It puts you in a strange
  situation, where you're waiting for the commits in normal operation, but if
  there's a network glitch or the standby goes down, you're willing to go
  ahead without it. You get a high guarantee that your data is up-to-date in
  the standby, except when it isn't. Which isn't much of a guarantee.
 
 It protects you against a catastrophic loss of the primary, which is a
 non-trivial consideration.  At the risk of being ghoulish, imagine
 that you are a large financial company headquartered in the world
 trade center.

This was exactly my original point - I want the transaction durably on
the slave before the commit is acknowledged (to build as much local
redundancy as I can), but I certainly *don't* want to lose the ability
to use WAL archiving, because I ship my WAL off-site too...

The ability to have an extra local copy is good.  But I'm certainly not
going to want to give up my off-site backup/WAL for it...

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Robert Haas
 So you'd want all commits to wait until the transaction is safely replicated
 in the standby. But if there's a network glitch, or the standby is
 restarted, you're happy to reply to the client that it's committed if it's
 only safely committed in the primary. Essentially, you wait for the reply as
 long as the standby responds within X seconds, but if it takes more than Y
 seconds, you don't wait. I know that people do that, but it seems
 counterintuitive to me. In that case, when the primary acks the transaction
 as committed, you only know that it's safely committed in the primary; it
 doesn't give any hard guarantee about the state in the standby.

I understand your point, but I think there's still a use case.  The
idea is that declaring the secondary dead is a rare event, and there's
some mechanism by which you're enabled to page your network staff, and
they hightail it into the office to fix the problem.  It might not be
the way that you want to run your system, but I don't think it's
unreasonable for someone else to want it.

 But when you consider the possibility to use the standby for queries, the
 synchronous mode makes sense too.
 I'm not opposed to providing all the options, but the synchronous mode where
 we can guarantee that if you query the standby, you will see the effects of
 all transactions committed in the primary, makes the synchronous mode much
 more interesting. If you don't need that property, you're most likely more
 happy with asynchronous mode anyway.

I agree that asynchronous mode will be the right solution for a very
large subset of our users.

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Sun, 2008-12-14 at 21:41 -0500, Robert Haas wrote:
  If this is right, #2, #3, #4, and #6 feel similar except
  that they're protecting against failures of different (but
  still all incomplete) subsets of the hardware on the slave, right?
 
 Right.  Actually, the biggest difference with #6 has nothing to do
 with protecting against failures.  It has rather to do with the ease
 of writing applications in the context of hot standby.  You can close
 your connection, open a connection to a different server, and know
 that your transactions will be reflected there.  On the other hand,
 I'd be surprised if it didn't come with a substantial performance
 penalty, so it may not be too practical in real life even if it sounds
 good on paper.
 
 #1, #3, and #5 don't feel that useful to me.

Yes, looks that way for me also.

Good analysis Ron. I agree with Robert that #6 is there for other
reasons.

#2 corresponds to DRBD algorithm B

#4 corresponds to DRBD algorithm C

Fujii-san, please can we incorporate those two options, rather than just
one choice synchronous_replication = on. They look like two commonly
requested options.

#6 is an additional synchronization step in Hot Standby. I would say
that people won't want that when they see how it performs (they probably
won't want #4 either for that same reason, but that is for robustness).

Also, I would point out that the class of synch_rep is selected by the
user on the primary and can vary from transaction to transaction. That
is a very good thing, as far as I am concerned. We would need to enforce
#6 for all transactions (if we implemented synchronisation in this way).

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Robert Haas
 In fact, waiting for reply from standby server before acknowledging a commit
 to the client is a bit pointless otherwise. It puts you in a strange
 situation, where you're waiting for the commits in normal operation, but if
 there's a network glitch or the standby goes down, you're willing to go
 ahead without it. You get a high guarantee that your data is up-to-date in
 the standby, except when it isn't. Which isn't much of a guarantee.

It protects you against a catastrophic loss of the primary, which is a
non-trivial consideration.  At the risk of being ghoulish, imagine
that you are a large financial company headquartered in the world
trade center.

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

Fujii-san,

Just repeating this in case you lost this comment:

On Mon, 2008-12-15 at 09:40 +, Simon Riggs wrote:

 Fujii-san, please can we incorporate those two options, rather than just
 one choice synchronous_replication = on. They look like two commonly
 requested options.

I see the comment in line 230+ of walreceiver.c, so understand that you
have implemented option #3 from the following list.

So from my previous list

1. We sent the message to standby (A)
2. We received the message on standby
3. We wrote the WAL to the WAL file (B)
4. We fsync'd the WAL file (C)
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Please could you also add an option #4, i.e. add the *option* to fsync
the WAL to disk at commit time also. That requires us to add a third
option to synchronous_replication parameter.

That then means we will have robustness options that map directly to
DRBD algorithms A, B and C (shown in brackets in the above list). I
believe these map also to Data Guard options Maximum Performance and
Maximum Availability.
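
For illustration only, the six reply points and the DRBD mapping shown
in brackets in the list above could be written down like this (Python
sketch; ReplyPoint, DRBD_PROTOCOL and can_reply are names invented
here, not part of the patch):

```python
from enum import IntEnum


class ReplyPoint(IntEnum):
    """Points at which the reply to a commit could be made."""
    SENT = 1         # message sent to standby
    RECEIVED = 2     # message received on standby
    WRITTEN = 3      # WAL written to the WAL file
    FSYNCED = 4      # WAL file fsync'd
    CRC_CHECKED = 5  # WAL commit record CRC checked
    APPLIED = 6      # WAL commit record applied


# Mapping taken from the bracketed letters in the list above.
DRBD_PROTOCOL = {
    ReplyPoint.SENT: "A",
    ReplyPoint.WRITTEN: "B",
    ReplyPoint.FSYNCED: "C",
}


def can_reply(reached: ReplyPoint, requested: ReplyPoint) -> bool:
    """A commit may be acknowledged once the requested point is reached."""
    return reached >= requested
```

Ordering the points numerically makes each option a simple threshold
on how far the commit record has progressed.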

AFAICS if we implement the additional items I've requested over the last
few days, then the architecture is now at a good point for 8.4 and we
can begin to look at low level implementation details. Or put another
way, I'm not expecting to come up with more architecture changes.

 #6 is an additional synchronization step in Hot Standby. I would say
 that people won't want that when they see how it performs (they probably
 won't want #4 either for that same reason, but that is for robustness).

We can jointly add option #6 once we have both sync rep and hot standby
committed, or at a late stage of hot standby development. There's not
much point looking at it before then.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Sun, 2008-12-14 at 12:57 -0500, Mark Mielke wrote:

 I'm curious about your suggestion to direct queries that need the latest
 snapshot to the 'primary'. I might have misunderstood it - but it seems
 that the expectation from some is that *all* sessions see the latest
 snapshot, so would this not imply that all sessions would be redirected
 to the 'primary'? I don't think it is reasonable myself, but I might be
 misunderstanding something...

I said a snapshot taken on the primary, but the query would run on the
standby.

Synchronising primary and standby so that they are identical from the
perspective of a query requires some synchronisation delay. I'm pointing
out that the synchronisation delay can occur 

* at the time we apply WAL - which will slow down commits (i.e. #6 on my
previous list of options)
* at the time we run a query that needs to see primary and standby
synchronised

So the same effect can be achieved in various ways.

The first way would require *all* transactions to be applied on standby,
i.e. option #6 for all transactions. That is a performance disaster and
I would not force that onto everybody.

The second way can be done by taking a snapshot on the primary, with an
associated LSN, then using that snapshot on the standby. That is
somewhat complex, but possible. I see the requirement for getting the
same answer on multiple nodes as a further extension of transaction
isolation mode and think that not all people will want this, so we
should allow that as an option.
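
A minimal sketch of the second way, assuming the query carries the LSN
associated with its primary snapshot and simply waits on the standby
until replay has caught up to it. This is Python with invented names
(ToyStandby, replay_to, wait_for_lsn); how the snapshot itself would be
shipped to the standby is left out.

```python
import threading


class ToyStandby:
    """Tracks how far WAL replay has progressed on the standby."""

    def __init__(self):
        self._cond = threading.Condition()
        self.replayed_lsn = 0

    def replay_to(self, lsn: int) -> None:
        # Called by the replay process as it applies WAL.
        with self._cond:
            if lsn > self.replayed_lsn:
                self.replayed_lsn = lsn
            self._cond.notify_all()

    def wait_for_lsn(self, lsn: int, timeout=None) -> bool:
        # Called by a query before using a snapshot taken at `lsn` on
        # the primary. Returns False if replay did not catch up in time.
        with self._cond:
            return self._cond.wait_for(
                lambda: self.replayed_lsn >= lsn, timeout
            )
```

Here the delay is paid only by the queries that ask for it, rather
than by every commit as with option #6.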

I'm not going to worry about this at the moment. Hot standby will be
useful without this and so I regard this as a secondary objective. Rome
wasn't built in a single release, or something like that.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Peter Eisentraut

Simon Riggs wrote:

I am truly lost to understand why the *name* synchronous replication
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do*


It's the color of the bikeshed ...


We can make the reply to a commit message when any of the following
events have occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record


In DRBD tradition, I suggest you implement all of them, or at least 
factor the code so that each of them can be a one line change.  (We can 
probably later drop one or two options.)




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Heikki Linnakangas

Robert Haas wrote:

In fact, waiting for reply from standby server before acknowledging a commit
to the client is a bit pointless otherwise. It puts you in a strange
situation, where you're waiting for the commits in normal operation, but if
there's a network glitch or the standby goes down, you're willing to go
ahead without it. You get a high guarantee that your data is up-to-date in
the standby, except when it isn't. Which isn't much of a guarantee.


It protects you against a catastrophic loss of the primary, which is a
non-trivial consideration.  At the risk of being ghoulish, imagine
that you are a large financial company headquartered in the world
trade center.


So you'd want all commits to wait until the transaction is safely 
replicated in the standby. But if there's a network glitch, or the 
standby is restarted, you're happy to reply to the client that it's 
committed if it's only safely committed in the primary. Essentially, you 
wait for the reply as long as the standby responds within X seconds, but if 
it takes more than Y seconds, you don't wait. I know that people do 
that, but it seems counterintuitive to me. In that case, when the 
primary acks the transaction as committed, you only know that it's 
safely committed in the primary; it doesn't give any hard guarantee 
about the state in the standby.
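
The behaviour described above can be sketched roughly as follows. This is an illustration in Python, not anything from the patch: `commit_with_standby_timeout` and its return strings are hypothetical, and a `threading.Event` stands in for the reply from the standby.

```python
import threading


def commit_with_standby_timeout(standby_acked: threading.Event,
                                timeout_s: float) -> str:
    """Wait up to timeout_s for the standby's ack, then proceed without it.

    Returns "replicated" if the ack arrived in time, or "primary-only" if
    the wait timed out and the commit is acknowledged on the primary alone.
    """
    if standby_acked.wait(timeout_s):
        return "replicated"
    return "primary-only"
```

The client sees a successful commit in both cases, which is exactly why the ack alone gives no hard guarantee about the standby.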


But when you consider the possibility to use the standby for queries, 
the synchronous mode makes sense too.


I'm not opposed to providing all the options, but the synchronous mode 
where we can guarantee that if you query the standby, you will see the 
effects of all transactions committed in the primary, makes the 
synchronous mode much more interesting. If you don't need that property, 
you're most likely more happy with asynchronous mode anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Greg Stark
It's a real promise. The reason you're getting hand-wavy answers is  
because it's such a basic requirement that I'm trying to point out  
just how fundamental a requirement it is.


If you want to see the actual code which guarantees this take a look  
around the code for procarray - in particular the code for taking a  
snapshot. There are comments there about what locks are needed when  
committing and when taking a snapshot and why. But it's quite technical.


--
Greg


On 15 Dec 2008, at 02:03, Mark Mielke m...@mark.mielke.cc wrote:


Greg Stark wrote:
When the database says the data is committed it has to mean the  
data is really committed. Imagine if you looked at a bank account  
balance after withdrawing all the money and saw a balance which  
didn't reflect the withdrawal and allowed you to withdraw more  
money again...


Within the same session - sure. From different sessions? PostgreSQL
MVCC lets you see an older snapshot, although it does prefer to
have the latest snapshot with each command.


For allowing to withdraw more money again, I would expect some sort
of locking (SELECT ... FOR UPDATE) to be used. This lock then
forces the two transactions to become serialized and the second will  
either wait for the first to complete or fail. Any banking program  
that assumed that it could SELECT to confirm a balance and then  
UPDATE to withdraw the money as separate instructions would be a bad  
banking program. To exploit it, I would just have to start both  
operations at the same time - they both SELECT, they both see I have  
money, they both give me the money and UPDATE, and I get double the  
money (although my balance would show a big negative value - but I'm  
already gone...). Database 101.
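
A small simulation of the two patterns described above, in Python rather than SQL. Everything here is illustrative: `Account`, `withdraw_unlocked_interleaved` and `withdraw_locked` are made-up names, the interleaving of the two sessions is written out by hand so the outcome is deterministic, and a `threading.Lock` stands in for the row lock that SELECT ... FOR UPDATE would take.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()  # stands in for the row lock


def withdraw_unlocked_interleaved(acct: Account, amount: int) -> int:
    """Two sessions check the balance, then both update: the broken pattern."""
    seen_a = acct.balance          # session A: SELECT
    seen_b = acct.balance          # session B: SELECT (same view of the balance)
    paid = 0
    if seen_a >= amount:           # session A: UPDATE
        acct.balance -= amount
        paid += amount
    if seen_b >= amount:           # session B: UPDATE, based on its earlier check
        acct.balance -= amount
        paid += amount
    return paid


def withdraw_locked(acct: Account, amount: int) -> int:
    """Check and update under one lock, as SELECT ... FOR UPDATE would."""
    with acct.lock:
        if acct.balance >= amount:
            acct.balance -= amount
            return amount
        return 0
```

Starting from a balance of 100, the unlocked version pays out 200 and leaves the balance at -100, while two locked withdrawals of 100 pay out 100 and then 0.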


When I asked "does PostgreSQL guarantee this?" I didn't mean
hand-waving examples or hand-waving expectations. I meant a pointer
into the code that has some comment that says "we want to guarantee
that a commit in one session will be immediately visible to other
sessions, and that a later select issued in the other sessions will
ALWAYS see the commit whether 1 nanosecond later or 200 seconds
later". Robert's expectation and yours seem like "taking this
guarantee for granted" rather than being justified with design
intent and proof thus far. :-) Given my experiment to try and force
it to fail, I can see why this would be taken for granted. Is this a
real promise, though? Or just an unlikely scenario that never seems
to be hit?


To me, the question is relevant in terms of the expectations of a  
multi-replica solution. We know people have the expectation. We know  
it can be convenient. Is the expectation valid in the first place?


I've probably drawn this question out too long and should do my own  
research and report back... Sorry... :-)


Cheers,
mark

--
Mark Mielke m...@mielke.cc





Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Jeff Davis
On Mon, 2008-12-15 at 09:19 -0500, Robert Haas wrote:
> I understand your point, but I think there's still a use case.   The
> idea is that declaring the secondary dead is a rare event, and there's
> some mechanism by which you're enabled to page your network staff, and
> they hightail it into the office to fix the problem.  It might not be
> the way that you want to run your system, but I don't think it's
> unreasonable for someone else to want it.
>

Agreed: there's an analogy to RAID here. When a disk goes out, it still
allows writes, but moves to a degraded state. Hopefully your monitoring
system notifies you, and you fix it.

Also, let's say that the standby suffers catastrophic storage failure.
Now you only have your data on one server anyway (the primary).
Rejecting new transactions from committing doesn't save all the old
transactions in the event of a subsequent storage failure on the
primary.

I'm not advocating this option in particular, other than saying that it
seems like a reasonable option to me.

Regards,
Jeff Davis




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Josh Berkus

Peter Eisentraut wrote:

Simon Riggs wrote:

I am truly lost to understand why the *name* synchronous replication
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do*


It's the color of the bikeshed ...


Hmmm.  I thought this was pretty clear.  There's three levels of synch 
which are useful features:


1) Synchronous standby which is really asynchronous, but only has a gap 
of < 100ms.


2) Synchronous standby which guarantees that all committed transactions 
are on the failover node and that no data will be lost for failover, but 
the failover node is still in standby mode.


3) Synchronous replication where the standby node has identical 
transactions to the master node, and is queryable read-only.


Any of these levels would be useful and allow a certain number of our 
users to deploy PostgreSQL in an environment where it wasn't used 
before.  So if we can only do (2) for 8.4, that's still very useful for 
telecoms and banks.


--Josh




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Ron Mayer

Josh Berkus wrote:


Hmmm.  I thought this was pretty clear.  There's three levels of synch 
which are useful features:


1) Synchronous standby which is really asynchronous, but only has a gap 
of < 100ms.


2) Synchronous standby which guarantees that all committed transactions 
are on the failover node and that no data will be lost for failover, but 
the failover node is still in standby mode.


3) Synchronous replication where the standby node has identical 
transactions to the master node, and is queryable read-only.


Any of these levels would be useful


Isn't the queryable read-only feature totally orthogonal with
how synchronous the replication is?

For one reporting system I have, where new data is continually
being added every second; I'd love to have a read-only-slave
even if that system has the 100ms gap you mentioned in #1.

Heck I don't care if the queries it runs even have a 100 *minute*
gap; but I sure would like it to be synchronous in the sense
that all the transactions survive a failure of the primary.





Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Josh Berkus



Isn't the queryable read-only feature totally orthogonal with
how synchronous the replication is?


Yes.  However, it introduces specific difficult issues which an 
unreadable synchronous slave does not have.


--Josh



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Mon, 2008-12-15 at 13:43 -0800, Josh Berkus wrote:
> > Isn't the queryable read-only feature totally orthogonal with
> > how synchronous the replication is?
>
> Yes.  However, it introduces specific difficult issues which an
> unreadable synchronous slave does not have.

Don't think it's hugely difficult, but there are multiple ways of doing
this. But it is irrelevant until we have the basic ability to run
queries.

I've explained this twice now on different parts of this thread. Could I
politely direct your attention to those posts?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Josh Berkus

Simon,


I've explained this twice now on different parts of this thread. Could I
politely direct your attention to those posts?


Chill.  I was just explaining that the *goal* of sync standby was not 
complicated or really something to be argued about.  It's pretty clear.


--Josh




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Mon, 2008-12-15 at 13:06 -0800, Josh Berkus wrote:
 Peter Eisentraut wrote:
  Simon Riggs wrote:
  I am truly lost to understand why the *name* synchronous replication
  causes so much discussion, yet nobody has discussed what they would
  actually like the software to *do*
  
  It's the color of the bikeshed ...
 
 Hmmm.  I thought this was pretty clear.  There's three levels of synch 
 which are useful features:
 
 1) Synchronous standby which is really asynchronous, but only has a gap 
 of < 100ms.
 
 2) Synchronous standby which guarantees that all committed transactions 
 are on the failover node and that no data will be lost for failover, but 
 the failover node is still in standby mode.
 
 3) Synchronous replication where the standby node has identical 
 transactions to the master node, and is queryable read-only.

 Any of these levels would be useful and allow a certain number of our 
 users to deploy PostgreSQL in an environment where it wasn't used 
 before.  So if we can only do (2) for 8.4, that's still very useful for 
 telecoms and banks.

The (2) mentioned here could be any of sync points #2-5 referred to
upthread. Different people have requested different levels of
robustness. Looking at DRBD and Oracle, they both subdivide (2) into at
least two further levels of option. So (2) is too broad a brush to paint
with.

I don't believe that (2) as stated is sufficient for banks, though is
reasonable for many telco applications. But #4 or #5 would be suitable
for banks, i.e. we must fsync to disk for very high value transactions.

The extra code to do this is minor, which is why I've asked Fujii-san to
include it now within the patch.

All of this is controllable by the parameter synchronous_replication,
which it is important to note can be set for each individual transaction
rather than just fixed for the whole server. This is identical to the
way we can mix synchronous commit and asynchronous commit transactions.
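
As a rough illustration of a per-transaction setting overriding a server-wide default (sketched in Python; `SERVER_DEFAULT` and `effective_sync_rep` are made-up names, and the value strings are placeholders rather than the parameter's actual values):

```python
from typing import Optional

# Hypothetical server-wide default for synchronous_replication.
SERVER_DEFAULT = "off"


def effective_sync_rep(transaction_override: Optional[str]) -> str:
    """Return the setting a transaction runs with.

    A transaction that sets its own value uses it; any other transaction
    falls back to the server-wide default.
    """
    if transaction_override is not None:
        return transaction_override
    return SERVER_DEFAULT
```

A high-value transaction could then pass its own value while the bulk of the workload runs with the cheaper default.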

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Fujii Masao
Hi,

Sorry for this late reply. And, thanks for the hot discussion ;)

On Tue, Dec 16, 2008 at 1:24 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Fujii-san,

 Just repeating this in case you lost this comment:

 On Mon, 2008-12-15 at 09:40 +0000, Simon Riggs wrote:

 Fujii-san, please can we incorporate those two options, rather than just
 one choice synchronous_replication = on. They look like two commonly
 requested options.

 I see the comment in line 230+ of walreceiver.c, so understand that you
 have implemented option #3 from the following list.

 So from my previous list

 1. We sent the message to standby (A)
 2. We received the message on standby
 3. We wrote the WAL to the WAL file (B)
 4. We fsync'd the WAL file (C)
 5. We CRC checked the WAL commit record
 6. We applied the WAL commit record

 Please could you also add an option #4, i.e. add the *option* to fsync
 the WAL to disk at commit time also. That requires us to add a third
 option to synchronous_replication parameter.

Should the above option be configured on the primary or on the standby?
Configuring it on the primary makes it possible to vary it from transaction
to transaction. On the other hand, it would need to be configured on the
standby in order to choose it separately for each standby (in the future).

I prefer the latter, and thought that it should be added to recovery.conf.
That is, synchronous_replication would identify only whether commit waits for
replication (if the name is confusing, I would rename it), and the above
options (#1-#6) would be chosen in recovery.conf. What is your opinion?

 #6 is an additional synchronization step in Hot Standby. I would say
 that people won't want that when they see how it performs (they probably
 won't want #4 either for that same reason, but that is for robustness).

 We can jointly add option #6 once we have both sync rep and hot standby
 committed, or at a late stage of hot standby development. There's not
 much point looking at it before then.

Agreed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke

Robert Haas wrote:

On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  

We won't call it anything, because we never will or can implement that.
See the theory of relativity: the notion of exactly simultaneous events



OK, fine.  I'll be more precise.  I think we need to reserve the term
"synchronous replication" for a system where transactions that begin
on the standby after the transaction has committed on the master see
the effects of the committed transaction.
  


Wouldn't this be serialized transactions?

I'd like to see proof of some sort that PostgreSQL guarantees that the 
instant a 'commit' returns, any transactions already open with the 
appropriate transaction isolation level, or any new sessions *will* see 
the results of the commit.


I know that most of the time this happens - but what process 
synchronization steps occur to *guarantee* that this happens?



I just googled "synchronous replication" and read through the first
page of hits.  Most of them do not address the question of whether
synchronous replication can be said to have been completed when WAL has
been received by the standby but not yet applied.  One of the ones
that does is:

http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign

...which refers to what we're proposing to call Synchronous
Replication as Semi-Synchronous Replication (or 2-safe replication)
specifically to distinguish it.  The other is:

http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf

...which doesn't specifically examine the issue but seems to take the
opposite position, namely that the server on which the transaction is
executed needs to wait only for one server to apply the changes to the
database (the others need only to know that they need to commit it;
they don't actually need to have done it).  However, that same paper
refers to two-phase commit as a synchronous replication algorithm, and
Wikipedia's discussion of two-phase commit:

http://en.wikipedia.org/wiki/Two-phase_commit_protocol

...clearly implies that the transaction must be applied everywhere
before it can be said to have committed.

The second page of Google results is mostly a further discussion of
the MySQL solution, which is mostly described as semi-synchronous
replication.

Simon Riggs said upthread that Oracle called this synchronous redo
transport.  That is obviously much closer to what we are doing than
synchronous replication.
  


Two phase commit doesn't imply that the transaction is guaranteed to be 
immediately visible. See my previous paragraph. Unless transactions are 
locked from starting until they are able to prove that they have the 
latest commit (a feat which I'm going to theorize as impossible - 
because the moment you wait for a commit, and you begin again, you 
really have no guarantee that another commit has not occurred in the 
mean time), I think it's clear that two phase commit guarantees that the 
commit has taken place, but does *not* guarantee anything about visibility.


It might be a good bet - but guarantee? There is no such guarantee.

Cheers,
mark

--
Mark Mielke m...@mielke.cc



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Emmanuel Cecchet

Robert Haas wrote:

The term of art for making sure that transactions committed on the
primary are visible on the secondary seems to be one-copy
serializability (see, for example, a Google Books search on that
term).
Not exactly. 1-copy-serializability, which is the standard for 
multi-master solutions, guarantees that transactions are executed in the 
same serializable order at each replica (which means that transactions 
can be executed in different order and committed at different times on 
different replicas as long as a consistent serializable view is presented 
to the client).
There are a number of optimizations in that area but in a multi-master 
case, replicas rarely commit at the same time. There are interesting 
papers on the subject (like Tashkent & Tashkent+ based on Postgres) for 
those who want to understand these problems more thoroughly.


Hope this helps,
manu

--
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting

--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Simon Riggs

On Sun, 2008-12-14 at 13:31 +0900, Tatsuo Ishii wrote:
  The point here is that synchronous replication, at least to some
  people, is going to imply that the user-visible states of the two
  copies are consistent.  To other people, it is going to imply that
  committed transactions will never be lost even in the event of a
  catastrophic loss of the primary 1 picosecond after the commit is
  acknowledged.  We need to choose some word that implies that we are
  guaranteeing the latter of these two things but not the former.
  Otherwise, we will have confused users, and terminological confusion
  when and if we ever implement the former as well.
 
 Right. Before watching this thread, I had thought that the log
 shipping sync replication behaves as the former (and I had told so to people
 in Japan who are interested in 8.4 development. Of course this is my
 fault, though).
 
 Now I understand the log shipping sync replication does not behave
 the same as other sync replications such as pgpool and PGCluster (there
 may be more, but I don't know)

GENERAL COMMENTS, not to anybody in particular:


'Tis but thy name that is my enemy.
...
What's in a name? That which we call a rose
By any other name would smell as sweet.
...

Juliet, from Romeo and Juliet


I am truly lost to understand why the *name* synchronous replication
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do* (this being a software discussion
list...). AFAICS we can make the software behave like *any* of the
definitions discussed so far.

It is certainly far too early to say what the final exact behaviour will
be and there is no reason at all to pre-suppose that it need only be a
single behaviour. I'm in favour of options, generally, but I would say
that the distinction between some of these options is mostly very fine
and strongly doubt whether people would use them if they existed. *But*
I think we can add them at a later stage of development if requirements
genuinely exist once all the benefits *and* costs are understood.

I would also point out that the distinction made between various
meanings of synchronous is *only* important if Hot Standby is included
as well. And that is closely linked to the replication feature, which we
really need to complete first. We have much to do yet.

So let's please end the name debate there and think about software.

...

We can make the reply to a commit message when any of the following
events have occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Now you might think from what people have said that having synchronised
contents on both primary and standby is the only way to achieve exactly
the same results to queries on both nodes. Another way is to utilise a
snapshot taken on the primary and simply wait until the standby catches
up with that snapshot's LSN. So there is more than one way of achieving
a particular result and it is not dependent upon the exact
synchronisation we employ at commit time.
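
The snapshot-plus-LSN approach above might look roughly like this. It is a sketch in Python under stated assumptions: `wait_for_replay` is a hypothetical helper, LSNs are modelled as plain integers, and `standby_replay_lsn` is an assumed callable that reports how far the standby has replayed.

```python
import time
from typing import Callable


def wait_for_replay(snapshot_lsn: int,
                    standby_replay_lsn: Callable[[], int],
                    timeout_s: float = 1.0,
                    poll_s: float = 0.01) -> bool:
    """Block until the standby has replayed up to snapshot_lsn.

    Returns True once the standby's replay position reaches the LSN of the
    snapshot taken on the primary, or False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if standby_replay_lsn() >= snapshot_lsn:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The wait happens on the reading side, so the level of synchronisation chosen at commit time does not need to change to get the same query results on both nodes.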

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke

Simon Riggs wrote:

I am truly lost to understand why the *name* synchronous replication
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do* (this being a software discussion
list...). AFAICS we can make the software behave like *any* of the
definitions discussed so far.
  


I think people have talked about 'like' in the context of user 
expectations. That is, there seems to exist a set of people (probably 
those who've never worked with a multi-replica solution before) who 
expect that once commit completes on one server, they can query any 
other master or slave and be guaranteed visibility of the transaction 
they just committed. These people may theoretically change their 
decision to not use Postgres-R, or at least change their approach to how 
they work with Postgres-R, if the name was in some way more intuitive to 
them in terms of what is actually being provided.


Synchronous replication itself says only details about replication, it 
does not say anything about visibility, so to some degree, people are 
focusing on the wrong term as the problem. Even if it says asynchronous 
replication - not sure that I care either way - this doesn't improve 
the understanding for the casual user of what is happening behind the 
scenes. Neither synchronous nor asynchronous guarantees that the change 
will be immediately visible from other nodes after I type 'commit;'. 
Asynchronous might err on the side of not immediately visible, where 
synchronous might (incorrectly) imply immediate visibility, but it's not 
an accurate guarantee to provide.


Synchronous does not guarantee visibility immediately after. Some 
indefinite but usually short time must normally pass from when my 
'commit;' completes until when the shared memory visible to my process 
sees the transaction. Multiple replicas with network latency or 
reliability issues increases the theoretical minimum size of this window 
to something that would be normally encountered as opposed to something 
that is normally not encountered.


The only way to guarantee visibility is to ensure that the new 
transaction is guaranteed to be visible from a shared memory perspective 
on every machine in the pool, and every active backend process. If my 
'commit;' is going to wait for this to occur, first, I think this forces 
every commit to have numerous network round trips to each machine in the 
pool, it forces each machine in the pool to be network accessible and 
responsive, it forces all commits to be serialized in the sense of the 
slowest machine in the pool determines the time for my commit to 
complete, and I think it implies some sort of inter-process signalling, 
or at the very least CPU level signalling about shared memory (in the 
case of multiple CPUs).


People such as myself think that a visibility guarantee is unreasonable 
and certain to cause scalability or reliability problems. So, my 'like' 
is an efficient multi-master solution where if I put 10 machines in the 
pool, I expect my normal query/commit loads to approach 10X as fast. My 
like prefers scalability over guarantees that may be difficult to 
provide, and probably are not provided today even in a single server 
scenario.



It is certainly far too early to say what the final exact behaviour will
be and there is no reason at all to pre-suppose that it need only be a
single behaviour. I'm in favour of options, generally, but I would say
that the distinction between some of these options is mostly very fine
and strongly doubt whether people would use them if they existed. *But*
I think we can add them at a later stage of development if requirements
genuinely exist once all the benefits *and* costs are understood.
  


The above 'commit;' behaviour difference - whether it completes when the 
commit is permanent (it definitely will be applied for certain to all 
replicas - it just may take time to apply to all replicas), or when the 
commit has actually taken effect (two-phase commit on all replicas - and 
both phases have completed on all replicas - what happens if second 
phase commit fails on one or more servers?), or when the commit is 
guaranteed to be visible from all existing and new sessions (two-phase 
commit plus additional signalling required?) might be such an option.


I'm doubtful, though - as the difference in implementation between the 
first and second is pretty significant.


I'm curious about your suggestion to direct queries that need the latest 
snapshot to the 'primary'. I might have misunderstood it - but it seems 
that the expectation from some is that *all* sessions see the latest 
snapshot, so would this not imply that all sessions would be redirected to 
the 'primary'? I don't think it is reasonable myself, but I might be 
misunderstanding something...


Cheers,
mark

--
Mark Mielke m...@mielke.cc



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Robert Haas
> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Also

0. The same time we would have done so if replication had not been
configured at all.

I think the basic problem here is that we can talk about asynchronous
replication and synchronous replication but there are n > 2
possible/useful behaviors (I would guess principally 0, 2, 4, and 6,
but YMMV).  So we're going to need some way to clarify what we mean.

BTW, in case my previous emails on this topic might have given someone
the contrary impression, I'm not really that worked up about this
either.  Interesting?  Yes.  Have opinions?  Yes.  Lie awake nights
worrying about it?  Nope.  :-)

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke

Mark Mielke wrote:
Forget replication - even for the exact same server - I don't expect 
that if I commit from one session, I will be able to see the change 
immediately from my other session or a new session that I just opened. 
Perhaps this is often stable to rely on this, and it is useful for the 
database server to minimize the window during which the commit becomes 
visible to others, but I think it's a false expectation from the start 
that it absolutely will be immediately visible to another session. I'm 
thinking of situations where some part of the table is in cache. The 
only way the commit can communicate that the new transaction is 
available is by communicating between the processes or threads, 
or between the multiple CPUs on the machine. Do I want every commit to 
force each session to become fully in alignment before my commit 
completes? Does PostgreSQL make this guarantee today? I bet it doesn't 
if you look far enough into the guts. It might be very fast - I don't 
think it is infinitely fast.


FYI: I haven't been able to prove this. Multiple sessions running on my 
dual-core CPU seem to be able to see the latest commits before they 
begin executing. Am I wrong about this? Does PostgreSQL provide an 
intentional guarantee that a commit from one session that completes 
immediately followed by a query from another session will always find 
the commit effect visible (provided the transaction isolation level 
doesn't get in the way)? Or is the machine and algorithms just fast 
enough that by the time it executes the query (up to 1 ms later) the 
commit is always visible in practice?


Cheers,
mark

--
Mark Mielke m...@mielke.cc




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Heikki Linnakangas

Mark Mielke wrote:

Mark Mielke wrote:
Forget replication - even for the exact same server - I don't expect 
that if I commit from one session, I will be able to see the change 
immediately from my other session or a new session that I just opened. 
Perhaps this is often stable to rely on this, and it is useful for the 
database server to minimize the window during which the commit becomes 
visible to others, but I think it's a false expectation from the start 
that it absolutely will be immediately visible to another session. I'm 
thinking of situations where some part of the table is in cache. The 
only way the commit can communicate that the new transaction is 
available is through communication between the processes or threads, 
or between the multiple CPUs on the machine. Do I want every commit to 
force each session to become fully in alignment before my commit 
completes? Does PostgreSQL make this guarantee today? I bet it doesn't 
if you look far enough into the guts. It might be very fast - I don't 
think it is infinitely fast.


FYI: I haven't been able to prove this. Multiple sessions running on my 
dual-core CPU seem to be able to see the latest commits before they 
begin executing. Am I wrong about this? Does PostgreSQL provide an 
intentional guarantee that a commit from one session that completes, 
immediately followed by a query from another session, will always find 
the commit effect visible (provided the transaction isolation level 
doesn't get in the way)?


Yes. PostgreSQL does guarantee that, and I would expect any other DBMS 
to do the same.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Dimitri Fontaine

Hi,

Le 14 déc. 08 à 16:48, Simon Riggs a écrit :

I am truly lost to understand why the *name* synchronous replication
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do* (this being a software discussion
list...). AFAICS we can make the software behave like *any* of the
definitions discussed so far.


It seems that the easy parts are the ones most people will
participate in. Maybe it's that simple.



We can make the reply to a commit message when any of the following
events have occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record


Ok, so let's talk about this easy part: my understanding of
synchronous replication is that it gives its users the strong
guarantee that at commit time the transaction is secured on the
slave(s). That means you get the D of ACID on more than one server.


Why synchronous? Because you know the durability is ensured exactly  
when you receive the COMMIT ack.


So I'm with Simon on this, the term Synchronous Replication does  
describe accurately what's being implemented here, and on the other  
hand, as so many of us are saying, it's true that it tells very little  
about it. Those 6 options are all in the scope of the infamous naming,  
just different guarantee levels, from almost strong to very strong,
with some almost, but not quite, entirely unlike the strong I want.  
Pick your naming here too.


At least, that's how I understand it. The reason I bothered
sending this email is that maybe it'll help some people
recover from sleep deprivation ;)


My 2¢,
--
dim


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Ron Mayer

Robert Haas wrote:

We can make the reply to a commit message when any of the following
events have occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record


Perhaps it'd be useful if the failure modes these are trying to
protect against were described too.

If I understand right.

1. Protects all the transactions from the failure of the
   master; so long as neither the network nor the slave
   machine die soon?

2. Protects all the transactions from the failure of the
   master and the network between the slave and master,
   so long as the slave doesn't die soon?

3. Same as #2?

4. Protects against the failure of the master, the network,
   and parts of the slave; so long as the slave's disk
   survives the failure?

5. Protects against all of the above, and bit-errors in the
   memories of the slave machine (except the slave's disk
   controller?)?   Or are we reading-back the CRC from the
   slave's disk and comparing to the CRC computed on the
   master where it might protect from even more?

6. Same as 4?

If this is right, #2, #3, #4, and #6 feel similar except
that they're protecting against failures of different (but
still all incomplete) subsets of the hardware on the slave, right?


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke

Heikki Linnakangas wrote:

Mark Mielke wrote:
FYI: I haven't been able to prove this. Multiple sessions running on 
my dual-core CPU seem to be able to see the latest commits before 
they begin executing. Am I wrong about this? Does PostgreSQL provide 
an intentional guarantee that a commit from one session that completes, 
immediately followed by a query from another session, will always find 
the commit effect visible (provided the transaction isolation level 
doesn't get in the way)?
Yes. PostgreSQL does guarantee that, and I would expect any other DBMS 
to do the same.


Where does the expectation come from? I don't recall ever reading it in 
the documentation, and unless the session processes are contending over 
the integers (using some sort of synchronization primitive) in memory 
that represent the latest visible commit on every single select, I'm 
wondering how it is accomplished? If they are contending over these 
integers, doesn't that represent a scaling limitation, in the sense that 
on a 32-core machine, they're going to be fighting with each other to 
get the latest version of these shared integers into the CPU for 
processing? Maybe it's such a small penalty that we don't care? :-)


I was never instilled with the logic that 'commit in one session 
guarantees visibility of the effects in another session'. But, as I say 
above, I wasn't able to make PostgreSQL fail in this regard. So maybe 
I have no clue what I am talking about? :-)


If you happen to know where the code or documentation makes this 
promise, feel free to point it out. I'd like to review the code. If you 
don't know - don't worry about it, I'll find it later...


Cheers,
mark

--
Mark Mielke m...@mielke.cc



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Greg Stark
When the database says the data is committed it has to mean the data  
is really committed. Imagine if you looked at a bank account balance  
after withdrawing all the money and saw a balance which didn't reflect  
the withdrawal and allowed you to withdraw more money again...


--
Greg


On 14 Dec 2008, at 14:44, Mark Mielke m...@mark.mielke.cc wrote:


Mark Mielke wrote:
Forget replication - even for the exact same server - I don't  
expect that if I commit from one session, I will be able to see the  
change immediately from my other session or a new session that I  
just opened. Perhaps it is often safe to rely on this, and it  
is useful for the database server to minimize the window during  
which the commit becomes visible to others, but I think it's a  
false expectation from the start that it absolutely will be  
immediately visible to another session. I'm thinking of situations  
where some part of the table is in cache. The only way the commit  
can communicate that the new transaction is available is through  
communication between the processes or threads, or between the  
multiple CPUs on the machine. Do I want every commit to force each  
session to become fully in alignment before my commit completes?  
Does PostgreSQL make this guarantee today? I bet it doesn't if you  
look far enough into the guts. It might be very fast - I don't  
think it is infinitely fast.


FYI: I haven't been able to prove this. Multiple sessions running on  
my dual-core CPU seem to be able to see the latest commits before  
they begin executing. Am I wrong about this? Does PostgreSQL provide  
an intentional guarantee that a commit from one session that  
completes, immediately followed by a query from another session, will  
always find the commit effect visible (provided the transaction  
isolation level doesn't get in the way)? Or are the machine and  
algorithms just fast enough that by the time it executes the query  
(up to 1 ms later) the commit is always visible in practice?


Cheers,
mark

--
Mark Mielke m...@mielke.cc



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Robert Haas
 If this is right, #2, #3, #4, and #6 feel similar except
 that they're protecting against failures of different (but
 still all incomplete) subsets of the hardware on the slave, right?

Right.  Actually, the biggest difference with #6 has nothing to do
with protecting against failures.  It has rather to do with the ease
of writing applications in the context of hot standby.  You can close
your connection, open a connection to a different server, and know
that your transactions will be reflected there.  On the other hand,
I'd be surprised if it didn't come with a substantial performance
penalty, so it may not be too practical in real life even if it sounds
good on paper.

#1 , #3, and #5 don't feel that useful to me.  In the case of #1,
sending your WAL over the network and then not checking that it got
there is sort of silly: the likelihood of packet loss on the network
has got to be several orders of magnitude more likely than a failure
on the master.  #3 and #5 just don't seem to provide any real benefits
over their immediate predecessors.

Honestly, I think the most useful thing is probably going to be
asynchronous replication: in other words, when a commit is requested
on the master, we write WAL and return success.  In the background, we
stream the WAL to a secondary, which writes it and applies it.  This
will give us a secondary which is mostly up to date (and can run
queries, with hot standby) without killing performance.  The other
options are going to be for environments where losing a transaction is
really, really bad, or (in the case of #6) read-mostly environments
where it's useful to spread the query load out across several servers,
but the overhead associated with waiting for the rare write
transactions to apply everywhere is tolerable.

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke

Greg Stark wrote:
When the database says the data is committed it has to mean the data 
is really committed. Imagine if you looked at a bank account balance 
after withdrawing all the money and saw a balance which didn't reflect 
the withdrawal and allowed you to withdraw more money again...


Within the same session - sure. From different sessions? PostgreSQL MVCC 
lets you see an older snapshot, although it does prefer to have the 
latest snapshot with each command.


To prevent withdrawing more money again, I would expect some sort of 
locking, such as SELECT ... FOR UPDATE, to be used. This lock then forces the 
two transactions to become serialized and the second will either wait 
for the first to complete or fail. Any banking program that assumed that 
it could SELECT to confirm a balance and then UPDATE to withdraw the 
money as separate instructions would be a bad banking program. To 
exploit it, I would just have to start both operations at the same time 
- they both SELECT, they both see I have money, they both give me the 
money and UPDATE, and I get double the money (although my balance would 
show a big negative value - but I'm already gone...). Database 101.
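
A minimal sketch of the race described above, written as a deterministic
Python simulation rather than real SQL. The function names, the account
name and the amounts are made up for illustration, and threading.Lock is
only a stand-in for the row lock that SELECT ... FOR UPDATE would take; it
does not model PostgreSQL's actual locking.

```python
import threading


def naive_double_withdraw(balances, account, amount):
    """Worst-case interleaving: both sessions SELECT before either UPDATEs."""
    seen_by_a = balances[account]          # session A: SELECT balance
    seen_by_b = balances[account]          # session B: SELECT balance (same value)
    paid_out = 0
    for seen in (seen_by_a, seen_by_b):
        if seen >= amount:                 # each check uses its own stale read
            balances[account] -= amount    # UPDATE issued as a separate statement
            paid_out += amount
    return paid_out


def locked_withdraw(balances, row_lock, account, amount):
    """Check and update under one lock, so the two sessions are serialized."""
    with row_lock:
        if balances[account] >= amount:
            balances[account] -= amount
            return amount
        return 0


naive = {"mark": 100}
naive_paid = naive_double_withdraw(naive, "mark", 100)
print(naive_paid, naive["mark"])           # 200 -100

safe = {"mark": 100}
row_lock = threading.Lock()
safe_paid = sum(locked_withdraw(safe, row_lock, "mark", 100) for _ in range(2))
print(safe_paid, safe["mark"])             # 100 0
```

In the unlocked version both sessions pay out and the balance goes
negative; in the locked version the second withdrawal sees the updated
balance and is refused.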


When I asked "does PostgreSQL guarantee this?", I didn't mean hand-waving 
examples or hand-waving expectations. I meant a pointer into the 
code that has some comment that says "we want to guarantee that a commit 
in one session will be immediately visible to other sessions, and that a 
later select issued in the other sessions will ALWAYS see the commit 
whether 1 nanosecond later or 200 seconds later". Robert's expectation 
and yours seem like taking this guarantee for granted rather than 
being justified with design intent and proof thus far. :-) Given my 
experiment to try and force it to fail, I can see why this would be 
taken for granted. Is this a real promise, though? Or just an unlikely 
scenario that never seems to be hit?


To me, the question is relevant in terms of the expectations of a 
multi-replica solution. We know people have the expectation. We know it 
can be convenient. Is the expectation valid in the first place?


I've probably drawn this question out too long and should do my own 
research and report back... Sorry... :-)


Cheers,
mark

--
Mark Mielke m...@mielke.cc



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Heikki Linnakangas

Mark Mielke wrote:
When I asked "does PostgreSQL guarantee this?", I didn't mean hand-waving 
examples or hand-waving expectations. I meant a pointer into the 
code that has some comment that says "we want to guarantee that a commit 
in one session will be immediately visible to other sessions, and that a 
later select issued in the other sessions will ALWAYS see the commit 
whether 1 nanosecond later or 200 seconds later". Robert's expectation 
and yours seem like taking this guarantee for granted rather than 
being justified with design intent and proof thus far. :-) Given my 
experiment to try and force it to fail, I can see why this would be 
taken for granted. Is this a real promise, though? 


Yes.

In a nutshell, commit works like this:

1. Write and flush WAL record about the commit
2. Mark the transaction as committed in clog
3. Remove the xid from the shared memory ProcArray.
4. Release locks and other resources
5. Reply to client that the transaction has been committed.

After step 3, any backend taking a snapshot will see the transaction as 
committed. Since we only reply to the client at step 5, it is guaranteed 
that a transaction beginning after step 5, as well as an already open 
transaction taking a new snapshot (ie. running a new command in read 
committed mode) after that will see the transaction as committed.


The relevant code is in CommitTransaction() in xact.c.
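
The ordering argument above can be illustrated with a toy in-memory model.
This is only a sketch under stated assumptions: the class MiniCluster and
all of its fields and methods are hypothetical names for illustration, not
PostgreSQL data structures or functions, and the visibility rule is reduced
to the bare minimum needed to show why replying at step 5 is enough.

```python
import threading


class MiniCluster:
    """Toy model of the five commit steps; not PostgreSQL code."""

    def __init__(self):
        self._lock = threading.Lock()   # stand-in for the lock guarding the ProcArray
        self.wal = []                   # "flushed" WAL records
        self.clog = {}                  # xid -> commit status
        self.in_progress = set()        # xids still listed in the toy ProcArray
        self.next_xid = 1

    def begin(self):
        with self._lock:
            xid = self.next_xid
            self.next_xid += 1
            self.in_progress.add(xid)
            return xid

    def commit(self, xid):
        self.wal.append(("commit", xid))     # 1. write and flush the WAL record
        self.clog[xid] = "committed"         # 2. mark the transaction committed in clog
        with self._lock:
            self.in_progress.discard(xid)    # 3. remove the xid from the ProcArray
        # 4. release locks and other resources (nothing to release in this toy)
        return "COMMIT"                      # 5. reply to the client

    def snapshot(self):
        """Record the next unassigned xid and the xids running right now."""
        with self._lock:
            return self.next_xid, frozenset(self.in_progress)

    def is_visible(self, xid, snap):
        xmax, running = snap
        return (xid < xmax and xid not in running
                and self.clog.get(xid) == "committed")


db = MiniCluster()
xid = db.begin()
before = db.snapshot()        # taken while the transaction is still running
reply = db.commit(xid)        # the client only learns of the commit here
after = db.snapshot()         # any snapshot taken after the reply

print(db.is_visible(xid, before))   # False
print(db.is_visible(xid, after))    # True
```

Because the reply is produced only after the xid has left the running set,
a snapshot taken after the client has seen the reply cannot still list the
transaction as in progress.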

To me, the question is relevant in terms of the expectations of a 
multi-replica solution. We know people have the expectation.


Yeah, I think Robert is right. We should reserve the term synchronous 
replication for the mode where that guarantee holds for the slave as well.


In fact, waiting for reply from standby server before acknowledging a 
commit to the client is a bit pointless otherwise. It puts you in a 
strange situation, where you're waiting for the commits in normal 
operation, but if there's a network glitch or the standby goes down, 
you're willing to go ahead without it. You get a high guarantee that 
your data is up-to-date in the standby, except when it isn't. Which 
isn't much of a guarantee.


But with hot standby, it makes a lot of sense. The guarantee is that if 
the standby is accepting queries, it's up-to-date with the primary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Simon Riggs

On Sat, 2008-12-13 at 00:00 +0100, Markus Wanner wrote:
 Hi,
 
 Fujii Masao wrote:
  I'd like to define the meaning of synch rep again. synch rep means:
  
  (1) Transaction commit waits for WAL records to be replicated to the standby
before the command returns a success indication to the client.
  
  (2) The standby has (can read) all WAL files indispensable for recovery.
 
 Let me point out that - very much like the original Postgres-R algorithm
 - this guarantees committed transactions to be durable and consistent
 (no late aborts of conflicting transactions), but it does not guarantee
 that a transaction committed on one node is immediately visible on the
 other node. In that sense, it is not synchronous as commonly understood,
 because it does not operate with all their parts in synchrony [1], as
 implied by the term synchronous. This might (and often has in the
 past) lead to confusion.

You're right that neither the data transfer nor data availability is
entirely synchronous, but data transfer is synchronous at time of
*commit*: it is recorded on multiple nodes at the same time.

The term synchronous replication is already well used in the industry
to mean synchronous commit, so I don't think we should change the name
now. The project here is also known to everybody as synch rep.

* Oracle Data Guard calls it synchronous redo transport
* MS Exchange calls it synchronous replication
* MS SQL Server has Database Mirroring, Log Shipping and
Replication. Database Mirroring provides a synchronous mechanism, with
Replication meaning data transfer to other databases,
publish/subscribe.
* DB2 HADR provides synchronous replication
* MySQL calls it synchronous replication

What is confusing is that replication itself is a much abused term and
is used to describe technologies for HA, DR and data movement.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Simon Riggs wrote:
 You're right that neither the data transfer nor data availability is
 entirely synchronous, but data transfer is synchronous at time of
 *commit*: it is recorded on multiple nodes at the same time.

I'm unsure what you mean by a data transfer being synchronous. To what
other process or state should the data transfer be synchronous?

 The term synchronous replication is already well used in the industry
 to mean synchronous commit, so I don't think we should change the name
 now. The project here is also known to everybody as synch rep.

I understand very well, that you don't want to change the name. I've
been hesitant to relabel Postgres-R from synchronous to asynchronous
to eager.

However, that is a marketing decision [1], which should not be mixed
with the technical discussion here. Speaking of a synchronous commit
is utterly misleading, because the commit itself is exactly the thing
that's *not* synchronous.

It *is* an optimization to fully synchronous replication to defer commit
on the slave and only make sure that the transaction *can* be applied
at some time in the future.

However, this *does* have the drawback of transactions not being
immediately visible on the slave. Often enough, this is acceptable. But
it certainly matters to some applications developers.

 What is confusing is that replication itself is a much abused term and
 is used to describe technologies for HA, DR and data movement.

I absolutely agree to that. And I'm thus recommending to at least be
consistent and honest with the term synchronous and point out that WAL
writing is synchronous for the log shipping approach here (AFAIK). But
that the commit is asynchronous for performance reasons. In other words:
this approach is certainly (and hopefully, for performance reasons)
different from a fully synchronous approach. Even for marketing reasons,
it might make sense to point out that difference (.. no, we are faster
than fully sync rep.).

Regards

Markus Wanner

[1]: Some people like the term virtually synchronous for marketing
purposes. That's at least half-ways technically correct.




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Grzegorz Jaskiewicz


On 2008-12-13, at 13:07, Markus Wanner wrote:



However, that is a marketing decision [1], which should not be mixed
with the technical discussion here. Speaking of a synchronous commit
is utterly misleading, because the commit itself is exactly the thing
that's *not* synchronous.




[1]: Some people like the term virtually synchronous for marketing
purposes. That's at least half-ways technically correct.


Marketing people are virtually trustworthy, from my life experience.
If you ask me, this is just preposterous.




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Simon Riggs

On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:

 Speaking of a synchronous commit
 is utterly misleading, because the commit itself is exactly the thing
 that's *not* synchronous.

Not really sure where you're going here. synchronous replication is
used exactly as described in the Wikipedia entry here:
http://en.wikipedia.org/wiki/Database_replication

No two-word phrase is going to accurately sum up the complexity and
potential for data loss in these situations. DRBD saw that too and just
called them A, B and C and then described them more accurately.

But I don't think we should say "PostgreSQL just implemented algorithm
B", which is just unhelpful. I don't think it's marketing to refer to it
by the phrase most commonly used for the technology we are building.
Nobody suggested we call it "wizrep" or suchlike...

The docs can contain the exact description of data loss and timing
windows.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Simon Riggs wrote:
 On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:
 Speaking of a synchronous commit
 is utterly misleading, because the commit itself is exactly the thing
 that's *not* synchronous.
 
 Not really sure where you're going here.

I'm pointing to a potential misunderstanding, trying to help to prevent
you from running into the same issues and discussions as I did.

I've learned the hard way, that the Postgres-R algorithm is not fully
synchronous (in the strict sense). This caused confusion for people who
take the word synchronous by its original meaning. The algorithm
proposed here seems similar enough to potentially cause the same confusion.

As I see it now, I think it's well worth pointing out the difference,
from both the technical and the marketing perspective. The
former for better understanding, the latter to prevent users from
thinking it must be slow by definition. Arguing that your approach is
not fully synchronous definitely helps in addressing that concern.

However, I'm just now realizing, that the difference is only relevant as
soon as you begin to allow read-only access on the slave. AFAIK that's
among the goals of this effort, no?

 synchronous replication is
 used exactly as described in the Wikipedia entry here:
 http://en.wikipedia.org/wiki/Database_replication

That article describes pretty much all variants of replication, what
exactly are you referring to?

Under Database Replication  Multi-Master replication it describes
eager vs lazy variants, which is IMO a more appropriate and useful
distinction than sync vs async. (But that's admittedly a sentence I've
contributed myself, IIRC).

Under Storage Replication  Synchronous Replication one can read:
"Write is not considered complete until acknowledgement by both local
and remote storage." For the proposed approach this might hold true for
WAL writing. However, the user certainly doesn't care how synchronously
the log is shipped or written, as long as she doesn't see the
changes on the slave.

That's the difference between fully synchronous and eager (or virtually
or approximately synchronous) algorithms. You seem to refer to both as
synchronous. Phrases like synchronous commit or synchronous data
transfer do not help me to understand what exactly you are talking about.

Explaining that the slave commits (and therefore makes the transactions
visible) asynchronously would help. And it would prevent disappointment
for users who expect changes to be immediately visible on the slave.

 No two-word phrase is going to accurately sum up the complexity and
 potential for data loss in these situations. DRBD saw that too and just
 called them A, B and C and then described them more accurately.

Agreed. I've chosen lazy, eager and sync, so far. I'm open for better
terms, and I leave it up to you to call your variants whatever you like.
But to understand what you are talking about, I'd prefer to get to know
these distinctions crisp and clear.

 But I don't think we should say "PostgreSQL just implemented algorithm
 B", which is just unhelpful. I don't think it's marketing to refer to it
 by the phrase most commonly used for the technology we are building.

I certainly agree to using such terms. Unfortunately, in my experience,
synchronous replication is commonly used to mean that transactions are
guaranteed to be immediately visible on remote nodes after the client
got commit acknowledgment. That's the cause for confusion I'm envisioning.


I'm hoping to be somewhat helpful to this effort of getting a log
shipping replication variant into Postgres. It can only be beneficial
for Postgres-R in that we gain field experience with ..uhm.. this
special kind of replication, however we name it.

I'm already on xmas vacation, so I won't bother you any further on this
issue. Have fun coding and make sure to enjoy this time of the year.

All the best.

Markus Wanner



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas
 I certainly agree to using such terms. Unfortunately, in my experience,
 synchronous replication is commonly used to mean that transactions are
 guaranteed to be immediately visible on remote nodes after the client
 got commit acknowledgment. That's the cause for confusion I'm envisioning.

I think that's a very important point.  It's very possible that 8.4
may support both this feature and Hot Standby (although the latter
seems to have stalled a bit...).  That makes me think "oh, great, I
can offload any subset of my read-only queries to the standby."  Not
so fast.

I think we need to reserve the term synchronous replication for a
system where transactions that begin at the same time on the primary
and standby see the same tuples.  Clearly that is more synchronous
than what is being proposed here; if we call this synchronous
replication, what will we call that?  Really Synchronous, Honest, No
Kidding?   Admittedly, we may never implement that feature, but that
seems irrelevant.

It would be useful to have names for all the different possibilities.
 Random ideas:

Log Shipping.  After each log switch, the previous WAL log is copied
to the standby in its entirety.

WAL Streaming - Asynchronous.  The WAL log is streamed from master to
standby as it is written, but transactions on the master never wait.

WAL Streaming - Synchronous Receive.  The WAL log is streamed from
master to standby as it is written, and transactions on the master
wait until the standby acknowledges receipt of the WAL.

WAL Streaming - Synchronous Write.  The WAL log is streamed from
master to standby as it is written, and transactions on the master
wait until the standby acknowledges that the WAL has been written to
disk.

WAL Streaming - Synchronous Apply.  The WAL log is streamed from
master to standby as it is written, and transactions on the master
wait until the standby acknowledges that WAL has been written to disk
and applied.
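
As a rough sketch, the four WAL-streaming variants named above differ only
in how far the standby must have progressed before the master lets a commit
return. The enum and function names below are hypothetical and chosen for
illustration; they are not settings or identifiers from any patch in this
thread. Log Shipping is left out because it copies whole WAL files after a
log switch rather than streaming.

```python
from enum import Enum, IntEnum


class StandbyProgress(IntEnum):
    """How far the standby has got with a given commit's WAL."""
    NONE = 0        # nothing acknowledged yet
    RECEIVED = 1    # standby acknowledged receipt
    WRITTEN = 2     # standby acknowledged the WAL is written to disk
    APPLIED = 3     # standby acknowledged the WAL is written and applied


class StreamingMode(Enum):
    ASYNCHRONOUS = "asynchronous"
    SYNCHRONOUS_RECEIVE = "synchronous receive"
    SYNCHRONOUS_WRITE = "synchronous write"
    SYNCHRONOUS_APPLY = "synchronous apply"


# Minimum standby progress the master waits for in each mode.
REQUIRED_PROGRESS = {
    StreamingMode.ASYNCHRONOUS: StandbyProgress.NONE,
    StreamingMode.SYNCHRONOUS_RECEIVE: StandbyProgress.RECEIVED,
    StreamingMode.SYNCHRONOUS_WRITE: StandbyProgress.WRITTEN,
    StreamingMode.SYNCHRONOUS_APPLY: StandbyProgress.APPLIED,
}


def commit_may_return(mode, progress):
    """True once the standby has progressed far enough for this mode."""
    return progress >= REQUIRED_PROGRESS[mode]


for mode in StreamingMode:
    ok = commit_may_return(mode, StandbyProgress.WRITTEN)
    print(f"{mode.value}: commit may return after standby write = {ok}")
```

With the standby at "written", every mode except synchronous apply would
let the commit return, which is the gap between durability on the standby
and visibility on the standby that this thread keeps circling.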

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think we need to reserve the term synchronous replication for a
 system where transactions that begin at the same time on the primary
 and standby see the same tuples.  Clearly that is more synchronous
 than what is being proposed here; if we call this synchronous
 replication, what will we call that?  Really Synchronous, Honest, No
 Kidding?   Admittedly, we may never implement that feature, but that
 seems irrelevant.

We won't call it anything, because we never will or can implement that.
See the theory of relativity: the notion of exactly simultaneous events
at distinct locations isn't even well-defined, because observers at yet
other locations will disagree about what is simultaneous.  And I'm
not just making a joke here --- speed-of-light delays in a WAN are
meaningful compared to current computer speeds.  In practice, the
slave and the master will never commit at exactly the same time.
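
To put rough numbers on the speed-of-light point, here is a small
back-of-the-envelope calculation. The distances and the 3 GHz clock rate
are illustrative assumptions, not figures from this thread, and the delay
is the physical lower bound in vacuum; real network latency is higher.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458   # in vacuum


def one_way_delay_ms(distance_km):
    """Lower bound on one-way signal delay over the given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000.0


def cycles_during(delay_ms, clock_ghz):
    """CPU clock cycles that elapse during the delay."""
    return delay_ms * 1e-3 * clock_ghz * 1e9


for distance_km in (100, 1000, 5000):   # assumed example distances
    delay = one_way_delay_ms(distance_km)
    cycles = cycles_during(delay, clock_ghz=3.0)
    print(f"{distance_km} km: at least {delay:.3f} ms one way, "
          f"about {cycles / 1e6:.1f} million cycles at 3 GHz")
```

Even at 1000 km the one-way bound is about 3.3 ms, which is on the order
of ten million clock cycles at the assumed 3 GHz.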

I agree with the point made upthread that we should use the term
synchronous replication the way it's commonly used in the industry.
Inventing our own terminology might be fun but it's not really going
to result in less confusion.

regards, tom lane


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Aidan Van Dyk
Synchronous replication, "sync rep", is *not* interested in the slave's
visibility of the commit, because PostgreSQL doesn't serve requests
when in recovery (WAL receiving) mode *now*.

This sync rep patch/proposal/discussion is *strictly* (at this point;
hot standby may eventually, or hopefully soon, change that) the means to
get the data "safely in 2 separate places" before the COMMIT returns,
by means of WAL streaming.  That "safely in 2 places" can have various
implementation options (like received, on disk, or applied), and
Fujii-san explained some of the options as to what to consider "safe"
and their trade-offs in his presentation last year.

Once both sync-rep (the WAL-streaming "get changes in two places") and
hot-standby (run queries while WAL is being applied) are available in
PostgreSQL, at that point we might need to start worrying about other
client visibility, but even then, we still don't need to worry about
multi-master options...

a.


* Markus Wanner mar...@bluegap.ch [081213 12:17]:
 Hi,
 
 Simon Riggs wrote:
  On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:
  Speaking of a synchronous commit
  is utterly misleading, because the commit itself is exactly the thing
  that's *not* synchronous.
  
  Not really sure where you're going here.
 
 I'm pointing to a potential misunderstanding, trying to help to prevent
 you from running into the same issues and discussions as I did.
 
 I've learned the hard way, that the Postgres-R algorithm is not fully
 synchronous (in the strict sense). This caused confusion for people who
 take the word synchronous by its original meaning. The algorithm
 proposed here seems similar enough to potentially cause the same confusion.
 
 As I see it now, I think it's well worth pointing out the difference,
 from both the technical and the marketing perspective. The
 former for better understanding, the latter to prevent users from
 thinking it must be slow by definition. Arguing that your approach is
 not fully synchronous definitely helps in countering that concern.
 
 However, I'm just now realizing, that the difference is only relevant as
 soon as you begin to allow read-only access on the slave. AFAIK that's
 among the goals of this effort, no?
 
  synchronous replication is
  used exactly as described in the Wikipedia entry here:
  http://en.wikipedia.org/wiki/Database_replication
 
 That article describes pretty much all variants of replication, what
 exactly are you referring to?
 
 Under "Database Replication" > "Multi-Master replication" it describes
 eager vs lazy variants, which is IMO a more appropriate and useful
 distinction than sync vs async. (But that's admittedly a sentence I've
 contributed myself, IIRC).
 
 Under "Storage Replication" > "Synchronous Replication" one can read:
 "Write is not considered complete until acknowledgement by both local
 and remote storage." For the proposed approach this might hold true for
 WAL writing. However, the user certainly doesn't care how synchronously
 the log is shipped or written, as long as she doesn't see the
 changes on the slave.
 
 That's the difference between fully synchronous and eager (or virtually
 or approximately synchronous) algorithms. You seem to refer to both as
 synchronous. Phrases like synchronous commit or synchronous data
 transfer do not help me to understand what exactly you are talking about.
 
 Explaining that the slave commits (and therefore makes the transactions
 visible) asynchronously would help. And it would prevent disappointment
 for users who expect changes to be immediately visible on the slave.
 
  No two word phrase is going to accurately sum up the complexity and
  potential for data loss in these situations. DRBD saw that too and just
  called them A, B and C and then describe them more accurately.
 
 Agreed. I've chosen lazy, eager and sync, so far. I'm open for better
 terms, and I leave it up to you to call your variants whatever you like.
 But to understand what you are talking about, I'd prefer to get to know
 these distinctions crisp and clear.
 
  But I don't think we should say "PostgreSQL just implemented algorithm
  B", which is just unhelpful. I don't think it's marketing to refer to it
  by the phrase most commonly used for the technology we are building.
 
 I certainly agree to using such terms. Unfortunately, in my experience,
 synchronous replication is commonly used to mean that transactions are
 guaranteed to be immediately visible on remote nodes after the client
 got commit acknowledgment. That's the cause for confusion I'm envisioning.
 
 
 I'm hoping to be somewhat helpful to this effort of getting a log
 shipping replication variant into Postgres. It can only be beneficial
 for Postgres-R in that we gain field experience with ..uhm.. this
 special kind of replication, however we name it.
 
 I'm already on xmas vacation, so I won't bother you any further on this
 issue. Have fun coding and make sure to enjoy this time of the year.
 
 All the best.
 
 Markus Wanner
 

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Simon Riggs

On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote:

 Hot Standby (although the latter
 seems to have stalled a bit...)

It's just being worked on asynchronously. ;-)

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Hannu Krosing
On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote:
  I certainly agree to using such terms. Unfortunately, in my experience,
  synchronous replication is commonly used to mean that transactions are
  guaranteed to be immediately visible on remote nodes after the client
  got commit acknowledgment. That's the cause for confusion I'm envisioning.
 
 I think that's a very important point.  It's very possible that 8.4
 may support both this feature and Hot Standby (although the latter
 seems to have stalled a bit...).  That makes me think oh, great, I
 can offload any subset of my read-only queries to the standby.  Not
 so fast.
 
 I think we need to reserve the term synchronous replication for a
 system where transactions that begin at the same time on the primary
 and standby see the same tuples.

Define "same time".

You can have a variant of sync rep + hot standby where the master does
not return "committed" before the slave has both synced the data and
replayed the transaction so that it is visible on the slave, but in that
case you may have a use case where it is actually visible on the slave
_before_ it is visible on the master.

Actually, you can't have that "same time" guarantee even on a single
system. That is, if you start two transactions on two connections at the
same time, you still can't be sure there is no third transaction which
has committed between those two and which makes the visible data in
those two different.


  Clearly that is more synchronous
 than what is being proposed here; if we call this synchronous
 replication, what will we call that?  Really Synchronous, Honest, No
 Kidding?   Admittedly, we may never implement that feature, but that
 seems irrelevant.
 
 It would be useful to have names for all the different possibilities.
  Random ideas:
 
 Log Shipping.  After each log switch, the previous WAL log is copied
 to the standby in its entirety.
 
 WAL Streaming - Asynchronous.  The WAL log is streamed from master to
 standby as it is written, but transactions on the master never wait.
 
 WAL Streaming - Synchronous Receive.  The WAL log is streamed from
 master to standby as it is written, and transactions on the master
 wait until the standby acknowledges receipt of the WAL.
 
 WAL Streaming - Synchronous Write.  The WAL log is streamed from
 master to standby as it is written, and transactions on the master
 wait until the standby acknowledges that the WAL has been written to
 disk.
 
 WAL Streaming - Synchronous Apply.  The WAL log is streamed from
 master to standby as it is written, and transactions on the master
 wait until the standby acknowledges that WAL has been written to disk
 and applied.

We could still call Sync Rep, as a feature, "synchronous replication" on
the basis that "WAL Streaming - Synchronous Write" is the highest security
level achievable using the feature.

And maybe have Sync Hot Standby as a feature on top of that which
provides "WAL Streaming - Synchronous Apply".
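
The proposed names can be read as an ordered set of "what does the master
wait for before acknowledging COMMIT" levels. The following Python sketch
is only an illustration of that reading; the enum, its member names and
the function are hypothetical and do not correspond to anything in the
patch.

```python
from enum import Enum


class WaitLevel(Enum):
    """What the master waits for before acknowledging COMMIT (illustrative)."""
    LOG_SHIPPING = 0   # whole WAL file copied after each log switch
    ASYNC = 1          # WAL streamed as written, master never waits
    SYNC_RECEIVE = 2   # wait until the standby acknowledges receipt
    SYNC_WRITE = 3     # wait until the standby has written WAL to disk
    SYNC_APPLY = 4     # wait until the standby has written and applied WAL


def commit_may_return(level: WaitLevel, received: bool,
                      written: bool, applied: bool) -> bool:
    """True if the master may acknowledge COMMIT given the standby's state."""
    if level in (WaitLevel.LOG_SHIPPING, WaitLevel.ASYNC):
        return True                      # no waiting at these levels
    if level is WaitLevel.SYNC_RECEIVE:
        return received
    if level is WaitLevel.SYNC_WRITE:
        return received and written
    return received and written and applied   # SYNC_APPLY


if __name__ == "__main__":
    # Standby has the WAL on disk but has not replayed it yet.
    for level in WaitLevel:
        print(level.name, commit_may_return(level, True, True, False))
```

In this reading, only the last level would let a client that reads from
the standby right after the acknowledgment expect to see its own commit.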



--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Hannu Krosing
On Sat, 2008-12-13 at 21:35 +0200, Hannu Krosing wrote:

 We could still call Sync Rep, as a feature, "synchronous replication" on
 the basis that "WAL Streaming - Synchronous Write" is the highest security
 level achievable using the feature.
 
 And maybe have Sync Hot Standby as a feature on top of that which
 provides "WAL Streaming - Synchronous Apply".

Or maybe better call it Serializable Hot Standby, as the actual
guarantee that can be achieved is that when one client does something on
master and after committing on master starts another transaction on
slave, then the effects of query on master are visible on slave.


-- 
--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Tom Lane wrote:
 We won't call it anything, because we never will or can implement that.
 See the theory of relativity: the notion of exactly simultaneous events
 at distinct locations isn't even well-defined

That has never been the point of the discussion. It's rather about the
question if changes from transactions are guaranteed to be visible on
remote nodes immediately after commit acknowledgment. Whether or not
this is guaranteed, in both cases the term synchronous replication is
commonly used, which is causing confusion.

Regards

Markus Wanner




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Simon Riggs wrote:
 Hot Standby (although the latter
 seems to have stalled a bit...)
 
 It's just being worked on asynchronously. ;-)

LOL, thanks for bringing humor into this discussion :-)

Regards

Markus Wanner



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Hannu Krosing wrote:
 You can have a variant of sync rep + hot standby where the master does
 not return "committed" before the slave has both synced the data and
 replayed the transaction so that it is visible on the slave, but in that
 case you may have a use case where it is actually visible on the slave
 _before_ it is visible on the master.

As long as it's not visible *before* the client requests a COMMIT, that
certainly doesn't matter (because the application cannot check that).

What matters is, that an application might expect a node to show the
changes of a transaction which has previously (seen from the application
itself) been committed and acknowledged by another node.

AFAICT the common understanding of synchronous replication is, that all
nodes confirm to have committed the changes of a transaction *before*
acknowledging COMMIT to the application (and obviously only *after* the
application requested to COMMIT the transaction, so the guarantee is
that all nodes commit *sometime* within that time frame, which is
certainly possible to guarantee, see 2PC approaches).

This guarantee is not provided by the Postgres-R algorithm, nor by the
approach presented. Both only guarantee, that the transaction *will* get
committed (and thus get visible) on all nodes *sometime* *after* the
application requested to commit it (even in case of various failures,
that is) [1]. As cited before, that has been enough of a reason for Jan
Wieck to call Postgres-R asynchronous, and I certainly see his point.

Note that the amount of time that passes between the commit
acknowledgment and the actual commit on remote nodes may theoretically
be infinitely long. And in practice certainly long enough for an
application to notice the difference. However, it still is a practical
optimization, because most applications should cope with it just fine.
But not all...
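
The gap described here, between the commit acknowledgment and the commit
becoming visible on a remote node, can be shown with a toy model. This is
a minimal Python sketch under the assumption that the master acknowledges
as soon as the standby has written the WAL; the `Master` and `Standby`
classes and their methods are hypothetical and are not the patch's code.

```python
class Standby:
    """Toy standby: WAL is written first and applied some time later."""

    def __init__(self):
        self.written_lsn = 0
        self.applied_lsn = 0
        self.pending = []        # (lsn, key, value) written but not applied
        self.data = {}           # what queries on the standby can see

    def receive_and_write(self, lsn, key, value):
        self.pending.append((lsn, key, value))
        self.written_lsn = lsn
        return lsn               # acknowledgment sent back to the master

    def apply_pending(self):
        for lsn, key, value in self.pending:
            self.data[key] = value
            self.applied_lsn = lsn
        self.pending.clear()

    def read(self, key):
        return self.data.get(key)


class Master:
    """Toy master: COMMIT returns once the standby has written the WAL."""

    def __init__(self, standby):
        self.standby = standby
        self.data = {}
        self.lsn = 0

    def commit(self, key, value):
        self.lsn += 1
        self.data[key] = value   # committed and visible locally
        acked = self.standby.receive_and_write(self.lsn, key, value)
        assert acked == self.lsn  # wait for the write ack, not for apply
        return self.lsn


if __name__ == "__main__":
    standby = Standby()
    master = Master(standby)
    master.commit("x", 1)
    print("standby sees x right after ack:", standby.read("x"))
    standby.apply_pending()      # may happen arbitrarily later
    print("standby sees x after apply:", standby.read("x"))
```

The change is durable in two places when `commit` returns, yet a reader
on the standby cannot see it until `apply_pending` has run.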

Do you consider the proposed log shipping approach to be synchronous?
How about the Postgres-R algorithm?

Regards

Markus Wanner

[1]: of course these approaches also guarantee that the transaction is
committed on the local node *before* acknowledging commit, so that
subsequent (seen from the application) queries are guaranteed to see the
changes. But that guarantee only holds true for the local node.



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Aidan Van Dyk
* Markus Wanner mar...@bluegap.ch [081213 16:33]:
 Hi,
 
 Hannu Krosing wrote:
  You can have a variant of sync rep + hot standby where the master does
  not return "committed" before the slave has both synced the data and
  replayed the transaction so that it is visible on the slave, but in that
  case you may have a use case where it is actually visible on the slave
  _before_ it is visible on the master.
 
 As long as it's not visible *before* the client requests a COMMIT, that
 certainly doesn't matter (because the application cannot check that).

Well, I think the PG MVCC (which wal-streaming just ships across
somewhere else) will save that.  So with hot-standby you could have
another client see the result *after* the COMMIT has been
requested, but *before* the COMMIT returns...  But we have this
situation in a single current PG instance anyways, so it's nothing
new.

But with hot-standby, I could also see that it could be done such that
the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but
because of a current running query, application of it is delayed...  But
this is hot-standby's problem of describing itself, not sync-rep.

IMHO, sync-rep is about getting the change durably to a slave before
acknowledging the COMMIT.  That slave could be any number of things:
- A WAL archive type system having the ability to be used for
  recovery
- A PG with special recovery mode that reads the stream and applies it
- A full hot-standby recovery

I could see any and all of those (and probably others) being useful and
used.

But the current patch focuses on the streaming (sending), and
a receiver recovery mode that can accept/apply them, again,
without worrying about actually running queries (yet) ...

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Mark Mielke

Markus Wanner wrote:

Tom Lane wrote:
  

We won't call it anything, because we never will or can implement that.
See the theory of relativity: the notion of exactly simultaneous events
at distinct locations isn't even well-defined



That has never been the point of the discussion. It's rather about the
question if changes from transactions are guaranteed to be visible on
remote nodes immediately after commit acknowledgment. Whether or not
this is guaranteed, in both cases the term synchronous replication is
commonly used, which is causing confusion.
  


Might it not be true that anybody unfamiliar would be confused and that 
this is a bit of a straw man?


I don't think synchronous replication guarantees that it will be 
immediately visible. Even if it did push the change to the other 
machine, and the other machine had committed it, that doesn't guarantee 
that any reader sees it any more than if I commit to the same machine 
(no replication), I am guaranteed to see the change from another 
session. Synchronous replication only means that I can be assured that 
my change has been saved permanently by the time my commit completes. It 
doesn't mean anybody else can see my change or is guaranteed to see my
change if they query from another session.


If my application assumes that it can commit to one server, and then 
read back the commit from another server, and my application breaks as a 
result, it's because I didn't understand the problem. Even if PostgreSQL 
didn't use the word synchronous replication, I could still be 
confused. I need to understand the problem no matter what words are used.


Cheers,
mark

--
Mark Mielke m...@mielke.cc



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Aidan Van Dyk wrote:
 Well, I think the PG MVCC (which wal-streaming just ships across
 somewhere else) will save that.  So with hot-standby you could have
 another client see the result *after* the COMMIT has been
 requested, but *before* the COMMIT returns...  But we have this
 situation in a single current PG instance anyways, so it's nothing
 new.

AFAIU the proposed algorithm only waits until WAL is written on the
slave before acknowledging COMMIT. Application of the changes may be
deferred, so it's not necessarily immediately visible on the slave.

 But with hot-standby, I could also see that it could be done such that
 the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but
 because of a current running query, application of it is delayed...  But
 this is hot-standby's problem of describing itself, not sync-rep.

I'm thinking of the overall system and don't care much if it's
hot-standby's or sync-rep's problem. But it's certainly the master which
needs to await certain acknowledgments of the slaves. That has so far
been discussed within this sync-rep thread.

 IMHO, sync-rep is about getting the change durably to a slave before
 acknowledging the COMMIT.  That slave could be any number of things:
 - A WAL archive type system having the ability to be used for
   recovery
 - A PG with special recovery mode that reads the stream and applies it
 - A full hot-standby recovery
 
 I could see any and all of those (and probably others) being useful and
 used.

I certainly agree to that.

Regards

Markus Wanner



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner
Hi,

Mark Mielke wrote:
 Might it not be true that anybody unfamiliar would be confused and that
 this is a bit of a straw man?

Might be. I've neglected the issue myself for a while.

 I don't think synchronous replication guarantees that it will be
 immediately visible. Even if it did push the change to the other
 machine, and the other machine had committed it, that doesn't guarantee
 that any reader sees it any more than if I commit to the same machine
 (no replication), I am guaranteed to see the change from another
 session.

AFAIK every snapshot taken after a transaction has acknowledged its
commit is guaranteed to see changes from that transaction. Isn't that a
pretty frequent and obvious user expectation?

 Synchronous replication only means that I can be assured that
 my change has been saved permanently by the time my commit completes. It
 doesn't mean anybody else can see my change or is guaranteed to see my
 change if they query from another session.

So you wouldn't be surprised if a transaction from two hours ago isn't
visible on another node, just because that node happens to be rather
busy with lots of other readers and maintenance tasks?

 If my application assumes that it can commit to one server, and then
 read back the commit from another server, and my application breaks as a
 result, it's because I didn't understand the problem.

Well, yeah, depends on user expectations. I'm surprised to hear that you
have that understanding of synchronous replication.

 Even if PostgreSQL
 didn't use the word synchronous replication, I could still be
 confused. I need to understand the problem no matter what words are used.

As said, it depends on what the common understanding of synchronous
replication is. I've so far been under the impression, that these
potential lags are unexpected and confusing. Several people pointed me
at that problem and I've thus relabeled Postgres-R as not being
synchronous. I'm at least surprised to suddenly get pushed into the
other direction. :-)

However, I absolutely agree that it's not that important how we name it.
What is important, is that users and developers understand the difference.

Regards

Markus Wanner




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Mark Mielke

Markus Wanner wrote:

I don't think synchronous replication guarantees that it will be
immediately visible. Even if it did push the change to the other
machine, and the other machine had committed it, that doesn't guarantee
that any reader sees it any more than if I commit to the same machine
(no replication), I am guaranteed to see the change from another
session.



AFAIK every snapshot taken after a transaction has acknowledged its
commit is guaranteed to see changes from that transaction. Isn't that a
pretty frequent and obvious user expectation?
  


Yes - but that's only really true while the session continues. From 
another session? I've never assumed that I could reconnect and be 
guaranteed to get the latest snapshot that includes absolutely 
everything that has been committed.


Any system that guaranteed this even when involving multiple machines 
would be guaranteed to be inefficient and difficult to scale in my 
opinion. How could any system promise to have reasonable commit times 
while also guaranteeing that once a commit completes, any session to any 
other server will be able to see the commit? I think this forces some 
sort of serialization between multiple machines and defeats the purpose 
of having multiple machines. Where before it was indeterminate to know
when the commit would take effect at each replica, it's now
indeterminate when my commit will succeed. That is, my commit cannot
succeed until every single server acknowledges that it has fully
received and committed my transaction. What happens if there are network
problems, or what happens if I am replicating over a slower link? What 
if I am committing to 100 servers? Is it reasonable to expect 100 server 
negotiations to complete in full before my own commit will return?
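
The scaling concern can be made concrete with simple arithmetic. This is
a rough Python sketch, assuming that acknowledgments from all replicas are
collected in parallel after the local commit work, so the slowest replica
sets the commit time; the function name and the latency figures are
hypothetical.

```python
def commit_latency_ms(local_ms, replica_rtts_ms, wait_for_all=True):
    """Commit latency when the master waits for replica acknowledgments.

    Assumes acks are gathered in parallel after the local commit work:
    waiting for all replicas costs the slowest round trip, waiting for
    any single replica costs the fastest one.
    """
    if not replica_rtts_ms:
        return local_ms
    remote = max(replica_rtts_ms) if wait_for_all else min(replica_rtts_ms)
    return local_ms + remote


if __name__ == "__main__":
    # 99 nearby replicas at 2 ms plus one on a slow link at 250 ms.
    rtts = [2.0] * 99 + [250.0]
    print("wait for all:", commit_latency_ms(1.0, rtts), "ms")
    print("wait for one:", commit_latency_ms(1.0, rtts, wait_for_all=False), "ms")
```

With these made-up numbers a single slow link dominates every commit as
soon as all 100 servers must confirm before the commit returns.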



Synchronous replication only means that I can be assured that
my change has been saved permanently by the time my commit completes. It
doesn't mean anybody else can see my change or is guaranteed to see my
change if they query from another session.


So you wouldn't be surprised if a transaction from two hours ago isn't
visible on another node, just because that node happens to be rather
busy with lots of other readers and maintenance tasks?
  


Any system that is two hours behind should fall out of the pool used to 
satisfy reads from. So, if there was a surprise, it would be this. I 
don't believe ACID requires that a commit on one server is immediately 
visible on another server. Any work I do on the behind server would 
still be safe from a transaction and referential integrity perspective. 
However, if I executed 'commit' on this behind server, I would expect
the commit to wait until it catches up, or in the case of a server 2
hours behind, I would expect the commit to fail. Look at the alternative - all
commits to any server in the pool would be locked up waiting for this
one machine to catch up on 2 hours of transactions. This emphasizes that
the problem is that a server two hours out of date is still in the pool,
rather than the problem being keeping things up-to-date.
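
Dropping a lagging server from the read pool might look roughly like the
following Python sketch. The `Replica` type, the `readable_pool` function
and the 5 second threshold are hypothetical; how lag would actually be
measured is left open here.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_seconds: float   # how far behind the master this replica is


def readable_pool(replicas, max_lag_seconds):
    """Keep only replicas fresh enough to serve reads."""
    return [r for r in replicas if r.lag_seconds <= max_lag_seconds]


if __name__ == "__main__":
    pool = [
        Replica("a", 0.2),
        Replica("b", 7200.0),   # the server that is two hours behind
        Replica("c", 3.0),
    ]
    # Assumed policy: anything more than 5 seconds behind serves no reads.
    print([r.name for r in readable_pool(pool, 5.0)])
```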




If my application assumes that it can commit to one server, and then
read back the commit from another server, and my application breaks as a
result, it's because I didn't understand the problem.


Well, yeah, depends on user expectations. I'm surprised to hear that you
have that understanding of synchronous replication.
  


I've seen people face it in the past. Most recently we had a 
presentation from the developer of digg.com, and he described how he had 
this problem with MySQL and that he had to work around it.


On a smaller scale and slightly unrelated, I had this problem frequently 
between memcache and PostgreSQL. That is, memcache would always be 
latest, but PostgreSQL might not be latest, because the commit had not 
occurred.


It seems like a standard enough problem to me. I don't expect Postgres-R 
to do the impossible. As with my previous paragraph, I don't expect 
Postgres-R to wait 2-hours to commit just because one server is falling 
behind.



Even if PostgreSQL
didn't use the word synchronous replication, I could still be
confused. I need to understand the problem no matter what words are used.



As said, it depends on what the common understanding of synchronous
replication is. I've so far been under the impression, that these
potential lags are unexpected and confusing. Several people pointed me
at that problem and I've thus relabeled Postgres-R as not being
synchronous. I'm at least surprised to suddenly get pushed into the
other direction. :-)

However, I absolutely agree that it's not that important how we name it.
What is important, is that users and developers understand the difference


I agree they are unexpected and confusing. I don't agree that they are 
unexpected or confusing to those knowledgeable in the domain. So, the 
question becomes - whose expectation is wrong? Should the user learn 
more? Or should we push for a change in 

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas
On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think we need to reserve the term synchronous replication for a
 system where transactions that begin at the same time on the primary
 and standby see the same tuples.  Clearly that is more synchronous

 We won't call it anything, because we never will or can implement that.
 See the theory of relativity: the notion of exactly simultaneous events

OK, fine.  I'll be more precise.  I think we need to reserve the term
synchronous replication for a system where transactions that begin
on the standby after the transaction has committed on the master see
the effects of the committed transaction.

 at distinct locations isn't even well-defined, because observers at yet
 other locations will disagree about what is simultaneous.  And I'm
 not just making a joke here --- speed-of-light delays in a WAN are
 meaningful compared to current computer speeds.  In practice, the
 slave and the master will never commit at exactly the same time.

 I agree with the point made upthread that we should use the term
 synchronous replication the way it's commonly used in the industry.
 Inventing our own terminology might be fun but it's not really going
 to result in less confusion.

I just googled synchronous replication and read through the first
page of hits.  Most of them do not address the question of whether
synchronous replication can be said to have been completed when WAL has
been received by the standby but not yet applied.  One of the ones
that does is:

http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign

...which refers to what we're proposing to call Synchronous
Replication as Semi-Synchronous Replication (or 2-safe replication)
specifically to distinguish it.  The other is:

http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf

...which doesn't specifically examine the issue but seems to take the
opposite position, namely that the server on which the transaction is
executed needs to wait only for one server to apply the changes to the
database (the others need only to know that they need to commit it;
they don't actually need to have done it).  However, that same paper
refers to two-phase commit as a synchronous replication algorithm, and
Wikipedia's discussion of two-phase commit:

http://en.wikipedia.org/wiki/Two-phase_commit_protocol

...clearly implies that the transaction must be applied everywhere
before it can be said to have committed.

The second page of Google results is mostly a further discussion of
the MySQL solution, which is mostly described as semi-synchronous
replication.

Simon Riggs said upthread that Oracle called this synchronous redo
transport.  That is obviously much closer to what we are doing than
synchronous replication.

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Jeff Davis
On Sat, 2008-12-13 at 21:35 -0500, Robert Haas wrote:
 On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  I think we need to reserve the term synchronous replication for a
  system where transactions that begin at the same time on the primary
  and standby see the same tuples.  Clearly that is more synchronous
 
  We won't call it anything, because we never will or can implement that.
  See the theory of relativity: the notion of exactly simultaneous events
 
 OK, fine.  I'll be more precise.  I think we need to reserve the term
 synchronous replication for a system where transactions that begin
 on the standby after the transaction has committed on the master see
 the effects of the committed transaction.
 

If it's guaranteed to be visible on the standby after it's committed on
the master, and you don't have any way to make it actually simultaneous,
then that implies that it's visible on the slave for some brief period
of time before it's committed on the master.

That situation is still asymmetric, so why is that a better use of the
term synchronous?

Regards,
Jeff Davis






Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas
 If it's guaranteed to be visible on the standby after it's committed on
 the master, and you don't have any way to make it actually simultaneous,
 then that implies that it's visible on the slave for some brief period
 of time before it's committed on the master.

 That situation is still asymmetric, so why is that a better use of the
 term synchronous?

Because that happens anyway.  If I request a commit on a single,
unreplicated server, the server makes the commit visible to new
transactions and then sends me a message informing me that the commit
has completed.  Since the message takes some finite time to reach me,
there is a window of time after the commit has completed and before I
know that the commit has been completed.

Suppose for the sake of argument that the single, unreplicated server
did these two tasks in the opposite order - namely, first, it sent a
message to the process requesting the commit stating that the commit
had completed, and only then made the transaction visible.  This would
create a race condition: the process requesting the commit might
receive the commit and begin a new transaction before the previous
transaction had been made visible, and would therefore not be able to
see the results of its own previous actions.  I think it's fair to say
that this behavior would be judged totally intolerable.

Therefore, there can't possibly be any applications out there which
are depending on the fact that commits don't become visible until they
are acknowledged, but there very well could be some applications which
depend on the fact that one commits are acknowledged, they are
visible.  If replication is synchronous in this sense, then I can open
a connection to the master, write some data, close the connection,
open a new connection to the master or the slave (not caring which),
and read back the data that I just wrote (assuming no one else has
modified it in the meantime).  If it isn't, then I can't.  Some
people will not care about this, but some will.

The point here is that synchronous replication, at least to some
people, is going to imply that the user-visible states of the two
copies are consistent.  To other people, it is going to imply that
committed transactions will never be lost even in the event of a
catastrophic loss of the primary 1 picosecond after the commit is
acknowledged.  We need to choose some word that implies that we are
guaranteeing the latter of these two things but not the former.
Otherwise, we will have confused users, and terminological confusion
when and if we ever implement the former as well.

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas
 Might it not be true that anybody unfamiliar would be confused and that this
 is a bit of a straw man?
[...]
 If my application assumes that it can commit to one server, and then read
 back the commit from another server, and my application breaks as a result,
 it's because I didn't understand the problem. Even if PostgreSQL didn't use
 the word synchronous replication, I could still be confused. I need to
 understand the problem no matter what words are used.

That is certainly true.  But there is value in choosing words which
elucidate the situation as much as possible.

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Jeff Davis
On Sat, 2008-12-13 at 22:23 -0500, Robert Haas wrote:
  If it's guaranteed to be visible on the standby after it's committed on
  the master, and you don't have any way to make it actually simultaneous,
  then that implies that it's visible on the slave for some brief period
  of time before it's committed on the master.
 
  That situation is still asymmetric, so why is that a better use of the
  term synchronous?
 
 Because that happens anyway.  If I request a commit on a single,
 unreplicated server, the server makes the commit visible to new
 transactions and then sends me a message informing me that the commit
 has completed.  Since the message takes some finite time to reach me,
 there is a window of time after the commit has completed and before I
 know that the commit has been completed.
 

Oh, I see the distinction now.

Thanks for the detailed reply.

Regards,
Jeff Davis




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Tatsuo Ishii
 The point here is that synchronous replication, at least to some
 people, is going to imply that the user-visible states of the two
 copies are consistent.  To other people, it is going to imply that
 committed transactions will never be lost even in the event of a
 catastrophic loss of the primary 1 picosecond after the commit is
 acknowledged.  We need to choose some word that implies that we are
 guaranteeing the latter of these two things but not the former.
 Otherwise, we will have confused users, and terminological confusion
 when and if we ever implement the former as well.

Right. Before watching this thread, I had thought that log shipping
sync replication behaves like the former (and I had said so to people
in Japan who are interested in 8.4 development. Of course this is my
fault, though).

Now I understand that log shipping sync replication does not behave
the same as other sync replications such as pgpool and PGCluster (there
may be more, but I don't know).
--
Tatsuo Ishii
SRA OSS, Inc. Japan



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas
 The point here is that synchronous replication, at least to some
 people, is going to imply that the user-visible states of the two
 copies are consistent.  To other people, it is going to imply that
 committed transactions will never be lost even in the event of a
 catastrophic loss of the primary 1 picosecond after the commit is
 acknowledged.  We need to choose some word that implies that we are
 guaranteeing the latter of these two things but not the former.
 Otherwise, we will have confused users, and terminological confusion
 when and if we ever implement the former as well.

With apologies for replying to my own post:

It's also important to understand that these two invariants are
completely separate and it is possible to guarantee either without the
other.  If you want (1), the standby needs to apply the WAL before
sending an acknowledgment to the primary but does not necessarily need
to write it to disk (of course, it will have to be written to disk
before the modified buffers are written to disk, but that's a separate
issue).  If you want (2), the standby needs to write the WAL to disk
before sending the acknowledgment but does not necessarily need to
apply it.  If you want both, then, you need to wait for both (and it's
worth noting that your performance will probably be nothing to write
home about).
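
As a rough illustration of why the two invariants are independent, here is a toy model of a standby. It is a sketch under assumptions: ToyStandby, AckPolicy and crash_and_recover are invented names, a crash is assumed to lose everything except flushed WAL, and data pages (and the rule that WAL reaches disk before modified buffers) are not modelled at all.

```python
from dataclasses import dataclass, field
from enum import Enum


class AckPolicy(Enum):
    AFTER_APPLY = 1  # invariant (1): acknowledged implies visible on standby
    AFTER_FLUSH = 2  # invariant (2): acknowledged implies durable on standby
    AFTER_BOTH = 3


@dataclass
class ToyStandby:
    policy: AckPolicy
    received: list = field(default_factory=list)  # WAL records, in order
    flushed: int = 0                              # how many are on disk
    applied: int = 0                              # how many are replayed
    visible: dict = field(default_factory=dict)   # what queries can see

    def receive(self, key, value):
        """Take one WAL record; returning True is the acknowledgement."""
        self.received.append((key, value))
        if self.policy in (AckPolicy.AFTER_FLUSH, AckPolicy.AFTER_BOTH):
            self.flushed = len(self.received)
        if self.policy in (AckPolicy.AFTER_APPLY, AckPolicy.AFTER_BOTH):
            self.replay()
        return True

    def replay(self):
        """Apply every received record that has not been applied yet."""
        for key, value in self.received[self.applied:]:
            self.visible[key] = value
        self.applied = len(self.received)

    def crash_and_recover(self):
        """Keep only flushed WAL, then rebuild visible state from it."""
        self.received = self.received[:self.flushed]
        self.visible = {}
        self.applied = 0
        self.replay()


def run(policy):
    """Return what a reader sees right after the ack and after a crash."""
    standby = ToyStandby(policy)
    standby.receive("x", 1)
    seen_after_ack = standby.visible.get("x")
    standby.crash_and_recover()
    return seen_after_ack, standby.visible.get("x")
```

In this model run(AckPolicy.AFTER_APPLY) gives (1, None): the write is readable on the standby at once but does not survive a standby crash. run(AckPolicy.AFTER_FLUSH) gives (None, 1): the write is durable but not yet visible when the acknowledgement is sent. Only AFTER_BOTH gives (1, 1).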

I also did some research on terminology that has been used in the
literature.  As Jim Gray describes it:

1-safe replication.  Transaction is committed when it has been locally
WAL-logged to durable storage.
Group-safe replication.  Transaction is committed when WAL has been
received by all remote servers, but not necessarily written to durable
storage.
Group-safe & 1-safe replication.  Transaction is committed when it has
been locally WAL-logged to durable storage and WAL has been received
by all remote servers.
2-safe replication.  Transaction is committed when it has been written
to durable storage on both local and remote servers.
Very safe replication.  As 2-safe, but fails any read-write
transaction if the secondary is down.
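
The five levels can be restated as a small decision function. This is only a sketch of the list above: SafetyLevel, Outcome and commit_outcome are invented names, "all remote servers" is collapsed into a single remote, and what the levels other than very safe do while the secondary is down is not stated in the list, so it is left unmodelled (they simply keep waiting here).

```python
from enum import Enum


class SafetyLevel(Enum):
    ONE_SAFE = 1
    GROUP_SAFE = 2
    GROUP_AND_ONE_SAFE = 3
    TWO_SAFE = 4
    VERY_SAFE = 5


class Outcome(Enum):
    COMMIT = "commit"
    WAIT = "wait"
    FAIL = "fail"


def commit_outcome(level, local_durable, remote_received, remote_durable,
                   secondary_up=True):
    """Decide whether a transaction counts as committed at this level."""
    if level is SafetyLevel.VERY_SAFE and not secondary_up:
        return Outcome.FAIL  # very safe refuses read-write work outright
    if level is SafetyLevel.ONE_SAFE:
        ok = local_durable
    elif level is SafetyLevel.GROUP_SAFE:
        ok = remote_received
    elif level is SafetyLevel.GROUP_AND_ONE_SAFE:
        ok = local_durable and remote_received
    else:  # TWO_SAFE and VERY_SAFE share the same durability condition
        ok = local_durable and remote_durable
    return Outcome.COMMIT if ok else Outcome.WAIT
```

For example, a record that is durable locally and received remotely, but not yet flushed remotely, is committed under the first three levels and still waiting under 2-safe.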

(Actually, it appears that "Transaction Processing" (Jim Gray and
Andreas Reuter, 1993) uses 2-safe to refer to either 2-safe or
group-safe; the distinction between the two is a subsequent
development. See e.g. "Advances in Database Technology-EDBT 2004"
by Elisa Bertino.)

The term of art for making sure that transactions committed on the
primary are visible on the secondary seems to be "one-copy
serializability" (see, for example, a Google Books search on that
term).

...Robert



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-12 Thread Fujii Masao
Hi,

On Fri, Dec 12, 2008 at 1:34 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 * Fujii Masao masao.fu...@gmail.com [081211 23:00]:
 Hi,

   Or, should I
 create the feature for the user to confirm whether it's in synch rep via 
 SQL?

 I don't need a way to check via SQL, but I'd love a postgresql.conf
 option that when set would make sure that all connections pretty much
 just hang until a slave has connected and everything is set up for sync
 rep.  I think I saw that you're using normal connection setup to start
 the WAL streaming to the slave, so you have to allow connections, but
 I'd really not want any of my pg-clients able to do anything if
 sync-rep isn't happening...

How about blocking client requests / connections in front of
postgres (e.g. in connection pooling software)? Alternatively, we could
first develop a feature like Oracle's OFFLINE, separately from Synch Rep.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-12 Thread Simon Riggs

On Fri, 2008-12-12 at 12:53 +0900, Fujii Masao wrote:
 
  Quite possibly a terminology problem. In my case I said sync rep
  meaning the mode such that the transaction doesn't commit successfully
  for my PG client until the xlog record has been streamed to the
  client... and I understand that at his presentation at PGCon, Fujii-san
  said there could be possible variants on when the streaming is considered
  done based on network, slave ram, disk, application, etc.
 
 I'd like to define the meaning of synch rep again. synch rep means:
 
 (1) Transaction commit waits for WAL records to be replicated to the standby
   before the command returns a success indication to the client.

 (2) The standby has (can read) all WAL files indispensable for recovery.

I would change "can read" in (2) to "has access to". "Can read" implies
we have read all files and checked CRCs of individual records.


The crux of this is what we mean by synchronous_replication = on.
There are two possible meanings:

1. Commit will wait only if streaming is available and all necessary
startup conditions have been met.
This provides Highest Availability.

2. Commit will wait *until* full sync rep is available. So we don't
allow commits until the standby is available, and also don't allow them
if the standby goes down.
This provides Highest Transaction Durability, though it is fairly
fragile. Other systems recommend use of multiple standby nodes if this
option is selected.

Perhaps we should add this as a third option to synchronous_replication,
so we have off, on, or only.

So far I realise I've been talking exclusively about (1). In that mode
synchronous_replication = on would wait for streaming to complete even
if the last WAL file is not fully transferred.

For (2) we need a full interlock. Given that we don't currently support
multiple streamed standby servers, there seems to be not much point in
implementing the interlock that (2) would require. Should we leave that
part for 8.5, or do it now?
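
To make the proposed off / on / only distinction concrete, here is a small sketch of the decision a committing backend would face. It is an illustration only: the "only" value is just the proposal above, and SyncRep, commit_action and the returned strings are invented names rather than existing PostgreSQL settings or code.

```python
from enum import Enum


class SyncRep(Enum):
    OFF = "off"
    ON = "on"      # meaning (1): highest availability
    ONLY = "only"  # meaning (2): hypothetical third value proposed above


def commit_action(mode, standby_streaming):
    """Say what a commit does for a given mode and standby state."""
    if mode is SyncRep.OFF:
        return "commit_without_waiting"
    if standby_streaming:
        return "wait_for_standby"
    if mode is SyncRep.ON:
        # No standby: carry on alone so the primary stays available.
        return "commit_without_waiting"
    # ONLY: hold the commit until a standby is streaming again.
    return "block_until_standby_is_available"
```

The two modes differ only when the standby is not streaming: "on" trades durability for availability, while "only" does the reverse, which is why it needs the full interlock.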

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



