Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao

Hi,

I fixed some bugs.

On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs  wrote:
>
> Can we change to IMMEDIATE when we need the checkpoint?

Perhaps yes, though the current patch doesn't handle it.
I'm not sure whether we really need the feature. And yes, as you say,
I'd also like to hear from everybody else.

>
> What is bkpCount for?

So far, the name of a backup history file consists only of the
checkpoint redo location. But with this patch, since several
backups can use the same checkpoint, a backup history file
could be overwritten. So I introduced bkpCount as an ID that
distinguishes backups which use the same checkpoint.
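
A minimal sketch of why a per-checkpoint backup ID avoids the overwrite. The function name and the file name layout below are assumptions made for illustration; the patch's actual naming code is not shown in this message.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch: build a backup history file name from the checkpoint
 * redo location (two 32-bit halves) plus a per-checkpoint backup ID.
 * The "%08X%08X.%08X.backup" layout is an assumption for illustration only.
 * Returns the name length, or -1 if the buffer is too small.
 */
int
backup_history_name(char *buf, size_t buflen,
                    uint32_t redo_hi, uint32_t redo_lo, uint32_t bkp_count)
{
    int n;

    assert(buf != NULL && buflen > 0);
    n = snprintf(buf, buflen,
                 "%08" PRIX32 "%08" PRIX32 ".%08" PRIX32 ".backup",
                 redo_hi, redo_lo, bkp_count);

    /* snprintf reports the length it wanted; treat truncation as an error */
    return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}
```

With this layout, two backups that start from the same redo location but carry IDs 0 and 1 get different file names, so the second no longer overwrites the first.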

> I think we should discuss whatever that is for
> separately. It isn't used in any if test, AFAICS.

Yes, this patch is a testbed. We need to discuss it more.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 18:13:45 -
***************
*** 295,300 ****
--- 295,302 ----
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***************
*** 318,323 ****
--- 320,332 ----
  static XLogCtlData *XLogCtl = NULL;
  
  /*
+  * We don't allow more than MAX_BKP_COUNT backups to use the same checkpoint.
+  * If XLogCtl->bkpCount > MAX_BKP_COUNT, we force new checkpoint at pg_standby
+  * even if there are all indispensable full pages since last checkpoint.
+  */
+ #define MAX_BKP_COUNT 256
+ 
+ /*
   * We maintain an image of pg_control in shared memory.
   */
  static ControlFileData *ControlFile = NULL;
***************
*** 6025,6036 ****
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
--- 6034,6050 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* 
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use volati

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Simon Riggs


On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote:
> Hi,
> 
> On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao  wrote:
> > Hi,
> >
> > On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs  wrote:
> >> Yes, OK. So I think it would only work when full_page_writes = on, and
> >> has been on since last checkpoint. So two changes:
> >>
> >> * We just need a boolean that starts at true every checkpoint and gets
> >> set to false anytime someone resets full_page_writes or archive_command.
> >> If the flag is set && full_page_writes = on then we skip the checkpoint
> >> entirely and use the value from the last checkpoint.
> >
> > Sounds good.
> 
> I attached the self-contained patch to skip checkpoint at pg_start_backup.

Good.

Can we change to IMMEDIATE when we need the checkpoint?

What is bkpCount for? I think we should discuss whatever that is for
separately. It isn't used in any if test, AFAICS.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao

Hi,

On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao  wrote:
> Hi,
>
> On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs  wrote:
>> Yes, OK. So I think it would only work when full_page_writes = on, and
>> has been on since last checkpoint. So two changes:
>>
>> * We just need a boolean that starts at true every checkpoint and gets
>> set to false anytime someone resets full_page_writes or archive_command.
>> If the flag is set && full_page_writes = on then we skip the checkpoint
>> entirely and use the value from the last checkpoint.
>
> Sounds good.

I attached the self-contained patch to skip checkpoint at pg_start_backup.

>
> Does pg_start_backup on the standby (which you are probably planning?)
> also need this logic? If so, resetting full_page_writes or archive_command
> should generate its own xlog record.

For now, the patch doesn't handle this.

>
> I have another thought: should we forbid resetting archive_command
> during online backup? Currently it is allowed. If we don't need to forbid
> it, we also don't need to track resets of archiving for a fast
> pg_start_backup.

For now, the patch doesn't handle this either.

Happy Holidays!

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 14:57:27 -
***************
*** 295,300 ****
--- 295,302 ----
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***************
*** 6025,6036 ****
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
--- 6027,6043 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* 
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(&xlogctl->info_lck);
+ 		xlogctl->bkpCount = 0;
+ 		xlogctl->bkpForceCkpt = true;
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
***************
*** 6502,6507 ****
--- 6509,6535 ----
  	}
  }
  
+ bool
+ assign_full_page_writes(bool newval, bool doit, GucSource source)
+ {
+ 	/*
+ 	 * If full_page_writes is reset, since all indispensable full pages
+ 	 * might not be written since last checkpoint, we force a checkpoint
+ 	 * at pg_start_backup.
+ 	 */
+ 	if (doit && fullPageWrites != newval)
+ 	{
+ 		/* use volati

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao

Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs  wrote:
> Yes, OK. So I think it would only work when full_page_writes = on, and
> has been on since last checkpoint. So two changes:
>
> * We just need a boolean that starts at true every checkpoint and gets
> set to false anytime someone resets full_page_writes or archive_command.
> If the flag is set && full_page_writes = on then we skip the checkpoint
> entirely and use the value from the last checkpoint.

Sounds good.

Does pg_start_backup on the standby (which you are probably planning?)
also need this logic? If so, resetting full_page_writes or archive_command
should generate its own xlog record.

I have another thought: should we forbid resetting archive_command
during online backup? Currently it is allowed. If we don't need to forbid
it, we also don't need to track resets of archiving for a fast
pg_start_backup.

>
> * My "infra" patch also had a modified version of pg_start_backup() that
> allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
> a waste of time, and I want to listen to everybody else now and change
> pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
> there.
>
> Can you work on those also?

Umm... I'm busy. Of course, I will try it if no one else volunteers.
But I'd like to put coding the core of sync rep ahead of this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:

> > We might ask why pg_start_backup() needs to perform checkpoint though,
> > since you have remarked that is a problem also.
> >
> > The answer is that it doesn't really need to, we just need to be certain
> > that archiving has been running since whenever we choose as the start
> > time. So we could easily just use the last normal checkpoint time, as
> > long as we had some way of tracking the archiving.
> >
> > ISTM we can solve the checkpoint problem more easily and it would
> > potentially save much more time than "tuning rsync for Postgres", which
> > is what the other idea amounted to. So I do see a solution that is both
> > better and more quickly achievable for 8.4.
> 
> Sounds good. I agree that pg_start_backup basically doesn't need a
> checkpoint. But with full_page_writes = off, we probably cannot get
> rid of it. Even with full_page_writes = on, since we cannot tell
> whether all indispensable full pages were written after the last
> checkpoint, pg_start_backup must do a checkpoint with
> "forcePageWrite = on".

Yes, OK. So I think it would only work when full_page_writes = on, and
has been on since last checkpoint. So two changes:

* We just need a boolean that starts at true every checkpoint and gets
set to false anytime someone resets full_page_writes or archive_command.
If the flag is set && full_page_writes = on then we skip the checkpoint
entirely and use the value from the last checkpoint.

* My "infra" patch also had a modified version of pg_start_backup() that
allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
a waste of time, and I want to listen to everybody else now and change
pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
there.
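
The first change above can be sketched as a small state machine. This is only an illustration under stated assumptions: the struct, field, and function names are hypothetical and are not taken from the patch or from xlog.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state; all names here are illustrative. */
typedef struct BackupCkptState
{
    bool full_page_writes;  /* current value of the full_page_writes setting */
    bool settings_stable;   /* true if nothing was reset since the last checkpoint */
} BackupCkptState;

/* At the end of every checkpoint the flag starts at true again. */
void
on_checkpoint_done(BackupCkptState *s)
{
    assert(s != NULL);
    s->settings_stable = true;
}

/* Any real change of full_page_writes clears the flag. */
void
on_full_page_writes_change(BackupCkptState *s, bool newval)
{
    assert(s != NULL);
    if (s->full_page_writes != newval)
    {
        s->full_page_writes = newval;
        s->settings_stable = false;
    }
}

/* A reset of archive_command clears the flag as well. */
void
on_archive_command_reset(BackupCkptState *s)
{
    assert(s != NULL);
    s->settings_stable = false;
}

/* Decision at backup start: reuse the last checkpoint only if it is safe. */
bool
backup_can_skip_checkpoint(const BackupCkptState *s)
{
    assert(s != NULL);
    return s->settings_stable && s->full_page_writes;
}
```

Note that turning full_page_writes off and back on leaves the flag cleared until the next checkpoint completes, which matches the "has been on since last checkpoint" condition.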

Can you work on those also?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Mon, Dec 22, 2008 at 1:29 PM, Fujii Masao  wrote:
> Not so simple.
>
> At least the primary has to additionally maintain the byte position the
> standby has already fsynced. The main difference from the current patch is
> whether the standby fsyncs the logfile when it fills even if you don't
> choose #4(fsync). In order to avoid having to go back and re-open prior
> logfiles when an fsync request comes along later, we would need to ignore
> the sync mode and make the standby fsync the logfile when it fills. This
> would degrade the performance periodically. Is this acceptable?
>
> I think there are four choices. Which do you prefer?
>
> 1) Accept the above change.
> 2) Go back and re-open prior logfiles when a fsync request comes along.
> 3) Stop the sync control by the primary and leave it to the standby.
> 4) Add new option to specify whether to permit optimistic fsync, this option
>    makes the standby fsync only the current logfile when a fsync request
>    comes along (don't go back and re-open prior logfiles).
>
> 2) would cause another performance degradation. 4) would furthermore
> confuse users about setting a sync mode. So, I prefer 3) though I'm sorry
> for digging up the discussion about transaction control. Please feel free
> to comment!

5) Only allow optimistic fsync

I'm going to adopt 5) for the next patch, at least for a while.
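
A rough model of the difference between choice 2) and optimistic fsync. It only counts operations instead of touching real files, and every name in it is a hypothetical stand-in rather than code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the standby's WAL files; all names are illustrative. */
typedef struct StandbyLog
{
    uint32_t current_seg;        /* logfile currently being written */
    uint32_t first_unsynced_seg; /* oldest logfile not known to be fsynced */
    uint32_t reopen_count;       /* how many closed logfiles were re-opened */
    uint32_t fsync_count;        /* how many fsync calls were issued */
} StandbyLog;

/* The current logfile fills; move on to the next one without fsyncing it. */
void
standby_segment_filled(StandbyLog *log)
{
    assert(log != NULL);
    log->current_seg++;
}

/*
 * Handle an fsync request and return the number of logfiles fsynced.
 * optimistic = true:  fsync only the current logfile (prior logfiles are
 *                     not re-opened, so their durability is not guaranteed).
 * optimistic = false: also go back and re-open every prior unsynced logfile.
 */
uint32_t
standby_handle_fsync(StandbyLog *log, bool optimistic)
{
    uint32_t synced = 0;

    assert(log != NULL);
    if (!optimistic)
    {
        while (log->first_unsynced_seg < log->current_seg)
        {
            log->reopen_count++;
            log->fsync_count++;
            log->first_unsynced_seg++;
            synced++;
        }
    }

    /* both modes fsync the logfile that is currently open */
    log->fsync_count++;
    synced++;
    return synced;
}
```

After two logfiles have filled, the optimistic call issues one fsync and no re-opens, while the non-optimistic call re-opens both prior logfiles first. That extra work is the performance cost mentioned for choice 2).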

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Wed, Dec 24, 2008 at 2:37 AM, Simon Riggs  wrote:
>
> On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:
>
>> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We should
>> rethink the question: "Why does the failed server always need a fresh
>> backup?" We discussed it previously, though, and concluded that it should
>> be done next time.
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php
>
> We might ask why pg_start_backup() needs to perform checkpoint though,
> since you have remarked that is a problem also.
>
> The answer is that it doesn't really need to, we just need to be certain
> that archiving has been running since whenever we choose as the start
> time. So we could easily just use the last normal checkpoint time, as
> long as we had some way of tracking the archiving.
>
> ISTM we can solve the checkpoint problem more easily and it would
> potentially save much more time than "tuning rsync for Postgres", which
> is what the other idea amounted to. So I do see a solution that is both
> better and more quickly achievable for 8.4.

Sounds good. I agree that pg_start_backup basically doesn't need a
checkpoint. But with full_page_writes = off, we probably cannot get
rid of it. Even with full_page_writes = on, since we cannot tell
whether all indispensable full pages were written after the last
checkpoint, pg_start_backup must do a checkpoint with
"forcePageWrite = on".

The problem is that online backup itself is unsafe. Even if there is no
disk failure (i.e. the normal case), we can easily produce a partial write
in an online backup. So we always need full pages when recovering from an
online backup, and therefore pg_start_backup always needs a checkpoint
with forcePageWrite = on.

I think that we probably have to track the history of full_page_writes
in order to get rid of the checkpoint in pg_start_backup.

On the other hand, the data after a crash other than a media crash
is "safe". Currently, we can recover it without full page writes,
as in the simple crash recovery case. I think that we can use this for
archive recovery too, because there isn't really any distinction between
the two. I haven't found a corner case yet. Do you have one?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Wed, 2008-12-24 at 02:23 +0900, Fujii Masao wrote:

> Oh, sorry. I don't want to scare you ;) But, yes, it's important. We should
> rethink the question: "Why does the failed server always need a fresh
> backup?" We discussed it previously, though, and concluded that it should
> be done next time.
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

We might ask why pg_start_backup() needs to perform checkpoint though,
since you have remarked that is a problem also.

The answer is that it doesn't really need to, we just need to be certain
that archiving has been running since whenever we choose as the start
time. So we could easily just use the last normal checkpoint time, as
long as we had some way of tracking the archiving.

ISTM we can solve the checkpoint problem more easily and it would
potentially save much more time than "tuning rsync for Postgres", which
is what the other idea amounted to. So I do see a solution that is both
better and more quickly achievable for 8.4.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Wed, Dec 24, 2008 at 12:38 AM, Simon Riggs  wrote:
> Perhaps, but why do you say that?

Since you have often pointed out that getting a backup is not a problem
because of incremental backup (e.g. rsync), I just thought so.

> I've not blocked you from adding
> anything useful to Postgres.

Yes, I see.

> You scare me that you see failover as sufficiently frequent that you are
> worried that being without one of the servers for an extra 60 seconds
> during a failover is a problem. And then say you're not going to add the
> feature after all. I really don't understand. If it's important, add the
> feature, the whole feature that is. If not, don't.

Oh, sorry. I don't want to scare you ;) But, yes, it's important. We should
rethink the question: "Why does the failed server always need a fresh
backup?" We discussed it previously, though, and concluded that it should
be done next time.
http://archives.postgresql.org/pgsql-hackers/2008-11/msg01612.php

> My expectation is that most failovers are serious ones, that the primary
> system is down and not coming back very fast. Your worries seem to come
> from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

As you say, not *all* failovers are serious ones. I think that a user
would choose the most convenient restart method according to his
or her situation (come back immediately? need careful diagnosis?).

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Mark Mielke


Simon Riggs wrote:
> You scare me that you see failover as sufficiently frequent that you are
> worried that being without one of the servers for an extra 60 seconds
> during a failover is a problem. And then say you're not going to add the
> feature after all. I really don't understand. If it's important, add the
> feature, the whole feature that is. If not, don't.
>
> My expectation is that most failovers are serious ones, that the primary
> system is down and not coming back very fast. Your worries seem to come
> from a scenario where the primary system is still up but Postgres
> bounces/crashes, we can diagnose the cause of the crash, decide the
> crashed server is safe and then wish to recommence operations on it
> again as quickly as possible, where seconds count in doing so.
>
> Are failovers going to be common? Why?

Hi Simon:

I agree with most of your criticism of the "fail over only approach", but
I don't agree that fail over frequency should really affect expectations
for the failed system to return to service. I see "soft" fails (*not*
serious) as potentially common: somewhere on the network, something went
down or some packet was lost, and the system took a few too many seconds
to respond. My expectation is that the system can quickly detect that the
node is out of service and remove it from the pool and that, when the
situation is resolved (often automatically, outside of my control), the
node can automatically "catch up" and be put back into the pool. Having to
run some other process such as rsync seems unreliable, as we already have
a mechanism for streaming the data. All that is missing is streaming from
an earlier point in time to catch up efficiently and reliably.


I think I'm talking more about the complete solution though which is in 
line with what you are saying? :-)


Cheers,
mark

--
Mark Mielke 


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 23:31 +0900, Fujii Masao wrote:
> Hi,
> 
> On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs  wrote:
> > I'm happy if that whole feature is added. If we do add it, it will be a
> > utility like "pg_resync". So in admin terms it will be almost identical
> > to using rsync, just a specific version that minimizes effort even more
> > than rsync does currently. The only difference as I see it would be some
> > gain in performance, but we don't need to send the whole database down
> > the wire again in either case.
> 
> I think that the type of your user is different from mine. 

Perhaps, but why do you say that? I've not blocked you from adding
anything useful to Postgres.

> If the server fails
> by simple termination of the process, I don't want to spend 1 min on
> restarting, beyond the catch-up itself. For me, getting a fresh backup
> (not only copying the backup data but also the checkpoint by
> pg_start_backup) is an expensive operation.

As I said: "I'm happy if that whole feature is added."

You scare me that you see failover as sufficiently frequent that you are
worried that being without one of the servers for an extra 60 seconds
during a failover is a problem. And then say you're not going to add the
feature after all. I really don't understand. If it's important, add the
feature, the whole feature that is. If not, don't.

My expectation is that most failovers are serious ones, that the primary
system is down and not coming back very fast. Your worries seem to come
from a scenario where the primary system is still up but Postgres
bounces/crashes, we can diagnose the cause of the crash, decide the
crashed server is safe and then wish to recommence operations on it
again as quickly as possible, where seconds count in doing so.

Are failovers going to be common? Why?

> Of course, since I'm not planning to tackle that problem in 8.4,

If you change your mind, having it in 8.4 would be good. 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Tue, Dec 23, 2008 at 11:31 PM, Fujii Masao  wrote:
> Of course, since I'm not planning to tackle that problem in 8.4,
> I would not add "additional" synchronization point.

On second thought:
for the normal shutdown case, we should probably force synchronous
replication at least in CreateCheckPoint.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Tue, Dec 23, 2008 at 10:41 PM, Simon Riggs  wrote:
> I'm happy if that whole feature is added. If we do add it, it will be a
> utility like "pg_resync". So in admin terms it will be almost identical
> to using rsync, just a specific version that minimizes effort even more
> than rsync does currently. The only difference as I see it would be some
> gain in performance, but we don't need to send the whole database down
> the wire again in either case.

I think that your type of user is different from mine. If the server fails
by simple termination of the process, I don't want to spend 1 min on
restarting, beyond the catch-up itself. For me, getting a fresh backup
(not only copying the backup data but also the checkpoint by
pg_start_backup) is an expensive operation.

Of course, since I'm not planning to tackle that problem in 8.4,
I would not add an "additional" synchronization point.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 18:36 +0530, Pavan Deolasee wrote:

> Personally, I would like to have a
> simple setup where I can initially set up the primary and standby and they
> continue to work in a single-failure mode without any additional
> administrative overhead (such as rsync). But that's just me, and I
> don't know what the preferred option in the field is.

If you want a tripod, you need to turn up with all 3 legs. :-) 

PostgreSQL is a working product, not a framework or a function library.
We're not going to add code that has no function at all other than as
part of a larger feature, unless we add the whole feature.

I'm happy if that whole feature is added. If we do add it, it will be a
utility like "pg_resync". So in admin terms it will be almost identical
to using rsync, just a specific version that minimizes effort even more
than rsync does currently. The only difference as I see it would be some
gain in performance, but we don't need to send the whole database down
the wire again in either case.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Pavan Deolasee

On Tue, Dec 23, 2008 at 5:55 PM, Simon Riggs  wrote:
>
>
> We stream constantly from primary to standby. That point is not being
> debated. The issue is whether we should add additional synchronisation
> points (i.e. additional times we need to wait) into the WAL stream.
> Currently, I have said no because this has no purpose in the current
> design: definitely not performance, not robustness, not code clarity.
>
> Specifically, we're talking about slowing down WAL flushes required
> because of dirty page replacement, amongst others. That's not something
> I want to see slowed down on a server that has specifically opted for
> asynchronous replication, presumably because of a slow link. The other
> call points are also potential contention points.

So we would still be sending WAL to the standby at XLogWrite time (and I
think that's necessary). The question is whether we should wait for the
standby ack at XLogFlush time, right? Hmm. I think the argument for
that would be what Fujii-san described for maintaining consistency
between data and WAL. I agree with you that we should add additional
synchronization points only if they give us real value in
administering a replication setup. Personally, I would like to have a
simple setup where I can initially set up the primary and standby and they
continue to work in a single-failure mode without any additional
administrative overhead (such as rsync). But that's just me, and I
don't know what the preferred option in the field is.

BTW, I wouldn't be too worried about the dirty buffer case, because the
WAL synchronization at that point usually occurs much later than the
WAL is actually sent to the standby. I would imagine that most of the
time the WAL would have made it to the standby by that time.
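
A toy model of the two steps being discussed: WAL is shipped at write time, and a flush may or may not wait for the standby's acknowledgement. The types and function names are hypothetical and only illustrate the distinction; they are not the patch's interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* simplified log position */

/* Hypothetical replication state on the primary; names are illustrative. */
typedef struct ReplState
{
    Lsn  sent;                  /* WAL shipped to the standby so far */
    Lsn  acked;                 /* WAL the standby has acknowledged */
    bool wait_at_flush;         /* policy under debate: wait for ack in flush? */
} ReplState;

/* Write time: WAL is always shipped, regardless of the policy. */
void
repl_xlog_write(ReplState *r, Lsn upto)
{
    assert(r != NULL);
    if (upto > r->sent)
        r->sent = upto;
}

/* The standby acknowledges WAL; it cannot ack more than was sent. */
void
repl_standby_ack(ReplState *r, Lsn upto)
{
    assert(r != NULL);
    if (upto > r->sent)
        upto = r->sent;
    if (upto > r->acked)
        r->acked = upto;
}

/* Flush time: would the caller have to block until the standby acks? */
bool
repl_flush_must_wait(const ReplState *r, Lsn upto)
{
    assert(r != NULL);
    return r->wait_at_flush && r->acked < upto;
}
```

In this model a flush blocks only when the waiting policy is on and the acknowledgement has not yet arrived. If the ack usually arrives well before a dirty buffer forces the flush, as suggested above, the extra wait would rarely be paid.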

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 16:54 +0530, Pavan Deolasee wrote:
> On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao  wrote:
> >
> > But, since I cannot obtain consensus from hackers including you,
> > I would change my course, and forbid XLogFlush (called from other
> > than RecordTransactionCommit) to replicate xlog synchronously
> > if asynchronous replication case.
> 
> Since synchronous/asynchronous behavior of replication is tied to a
> transaction (even if there is global default) , I don't understand why
> we should not ship the xlogs to the standby when xlogs are written on
> primary outside of a transaction context.  This is quite same as we do
> with asynchronous_commit where we flush the xlog to disk at certain
> points irrespective of the synchronization set.

We stream constantly from primary to standby. That point is not being
debated. The issue is whether we should add additional synchronisation
points (i.e. additional times we need to wait) into the WAL stream.
Currently, I have said no because this has no purpose in the current
design: definitely not performance, not robustness, not code clarity.

Specifically, we're talking about slowing down WAL flushes required
because of dirty page replacement, amongst others. That's not something
I want to see slowed down on a server that has specifically opted for
asynchronous replication, presumably because of a slow link. The other
call points are also potential contention points.

> > Yes, switchover is one of case example I care. Typically, I care
> > about restarting the failed server (original primary) after failover:
> >
> 
> I think this is a very important requirement because it's quite
> unrealistic to expect that every time there is a failover, fresh
> backup is required for the old primary to join back the replication.

I personally don't expect that, because we have rsync.

If that is a very important requirement then the current software needs
to include all the aspects of a feature, not just some of them. Either
we include a whole feature or we leave it out. A release will need to
stand for 5+ years, so supporting extraneous features is troublesome and
wasteful.

Currently, Fujii-san has stated he is not planning to allow fast
resynchronization in 8.4, so why would we need this?

If we were to add fast resynchronisation as a feature in 8.4, then I
will be happy to have *all* required changes included. People mention it
enough that I would be happy to see the whole feature added in this
release.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Pavan Deolasee

On Tue, Dec 23, 2008 at 4:23 PM, Fujii Masao  wrote:
>
>
> But, since I cannot obtain consensus from hackers including you,
> I would change my course, and forbid XLogFlush (called from other
> than RecordTransactionCommit) to replicate xlog synchronously
> if asynchronous replication case.
>

Since synchronous/asynchronous behavior of replication is tied to a
transaction (even if there is a global default), I don't understand why
we should not ship the xlogs to the standby when xlogs are written on the
primary outside of a transaction context. This is much the same as what we do
with asynchronous_commit, where we flush the xlog to disk at certain
points irrespective of the synchronization setting.

> Yes, switchover is one of case example I care. Typically, I care
> about restarting the failed server (original primary) after failover:
>

I think this is a very important requirement because it's quite
unrealistic to expect that every time there is a failover, a fresh
backup is required for the old primary to rejoin replication.

> -
> 1. a dirty buffer page is chosen as victim of buffer replacement
> 2. flush xlog up to the buffer's LSN on only primary
> 3. write out the dirty buffer page
> 4. primary fails
>(replication up to buffer's LSN is not performed)
>
> The above case produces inconsistency between data on the
> original primary (failed server) and xlogs on the original standby
> (new primary after failover). Isn't this right?
>

Yes, it would create inconsistency which I don't think can be
corrected without a fresh backup.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Tue, Dec 23, 2008 at 6:28 PM, Simon Riggs  wrote:
>
> On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote:
>> > I don't get this argument. Why would we care what happens on the
>> > failed server?
>>
>> It's because, in the future, I'd like to use the data on the failed
>> server when making it catch up with new primary. This desire might be
>> violated by the inconsistency which I described.
>
> I don't really understand why you would put something in there that has
> no use at all. Why make every server in the world do extra
> synchronisation?
>
> Whatever you build in the future can include this, if that is still a
> required point at the time you add the new feature.

Right. But since it is difficult to change a specification once it is fixed,
I am thinking about this now with the future in mind.

But, since I cannot obtain consensus from hackers including you,
I will change course and forbid XLogFlush (when called from anywhere
other than RecordTransactionCommit) from replicating xlog synchronously
in the asynchronous replication case.

BTW, here are the callers other than RecordTransactionCommit:
- CreateCheckPoint()
- EndPrepare()
- FlushBuffer()
- RecordTransactionAbortPrepared()
- RecordTransactionCommitPrepared()
- RelationTruncate()
- SlruPhysicalWritePage()
- WriteTruncateXlogRec()
- XLogAsyncCommitFlush()

>
> Are you thinking about switchover rather than failover? I'm sure a
> graceful switchover doesn't need this.

Yes, switchover is one of the cases I care about. Typically, I care
about restarting the failed server (original primary) after failover:

-
1. a dirty buffer page is chosen as the victim of buffer replacement
2. flush xlog up to the buffer's LSN on the primary only
3. write out the dirty buffer page
4. primary fails
(replication up to the buffer's LSN is not performed)

The above case produces an inconsistency between the data on the
original primary (failed server) and the xlogs on the original standby
(new primary after failover). Isn't this right?

5. restart the failed server and make it catch up with the new primary

We cannot reuse the existing data on the failed server because
of that inconsistency. I think this restriction should be removed.
-
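
The failure sequence above can be replayed with a small toy model. This is
only a sketch: the Primary and Standby classes, the run_scenario helper, and
the use of plain integers as LSNs are assumptions made for illustration and
do not correspond to PostgreSQL data structures or to the patch.

```python
class Standby:
    def __init__(self):
        self.received_lsn = 0  # highest WAL position received from the primary


class Primary:
    def __init__(self, standby):
        self.standby = standby
        self.flushed_lsn = 0       # WAL flushed to the primary's local disk
        self.page_lsn_on_disk = 0  # LSN of the newest data page written out

    def flush_wal(self, lsn, replicate):
        """Flush local WAL up to lsn; optionally ship it to the standby too."""
        self.flushed_lsn = max(self.flushed_lsn, lsn)
        if replicate:
            self.standby.received_lsn = max(self.standby.received_lsn, lsn)

    def evict_dirty_page(self, page_lsn, replicate):
        # Steps 1-3: apply the WAL rule locally, then write the page out.
        self.flush_wal(page_lsn, replicate)
        self.page_lsn_on_disk = max(self.page_lsn_on_disk, page_lsn)


def data_ahead_of_standby(primary):
    """True if the primary's data files hold changes the standby never received."""
    return primary.page_lsn_on_disk > primary.standby.received_lsn


def run_scenario(replicate):
    primary = Primary(Standby())
    primary.evict_dirty_page(page_lsn=100, replicate=replicate)
    # Step 4: the primary fails here, so nothing more is sent.
    return data_ahead_of_standby(primary)
```

In this model, run_scenario(False) reports the inconsistency described in
steps 1-4, while run_scenario(True), where the flush also waits for the
standby, does not.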

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Tue, 2008-12-23 at 18:00 +0900, Fujii Masao wrote:
> > I don't get this argument. Why would we care what happens on the
> > failed server?
> 
> It's because, in the future, I'd like to use the data on the failed
> server when making it catch up with new primary. This desire might be
> violated by the inconsistency which I described.

I don't really understand why you would put something in there that has
no use at all. Why make every server in the world do extra
synchronisation? 

Whatever you build in the future can include this, if that is still a
required point at the time you add the new feature.

Are you thinking about switchover rather than failover? I'm sure a
graceful switchover doesn't need this.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Fujii Masao

Hi,

On Tue, Dec 23, 2008 at 5:22 PM, Simon Riggs  wrote:
> On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote:
>
>> > XLogFlush() flushes because of an interlock between a dirty buffer write
>> > and an outstanding WAL write. Dirty buffer writes are not replicated, so
>> > there is no need to have a similar interlock on WAL streaming.
>> >
>> > So making those call points synchronous is possible, but neither
>> > necessary or IMHO desirable.
>>
>> Yes in upcoming 8.4, but probably no in the future.
>>
>> What if the primary fails after writing the dirty data buffer before sending
>> the corresponding logs? This would make data on the primary and logs
>> on the standby inconsistent. In 8.4, such inconsistency might not matter
>> because we don't use the data on the failed primary for recovery (when
>> restarting the failed server, we always need a fresh backup). But, since
>> this restriction is not good for some people, in the future, the failed server
>> should restart without a fresh backup, and the inconsistency would be
>> problem. So, I think that the inconsistency should be removed even if
>> asynchronous replication case, and we should enforce "WAL rule" over
>> some servers.
>
> I don't get this argument. Why would we care what happens on the failed 
> server?

It's because, in the future, I'd like to use the data on the failed server when
making it catch up with the new primary. That goal might be defeated by the
inconsistency which I described.

>
> The additional synchronizations you suggest are neither necessary, nor
> IMHO desirable.

Not additional. It's quite analogous to synchronous_commit.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-23 Thread Simon Riggs

On Sun, 2008-12-21 at 14:46 +0900, Fujii Masao wrote:

> > XLogFlush() flushes because of an interlock between a dirty buffer write
> > and an outstanding WAL write. Dirty buffer writes are not replicated, so
> > there is no need to have a similar interlock on WAL streaming.
> >
> > So making those call points synchronous is possible, but neither
> > necessary or IMHO desirable.
> 
> Yes in upcoming 8.4, but probably no in the future.
> 
> What if the primary fails after writing the dirty data buffer before sending
> the corresponding logs? This would make data on the primary and logs
> on the standby inconsistent. In 8.4, such inconsistency might not matter
> because we don't use the data on the failed primary for recovery (when
> restarting the failed server, we always need a fresh backup). But, since
> this restriction is not good for some people, in the future, the failed server
> should restart without a fresh backup, and the inconsistency would be
> problem. So, I think that the inconsistency should be removed even if
> asynchronous replication case, and we should enforce "WAL rule" over
> some servers.

I don't get this argument. Why would we care what happens on the failed server?

The additional synchronizations you suggest are neither necessary, nor
IMHO desirable.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-21 Thread Fujii Masao

Hi,

On Wed, Dec 17, 2008 at 12:07 PM, Fujii Masao  wrote:
>> No, we've been through that loop already a few months back:
>> Transaction-controlled robustness.
>>
>> It should be up to the client on the primary to decide how much waiting
>> they would like to perform in order to provide a guarantee. A change of
>> setting on the standby should not be allowed to alter the performance or
>> durability on the primary.
>
> OK. I will extend synchronous_replication, make walsender send XLOG
> with synchronization mode flag and make walreceiver perform according
> to the flag.

Not so simple.

At least, the primary has to additionally maintain the byte position that the
standby has already fsynced. The main difference from the current patch is whether
the standby fsyncs the logfile when it fills even if you don't choose #4 (fsync).
To avoid having to go back and re-open prior logfiles when an
fsync request comes along later, we would need to ignore the sync mode and
make the standby fsync the logfile when it fills. This would degrade
performance periodically. Is this acceptable?

I think there are four choices. Which do you prefer?

1) Accept the above change.
2) Go back and re-open prior logfiles when an fsync request comes along.
3) Stop the sync control by the primary and leave it to the standby.
4) Add a new option to specify whether to permit optimistic fsync. This option
makes the standby fsync only the current logfile when an fsync request
comes along (don't go back and re-open prior logfiles).

2) would cause another performance degradation. 4) would furthermore
confuse users about setting a sync mode. So, I prefer 3) though I'm sorry
for digging up the discussion about transaction control. Please feel free
to comment!
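
To make choice 1) more concrete, here is a toy sketch of a standby that
fsyncs each logfile as it fills, so that a later fsync request only ever has
to touch the current logfile. The StandbyLog class, the tiny SEGMENT_SIZE
value, and the idea of returning the fsynced position as an acknowledgement
are all assumptions for illustration, not the behavior of the patch.

```python
SEGMENT_SIZE = 16  # toy logfile size in bytes; illustrative only


class StandbyLog:
    def __init__(self):
        self.write_pos = 0    # bytes received and written
        self.fsync_pos = 0    # bytes known to be on disk
        self.fsync_calls = 0  # how often we paid for an fsync

    def _fsync_upto(self, pos):
        self.fsync_pos = pos
        self.fsync_calls += 1

    def receive(self, nbytes, fsync_requested=False):
        """Write nbytes, fsyncing every logfile that becomes full."""
        end = self.write_pos + nbytes
        while self.write_pos < end:
            seg_end = (self.write_pos // SEGMENT_SIZE + 1) * SEGMENT_SIZE
            self.write_pos = min(end, seg_end)
            if self.write_pos == seg_end:
                # The logfile is full: fsync it regardless of the sync mode.
                self._fsync_upto(seg_end)
        if fsync_requested and self.fsync_pos < self.write_pos:
            # Only the current logfile can still hold unsynced bytes.
            self._fsync_upto(self.write_pos)
        # Reported back so the primary can track the fsynced byte position.
        return self.fsync_pos
```

The cost that the question above refers to shows up as the fsync issued
whenever a logfile fills, even if no transaction asked for it.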

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-21 Thread Markus Wanner

Hi,

Simon Riggs wrote:
> The second way can be done by taking a snapshot on the primary, with an
> associated LSN, then using that snapshot on the standby. That is
> somewhat complex, but possible. I see the requirement for getting the
> same answer on multiple nodes as a further extension of "transaction
> isolation mode" and think that not all people will want this, so we
> should allow that as an option.

I've been thinking a bit about this pretty interesting idea. It's
certainly of interest for Postgres-R as well.

AFAIK a function could simply wait until the node which is being
queried reaches a given point in the application of transactions (an
LSN, in the Sync-Rep world). Calling such a waiting function just after
BEGIN would ensure that the transaction sees (at least) the given snapshot.
If that snapshot has already been reached or passed, the function does nothing.
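
A minimal sketch of such a waiting function, assuming the node tracks the
last applied LSN as an integer. The ReplayProgress class and its method names
are hypothetical; neither Sync Rep nor Postgres-R is claimed to expose this
interface.

```python
import threading


class ReplayProgress:
    """Tracks how far a node has applied the replicated transaction stream."""

    def __init__(self):
        self._applied_lsn = 0
        self._cond = threading.Condition()

    def advance(self, lsn):
        """Called by the apply process after replaying changes up to lsn."""
        with self._cond:
            if lsn > self._applied_lsn:
                self._applied_lsn = lsn
                self._cond.notify_all()

    def wait_for_lsn(self, target_lsn, timeout=None):
        """Block until target_lsn has been applied.

        Returns immediately if the node is already at or past target_lsn.
        Returns True if the LSN was reached, False if the timeout expired.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._applied_lsn >= target_lsn, timeout=timeout
            )
```

A reader would call wait_for_lsn() right after BEGIN with the LSN it needs to
see; the timeout is one way to cap how long a lagging node may hold it up.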

What I like is that it's optimistic, in that the wait is only enforced
when needed by the reader. However, unlike enforcing the wait before
COMMIT, it requires changing the application to cope with this behavior
of the distributed database system. And knowing when to require which
snapshot sounds rather difficult from the point of view of the
application developer.

Also note that it might be the issuer of the transaction who wants to
ensure "his" transaction got propagated to the remote nodes.

> I'm not going to worry about this at the moment. Hot standby will be
> useful without this and so I regard this as a secondary objective. Rome
> wasn't built in a single release, or something like that.

Sounds like a decent plan. Good luck.

Regards

Markus Wanner


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Fujii Masao

Hi,

On Fri, Dec 19, 2008 at 5:50 PM, Simon Riggs  wrote:
>
> On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote:
>
>> > Yes, please check the call points for ForceSyncCommit.
>> >
>> > Do I think every xlog flush should be synchronous, no, I don't.
>> > That's why we have a user settable parameter for it.
>>
>> Umm.. I focus attention on XLogFlush() called except
>> RecordTransactionCommit(). For example, FlushBuffer(),
>> WriteTruncateXlogRec().. etc. These XLogFlush() might
>> flush XLOG synchronously even if asynchronous commit case.
>
> XLogFlush() flushes because of an interlock between a dirty buffer write
> and an outstanding WAL write. Dirty buffer writes are not replicated, so
> there is no need to have a similar interlock on WAL streaming.
>
> So making those call points synchronous is possible, but neither
> necessary or IMHO desirable.

Yes in the upcoming 8.4, but probably no in the future.

What if the primary fails after writing the dirty data buffer but before sending
the corresponding logs? This would make the data on the primary and the logs
on the standby inconsistent. In 8.4, such an inconsistency might not matter
because we don't use the data on the failed primary for recovery (when
restarting the failed server, we always need a fresh backup). But, since
this restriction is not good for some people, in the future the failed server
should restart without a fresh backup, and the inconsistency would be
a problem. So, I think that the inconsistency should be removed even in the
asynchronous replication case, and we should enforce the "WAL rule" across
servers.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner

Hi,

Mark Mielke wrote:
> Good answers, Markus. Thanks.

You are welcome.

> So it looks like there is value to both ends of the spectrum, and while
> I feel the most value would be in providing a very fast system that
> scales near linear to the number of nodes in the system, even at the
> expense of immediately visible transactions from all servers, I can
> accept that sometimes the expectations are stricter and would appreciate
> seeing an option to let me choose based upon my requirements.

I absolutely agree with that. The original Postgres-R algorithm covers the
eager (or virtually synchronous) part. I'm planning to extend it with a
(fully) synchronous mode and let the user choose per transaction.

Regards

Markus Wanner



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner

Hi,

Josh Berkus wrote:
> Peter Eisentraut wrote:
>> It's the color of the bikeshed ...

Agreed. It's why I've decided to support various modes for Postgres-R.
I'm glad to see that the current "Sync Rep" approach does the same.

> Hmmm.  I thought this was pretty clear.  There's three levels of synch
> which are useful features:
> 
> 1) "synchronus" standby which is really asynchronous, but only has a gap
> of < 100ms.

A synchronous standby which is really asynchronous? That's exactly the
naming challenge I've been pointing to.

Commonly used terms are: "virtually synchronous", "approximately
synchronous", "near-real-time replication" or "eager replication", but
for most users, this is not "synchronous" (enough).

(BTW: there's no such "< 100 ms" guarantee. It may be typically below
100 ms, or even below 10 ms on average. But replication is not about the
typical or average case. It's much more about failures and uncommon
cases. The guarantee you can get in such a system (by declaring a node
as dead) is much more likely to be within the range of several seconds
and more, be it network, disk or whatever other failure-timeout that
applies here.)

> 2) Synchronous standby which guarentees that all committed transactions
> are on the failover node and that no data will be lost for failover, but
> the failover node is still in standby mode.

What's the difference to 1) here? I'm not following.

> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.

So, a synchronous standby is different from synchronous replication in
that it's asynchronous?

Sorry for bugging with naming, but I think it is important for an
understanding during development.

> Any of these levels would be useful and allow a certain number of our
> users to deploy PostgreSQL in an environment where it wasn't used
> before.

I absolutely agree with that statement.

However, please do not confuse future users (and today's hackers), but
instead use existing terms consistently and clearly. Something that lags
behind, potentially by several seconds (in case of failure) is commonly
considered asynchronous, no matter how close to "immediate" it is on
average.

Regards

Markus Wanner


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner

Hi,

Mark Mielke wrote:
> Robert Haas wrote:
>> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane  wrote:
>>> We won't call it anything, because we never will or can implement that.
>>> See the theory of relativity: the notion of exactly simultaneous events
>>
>> OK, fine.  I'll be more precise.  I think we need to reserve the term
>> "synchronous replication" for a system where transactions that begin
>> on the standby after the transactions has committed on the master see
>> the effects of the committed transaction.

I agree with Robert here. As far as I know this is the common
understanding of "synchronous replication". Everything less - including
Postgres-R - is considered to be asynchronous.

> I'd like to see proof of some sort that PostgreSQL guarantees that the
> instant a 'commit' returns, any transactions already open with the
> appropriate transaction isolation level, or any new sessions *will* see
> the results of the commit.

Given within this thread, here [1].

> Two phase commit doesn't imply that the transaction is guaranteed to be
> immediately visible.

Just for the record: that's plain wrong. As with any other transaction,
a COMMIT of a prepared transaction guarantees visibility from all
subsequent snapshots (at least for Postgres and other serious RDBSen).

Systems based on 2PC are the typical synchronous replication solution:
works, resistant to failures, consistent across nodes (WRT visibility),
but unusably slow. This is what people have in mind and expect when they
hear "synchronous replication" for databases. (And which is why I'm
thinking it's better for an optimized solution not to call itself
"synchronous").

> Unless transactions are
> locked from starting until they are able to prove that they have the
> latest commit

See the cited README. It already happens for (single node) Postgres
systems, because the action of snapshot taking and committing are
serialized.

> (a feat which I'm going to theorize as impossible -
> because the moment you wait for a commit, and you begin again, you
> really have no guarantee that another commit has not occurred in the
> mean time)

This problem is solved by locking.

Regards

Markus Wanner

[1]: Hints to docs and source, that COMMIT actually ensures subsequent
snapshots "include" changes of the committed transaction:
http://archives.postgresql.org/message-id/494c.2060...@bluegap.ch


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Mark Mielke


Good answers, Markus. Thanks.

I've bought the thinking of several here that the user should have some 
control over what they expect (and what optimizations they are willing 
to accept as a good choice), but that commit should still be able to 
have a capped time limit.


I can think of many of my own applications where I would choose one mode 
vs another mode, even within the same application, depending on the 
operation itself. The most important requirement is that transactions 
are durable. It becomes convenient, though, to provide additional 
guarantees for some operation sequences.


I still see the requirement for seat reservation, bank account, or stock 
trading, as synchronizing using read-write locks before starting the 
select, rather than enforcing latest on every select.


For my own bank, when I do an online transaction, operations don't 
always immediately appear in my list of transactions. They appear to 
sometimes be batched, sometimes in near real time, and sometimes as part 
of some sort of day end processing.


For seat reservation, the time the seat layout is shown on the screen is 
not usually locked during a transaction. Between the time the travel 
agent brings up the seats on the plane, and the time they select the 
seat, the seat could be taken. What's important is that the reservation 
is durable, and that conflicts are not introduced. The commit must fail 
if another person has chosen the seat already. The commit does 
not need to wait until the reservation is pushed out to all systems 
before completing. The same is true of stock trading.


However, it can be very convenient for commits to be immediately visible 
after the commit completes. This allows for lazier models, such as a web 
site that reloads the view on the reservations or recent trades and 
expects to see recent commits no matter which server it accesses, rather 
than taking into account that the commit succeeded when presenting the 
next view.


If I look at sites like Google - they take the opposite extreme. I can 
post a message, and it remembers that I posted the message and makes it 
immediately visible, however, I might not see other new messages in a 
thread until a minute or more later.


So it looks like there is value to both ends of the spectrum, and while 
I feel the most value would be in providing a very fast system that 
scales near linear to the number of nodes in the system, even at the 
expense of immediately visible transactions from all servers, I can 
accept that sometimes the expectations are stricter and would appreciate 
seeing an option to let me choose based upon my requirements.


Cheers,
mark


Markus Wanner wrote:
> Hi,
>
> Mark Mielke wrote:
>> Where does the expectation come from?
>
> I find the seat reservation, bank account or stock trading examples
> pretty obvious WRT user expectations.
>
> Nonetheless, I've compiled some hints from the documentation and sources:
>
> "Since in Read Committed mode each new command starts with a new
> snapshot that includes all transactions committed up to that instant" [1].
>
> "This [SERIALIZABLE ISOLATION] level emulates serial transaction
> execution, as if transactions had been executed one after another,
> serially, rather than concurrently." [1].  (IMO this implies, that a
> transaction "sees" changes from all preceding transactions).
>
> "All changes made by the transaction become visible to others and are
> guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not
> overly clear here, when exactly the changes become visible. OTOH,
> there's no warning, that another session doesn't immediately see
> committed transactions. Not sure where you got that from).
>
>> I don't recall ever reading it in
>> the documentation, and unless the session processes are contending over
>> the integers (using some sort of synchronization primitive) in memory
>> that represent the "latest visible commit" on every single select, I'm
>> wondering how it is accomplished?
>
> See the transaction system's README [3]. It documents the process of
> snapshot taking and transaction isolation pretty well. Around line 226
> it says: "What we actually enforce is strict serialization of commits
> and rollbacks with snapshot-taking". (So the outcome of your experiment
> is no surprise at all).
>
> And a bit later: "This rule is stronger than necessary for consistency,
> but is relatively simple to enforce, and it assists with some other
> issues as explained below.". While this implies, that an optimization is
> theoretically possible, I very much doubt it would be worth it (for a
> single node system).
>
> In a distributed system, things are a bit different. Network latency is
> an order of magnitude higher than memory latency (for IPC). So a similar
> optimization is very well worth it. However, the application (or the
> load balancer or both) need to know about this potential lag between
> nodes. And as you've outlined elsewhere, a limit for how much a single
> node may lag behind needs to be established.
>
> (As a side not

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-20 Thread Markus Wanner

Hi,

Mark Mielke wrote:
> Where does the expectation come from?

I find the seat reservation, bank account or stock trading examples
pretty obvious WRT user expectations.

Nonetheless, I've compiled some hints from the documentation and sources:

"Since in Read Committed mode each new command starts with a new
snapshot that includes all transactions committed up to that instant" [1].

"This [SERIALIZABLE ISOLATION] level emulates serial transaction
execution, as if transactions had been executed one after another,
serially, rather than concurrently." [1].  (IMO this implies that a
transaction "sees" changes from all preceding transactions).

"All changes made by the transaction become visible to others and are
guaranteed to be durable if a crash occurs." [2]. (Agreed, it's not
overly clear here when exactly the changes become visible. OTOH,
there's no warning that another session doesn't immediately see
committed transactions. Not sure where you got that from).

> I don't recall ever reading it in
> the documentation, and unless the session processes are contending over
> the integers (using some sort of synchronization primitive) in memory
> that represent the "latest visible commit" on every single select, I'm
> wondering how it is accomplished?

See the transaction system's README [3]. It documents the process of
snapshot taking and transaction isolation pretty well. Around line 226
it says: "What we actually enforce is strict serialization of commits
and rollbacks with snapshot-taking". (So the outcome of your experiment
is no surprise at all).

And a bit later: "This rule is stronger than necessary for consistency,
but is relatively simple to enforce, and it assists with some other
issues as explained below.". While this implies that an optimization is
theoretically possible, I very much doubt it would be worth it (for a
single node system).
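
The rule quoted from the README can be illustrated with a toy model in which
one lock covers both ending a transaction and taking a snapshot. This is a
sketch only: TxnManager, its methods, and the visible() helper are
hypothetical and far simpler than what PostgreSQL actually does, and the
model ignores rollbacks (every finished transaction counts as committed).

```python
import threading


class TxnManager:
    def __init__(self):
        # One lock serializes commits with snapshot-taking.
        self._lock = threading.Lock()
        self._next_xid = 1
        self._running = set()

    def begin(self):
        with self._lock:
            xid = self._next_xid
            self._next_xid += 1
            self._running.add(xid)
            return xid

    def commit(self, xid):
        with self._lock:
            self._running.discard(xid)

    def snapshot(self):
        """Return (xmax, in-progress xids) as of this instant."""
        with self._lock:
            return self._next_xid, frozenset(self._running)


def visible(xid, snapshot):
    """A transaction is visible if it started before the snapshot and has ended."""
    xmax, in_progress = snapshot
    return xid < xmax and xid not in in_progress
```

Because commit() and snapshot() take the same lock, any snapshot taken after
commit() returns necessarily sees that transaction as finished, which is the
single-node guarantee under discussion.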

In a distributed system, things are a bit different. Network latency is
an order of magnitude higher than memory latency (for IPC). So a similar
optimization is very well worth it. However, the application (or the
load balancer or both) need to know about this potential lag between
nodes. And as you've outlined elsewhere, a limit for how much a single
node may lag behind needs to be established.

(As a side note: for a multi-master system like Postgres-R, it's
beneficial to keep the lag time as low as possible, because the larger
the lag, the higher the probability for a conflict between two
transactions on different nodes.)

Regards

Markus Wanner

[1]: Pg 8.3 Docu: Concurrency Control:
http://www.postgresql.org/docs/8.3/static/transaction-iso.html

[2]: Pg 8.3 Docu: COMMIT command:
http://www.postgresql.org/docs/8.3/static/sql-commit.html

[3]: README of transam (src/backend/access/transam/README):
https://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/backend/access/transam/README#L224


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-19 Thread Simon Riggs


On Fri, 2008-12-19 at 11:04 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On a related but different point: We don't need an interlock between
> > dirty buffers and WAL during recovery because the WAL has already been
> > written.
> 
> Assuming the WAL has also been fsync'd.

True, so this will need to change for 8.4 also

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-19 Thread Heikki Linnakangas


Simon Riggs wrote:
> On a related but different point: We don't need an interlock between
> dirty buffers and WAL during recovery because the WAL has already been
> written.

Assuming the WAL has also been fsync'd.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-19 Thread Simon Riggs

On Fri, 2008-12-19 at 09:43 +0900, Fujii Masao wrote:

> > Yes, please check the call points for ForceSyncCommit.
> >
> > Do I think every xlog flush should be synchronous, no, I don't. That's
> > why we have a user settable parameter for it.
> 
> Umm.. I'm focusing on the XLogFlush() calls made outside
> RecordTransactionCommit(), for example in FlushBuffer(),
> WriteTruncateXlogRec(), etc. These XLogFlush() calls might flush XLOG
> synchronously even in the asynchronous commit case.

XLogFlush() flushes because of an interlock between a dirty buffer write
and an outstanding WAL write. Dirty buffer writes are not replicated, so
there is no need to have a similar interlock on WAL streaming.
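
The interlock described above is the usual WAL-before-data rule. A minimal
sketch of it, assuming hypothetical types and function names (these are
stand-ins for illustration, not PostgreSQL's actual buffer manager or
XLogFlush() code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not PostgreSQL's real types. */
typedef uint64_t Lsn;

typedef struct {
    Lsn flushed_upto;   /* WAL is durable on local disk up to this LSN */
    int flush_calls;    /* how many times a flush actually had to happen */
} WalState;

typedef struct {
    Lsn  page_lsn;      /* LSN of the last WAL record that touched the page */
    bool dirty;
} Buffer;

/* Force WAL to disk up to 'upto' unless it is already durable that far. */
void wal_flush(WalState *wal, Lsn upto)
{
    if (wal->flushed_upto < upto) {
        wal->flushed_upto = upto;
        wal->flush_calls++;
    }
}

/* WAL-before-data: a dirty page may be written only after its WAL is on disk. */
void flush_buffer(WalState *wal, Buffer *buf)
{
    if (!buf->dirty)
        return;
    wal_flush(wal, buf->page_lsn);
    buf->dirty = false;   /* the data page write itself would happen here */
}
```

The flush here is purely local. Since the dirty page itself is never sent
to the standby, nothing in this path has to wait for WAL streaming.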

So making those call points synchronous is possible, but neither
necessary nor IMHO desirable.

On a related but different point: We don't need an interlock between
dirty buffers and WAL during recovery because the WAL has already been
written.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-18 Thread Fujii Masao

Hi,

On Thu, Dec 18, 2008 at 6:35 PM, Simon Riggs  wrote:
>
> On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote:
>
>> >> Agreed, I also think that hard code is better. But I'm nervous that "off"
>> >> keeps us waiting for replication in cases other than DDL, e.g. flush
>> >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
>> >> is quite similar to synchronous_commit = off. If we would hard code #4,
>> >> the performance might degrade although it's asynchronous replication.
>> >> So, I'd like to hard code #3. What is your opinion?
>> >
>> > We don't do that when we flush buffer, truncate clog or checkpoint, not
>> > sure why you mention those.
>> >
>> > We ForceSyncCommit when we
>> > * VACUUM FULL
>> > * CREATE/DROP DATABASE or USER
>> > * Create/Drop Tablespace
>> >
>> > I don't see a problem in forcing an fsync for those. I will sleep safer
>> > knowing those guys are on disk even in async mode.
>>
>> If my understanding is correct, XLOG flush is forced up to buffer's LSN
>> when flushing buffer even if asynchronous commit case. Am I missing
>> something?
>
> Yes, please check the call points for ForceSyncCommit.
>
> Do I think every xlog flush should be synchronous, no, I don't. That's
> why we have a user settable parameter for it.

Umm.. I'm focusing on the XLogFlush() calls made outside
RecordTransactionCommit(), for example in FlushBuffer(),
WriteTruncateXlogRec(), etc. These XLogFlush() calls might flush XLOG
synchronously even in the asynchronous commit case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-18 Thread Simon Riggs


On Thu, 2008-12-18 at 12:08 +0900, Fujii Masao wrote:

> >> Agreed, I also think that hard code is better. But I'm nervous that "off"
> >> keeps us waiting for replication in cases other than DDL, e.g. flush
> >> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
> >> is quite similar to synchronous_commit = off. If we would hard code #4,
> >> the performance might degrade although it's asynchronous replication.
> >> So, I'd like to hard code #3. What is your opinion?
> >
> > We don't do that when we flush buffer, truncate clog or checkpoint, not
> > sure why you mention those.
> >
> > We ForceSyncCommit when we
> > * VACUUM FULL
> > * CREATE/DROP DATABASE or USER
> > * Create/Drop Tablespace
> >
> > I don't see a problem in forcing an fsync for those. I will sleep safer
> > knowing those guys are on disk even in async mode.
> 
> If my understanding is correct, XLOG flush is forced up to buffer's LSN
> when flushing buffer even if asynchronous commit case. Am I missing
> something?

Yes, please check the call points for ForceSyncCommit.

Do I think every xlog flush should be synchronous, no, I don't. That's
why we have a user settable parameter for it.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Fujii Masao

Hi,

On Thu, Dec 18, 2008 at 11:19 AM, Simon Riggs  wrote:
>
> On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote:
>> Hi,
>>
>> Thanks for the helpful comments!
>>
>> On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs  wrote:
>> >
>> > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:
>> >
>> >> OK. I will extend synchronous_replication, make walsender send XLOG
>> >> with synchronization mode flag and make walreceiver perform according
>> >> to the flag.
>> >
>> > Sounds good.
>> >
>> >> > My perspective is that synchronous_replication specifies how long to
>> >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait
>> >> > until point #3). So I think we should change this to a list of options
>> >> > to allow people to more carefully select how much waiting is required.
>> >>
>> >> In the latest patch, "off" keeps us waiting for replication in some
>> >> cases, e.g. forceSyncCommit = true. This is analogous to the way
>> >> synchronous_commit works. When "off" keeps us waiting for
>> >> replication, which option (#1-#6) should we choose? Should it be
>> >> user-configurable (though the parameter values are doubled)?
>> >> hardcode #3? "off" always should not keep us waiting for
>> >> replication?
>> >
>> > I would hard code #4, i.e. make it fsync, so that DDL changes are
>> > regarded as "high value transactions".
>> >
>> > A parameter sounds like overkill. We'd need to explain what
>> > forceSyncCommit does to users then, which is easier to avoid.
>>
>> Agreed, I also think that hard code is better. But I'm nervous that "off"
>> keeps us waiting for replication in cases other than DDL, e.g. flush
>> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
>> is quite similar to synchronous_commit = off. If we would hard code #4,
>> the performance might degrade although it's asynchronous replication.
>> So, I'd like to hard code #3. What is your opinion?
>
> We don't do that when we flush buffer, truncate clog or checkpoint, not
> sure why you mention those.
>
> We ForceSyncCommit when we
> * VACUUM FULL
> * CREATE/DROP DATABASE or USER
> * Create/Drop Tablespace
>
> I don't see a problem in forcing an fsync for those. I will sleep safer
> knowing those guys are on disk even in async mode.

If my understanding is correct, an XLOG flush is forced up to the buffer's
LSN when a buffer is flushed, even in the asynchronous commit case. Am I
missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Simon Riggs


On Thu, 2008-12-18 at 11:03 +0900, Fujii Masao wrote:
> Hi,
> 
> Thanks for the helpful comments!
> 
> On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs  wrote:
> >
> > On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:
> >
> >> OK. I will extend synchronous_replication, make walsender send XLOG
> >> with synchronization mode flag and make walreceiver perform according
> >> to the flag.
> >
> > Sounds good.
> >
> >> > My perspective is that synchronous_replication specifies how long to
> >> > wait. Current settings are "off" (don't wait) or "on" (meaning wait
> >> > until point #3). So I think we should change this to a list of options
> >> > to allow people to more carefully select how much waiting is required.
> >>
> >> In the latest patch, "off" keeps us waiting for replication in some
> >> cases, e.g. forceSyncCommit = true. This is analogous to the way
> >> synchronous_commit works. When "off" keeps us waiting for
> >> replication, which option (#1-#6) should we choose? Should it be
> >> user-configurable (though the parameter values are doubled)?
> >> hardcode #3? "off" always should not keep us waiting for
> >> replication?
> >
> > I would hard code #4, i.e. make it fsync, so that DDL changes are
> > regarded as "high value transactions".
> >
> > A parameter sounds like overkill. We'd need to explain what
> > forceSyncCommit does to users then, which is easier to avoid.
> 
> Agreed, I also think that hard code is better. But I'm nervous that "off"
> keeps us waiting for replication in cases other than DDL, e.g. flush
> buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
> is quite similar to synchronous_commit = off. If we would hard code #4,
> the performance might degrade although it's asynchronous replication.
> So, I'd like to hard code #3. What is your opinion?

We don't do that when we flush buffer, truncate clog or checkpoint, not
sure why you mention those.

We ForceSyncCommit when we
* VACUUM FULL
* CREATE/DROP DATABASE or USER
* Create/Drop Tablespace

I don't see a problem in forcing an fsync for those. I will sleep safer
knowing those guys are on disk even in async mode.
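
A rough sketch of the rule proposed here: when forceSyncCommit is set, the
commit waits for the standby's fsync (point #4) even if the session asked
for no waiting. The enum and function names are hypothetical, and whether
#4 or #3 should be hard coded is still being discussed in this thread:

```c
#include <stdbool.h>

/* Hypothetical names for the wait levels discussed in this thread. */
typedef enum {
    REPL_NO_WAIT = 0,      /* "off": do not wait for the standby */
    REPL_WAIT_WRITE = 3,   /* point #3: standby wrote the WAL */
    REPL_WAIT_FSYNC = 4    /* point #4: standby fsync'd the WAL */
} ReplWait;

/*
 * Level the commit actually waits for. A forced-sync commit (e.g. one
 * from CREATE/DROP DATABASE) is raised to the fsync level; anything else
 * keeps the level the session requested.
 */
ReplWait effective_wait(ReplWait requested, bool force_sync_commit)
{
    if (force_sync_commit && requested < REPL_WAIT_FSYNC)
        return REPL_WAIT_FSYNC;
    return requested;
}
```

With this rule an ordinary transaction run with the setting "off" never
waits, so only the few forced-sync commands pay the extra latency.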

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Fujii Masao

Hi,

Thanks for the helpful comments!

On Wed, Dec 17, 2008 at 8:50 PM, Simon Riggs  wrote:
>
> On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:
>
>> OK. I will extend synchronous_replication, make walsender send XLOG
>> with synchronization mode flag and make walreceiver perform according
>> to the flag.
>
> Sounds good.
>
>> > My perspective is that synchronous_replication specifies how long to
>> > wait. Current settings are "off" (don't wait) or "on" (meaning wait
>> > until point #3). So I think we should change this to a list of options
>> > to allow people to more carefully select how much waiting is required.
>>
>> In the latest patch, "off" keeps us waiting for replication in some
>> cases, e.g. forceSyncCommit = true. This is analogous to the way
>> synchronous_commit works. When "off" keeps us waiting for
>> replication, which option (#1-#6) should we choose? Should it be
>> user-configurable (though the parameter values are doubled)?
>> hardcode #3? "off" always should not keep us waiting for
>> replication?
>
> I would hard code #4, i.e. make it fsync, so that DDL changes are
> regarded as "high value transactions".
>
> A parameter sounds like overkill. We'd need to explain what
> forceSyncCommit does to users then, which is easier to avoid.

Agreed, I also think that hard code is better. But I'm nervous that "off"
keeps us waiting for replication in cases other than DDL, e.g. flush
buffer, truncate clog, checkpoint.. etc. synchronous_replication = off
is quite similar to synchronous_commit = off. If we would hard code #4,
the performance might degrade although it's asynchronous replication.
So, I'd like to hard code #3. What is your opinion?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-17 Thread Simon Riggs


On Wed, 2008-12-17 at 12:07 +0900, Fujii Masao wrote:

> OK. I will extend synchronous_replication, make walsender send XLOG
> with synchronization mode flag and make walreceiver perform according
> to the flag.

Sounds good.

> > My perspective is that synchronous_replication specifies how long to
> > wait. Current settings are "off" (don't wait) or "on" (meaning wait
> > until point #3). So I think we should change this to a list of options
> > to allow people to more carefully select how much waiting is required.
> 
> In the latest patch, "off" keeps us waiting for replication in some
> cases, e.g. forceSyncCommit = true. This is analogous to the way
> synchronous_commit works. When "off" keeps us waiting for
> replication, which option (#1-#6) should we choose? Should it be
> user-configurable (though the parameter values are doubled)?
> hardcode #3? "off" always should not keep us waiting for
> replication?

I would hard code #4, i.e. make it fsync, so that DDL changes are
regarded as "high value transactions".

A parameter sounds like overkill. We'd need to explain what
forceSyncCommit does to users then, which is easier to avoid.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-16 Thread Fujii Masao

Hi,

On Tue, Dec 16, 2008 at 7:21 PM, Simon Riggs  wrote:
>
> On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:
>
>> > So from my previous list
>> >
>> > 1. We sent the message to standby (A)
>> > 2. We received the message on standby
>> > 3. We wrote the WAL to the WAL file (B)
>> > 4. We fsync'd the WAL file (C)
>> > 5. We CRC checked the WAL commit record
>> > 6. We applied the WAL commit record
>> >
>> > Please could you also add an option #4, i.e. add the *option* to fsync
>> > the WAL to disk at commit time also. That requires us to add a third
>> > option to synchronous_replication parameter.
>>
>> The above option should be configured on the primary? or standby?
>> The primary is suitable to vary it from transaction to transaction. On
>> the other hand, it should be configured on the standby in order to
>> choose it for every standby (in the future).
>>
>> I prefer the latter, and thought that it should be added into recovery.conf.
>> I mean, synchronous_replication identifies only whether commit waits for
>> replication (if the name is confusing, I would rename it). The above
>> options (#1-#6) are chosen in recovery.conf. What is your opinion?
>
> No, we've been through that loop already a few months back:
> Transaction-controlled robustness.
>
> It should be up to the client on the primary to decide how much waiting
> they would like to perform in order to provide a guarantee. A change of
> setting on the standby should not be allowed to alter the performance or
> durability on the primary.

OK. I will extend synchronous_replication, make walsender send XLOG
with synchronization mode flag and make walreceiver perform according
to the flag.
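
A minimal sketch of that idea, assuming a hypothetical message header and
receiver state (none of these names come from the patch): the primary tags
each chunk of XLOG with the mode its transaction asked for, and the
receiver decides from the tag whether to fsync and whether to acknowledge.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mode flag carried with each chunk of XLOG. */
typedef enum { MODE_ASYNC, MODE_WRITE, MODE_FSYNC } SyncMode;

typedef struct {
    SyncMode mode;
    size_t   len;        /* payload length in bytes */
} XlogMsgHeader;

typedef struct {
    size_t written;      /* bytes handed to the OS */
    size_t synced;       /* bytes known to be on disk */
    int    acks_sent;    /* acknowledgements returned to the primary */
} ReceiverState;

/* Handle one message; returns true if an acknowledgement was sent. */
bool receive_xlog(ReceiverState *rs, const XlogMsgHeader *hdr)
{
    rs->written += hdr->len;          /* stands in for write() */

    switch (hdr->mode) {
    case MODE_ASYNC:
        return false;                 /* primary is not waiting for us */
    case MODE_FSYNC:
        rs->synced = rs->written;     /* stands in for fsync() */
        break;
    case MODE_WRITE:
        break;                        /* acknowledge after the write only */
    }
    rs->acks_sent++;
    return true;
}
```

Because the mode travels with the data, the choice stays with the client
on the primary and no standby-side setting is involved.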

>
> My perspective is that synchronous_replication specifies how long to
> wait. Current settings are "off" (don't wait) or "on" (meaning wait
> until point #3). So I think we should change this to a list of options
> to allow people to more carefully select how much waiting is required.

In the latest patch, "off" keeps us waiting for replication in some
cases, e.g. forceSyncCommit = true. This is analogous to the way
synchronous_commit works. When "off" keeps us waiting for
replication, which option (#1-#6) should we choose? Should it be
user-configurable (though the parameter values are doubled)?
hardcode #3? "off" always should not keep us waiting for
replication?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-16 Thread Simon Riggs

On Tue, 2008-12-16 at 12:36 +0900, Fujii Masao wrote:

> > So from my previous list
> >
> > 1. We sent the message to standby (A)
> > 2. We received the message on standby
> > 3. We wrote the WAL to the WAL file (B)
> > 4. We fsync'd the WAL file (C)
> > 5. We CRC checked the WAL commit record
> > 6. We applied the WAL commit record
> >
> > Please could you also add an option #4, i.e. add the *option* to fsync
> > the WAL to disk at commit time also. That requires us to add a third
> > option to synchronous_replication parameter.
> 
> The above option should be configured on the primary? or standby?
> The primary is suitable to vary it from transaction to transaction. On
> the other hand, it should be configured on the standby in order to
> choose it for every standby (in the future).
> 
> I prefer the latter, and thought that it should be added into recovery.conf.
> I mean, synchronous_replication identifies only whether commit waits for
> replication (if the name is confusing, I would rename it). The above
> options (#1-#6) are chosen in recovery.conf. What is your opinion?

No, we've been through that loop already a few months back:
Transaction-controlled robustness.

It should be up to the client on the primary to decide how much waiting
they would like to perform in order to provide a guarantee. A change of
setting on the standby should not be allowed to alter the performance or
durability on the primary.

My perspective is that synchronous_replication specifies how long to
wait. Current settings are "off" (don't wait) or "on" (meaning wait
until point #3). So I think we should change this to a list of options
to allow people to more carefully select how much waiting is required.

This feature is then analogous to the way synchronous_commit works. It
also provides a level of application control not seen in any other RDBMS
in the industry, which makes it very suitable for large and important
applications that need a fine mix of robustness and performance.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Fujii Masao

Hi,

Sorry for this late reply. And, thanks for the hot discussion ;)

On Tue, Dec 16, 2008 at 1:24 AM, Simon Riggs  wrote:
>
> Fujii-san,
>
> Just repeating this in case you lost this comment:
>
> On Mon, 2008-12-15 at 09:40 +, Simon Riggs wrote:
>
>> Fujii-san, please can we incorporate those two options, rather than just
>> one choice "synchronous_replication = on". They look like two commonly
>> requested options.
>
> I see the comment in line 230+ of walreceiver.c, so understand that you
> have implemented option #3 from the following list.
>
> So from my previous list
>
> 1. We sent the message to standby (A)
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file (B)
> 4. We fsync'd the WAL file (C)
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record
>
> Please could you also add an option #4, i.e. add the *option* to fsync
> the WAL to disk at commit time also. That requires us to add a third
> option to synchronous_replication parameter.

The above option should be configured on the primary? or standby?
The primary is suitable to vary it from transaction to transaction. On
the other hand, it should be configured on the standby in order to
choose it for every standby (in the future).

I prefer the latter, and thought that it should be added into recovery.conf.
I mean, synchronous_replication identifies only whether commit waits for
replication (if the name is confusing, I would rename it). The above
options (#1-#6) are chosen in recovery.conf. What is your opinion?

>> #6 is an additional synchronization step in Hot Standby. I would say
>> that people won't want that when they see how it performs (they probably
>> won't want #4 either for that same reason, but that is for robustness).
>
> We can jointly add option #6 once we have both sync rep and hot standby
> committed, or at a late stage of hot standby development. There's not
> much point looking at it before then.

Agreed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Mon, 2008-12-15 at 13:06 -0800, Josh Berkus wrote:
> Peter Eisentraut wrote:
> > Simon Riggs wrote:
> >> I am truly lost to understand why the *name* "synchronous replication"
> >> causes so much discussion, yet nobody has discussed what they would
> >> actually like the software to *do*
> > 
> > It's the color of the bikeshed ...
> 
> Hmmm.  I thought this was pretty clear.  There's three levels of synch 
> which are useful features:
> 
> 1) "synchronous" standby which is really asynchronous, but only has a gap 
> of < 100ms.
> 
> 2) Synchronous standby which guarantees that all committed transactions 
> are on the failover node and that no data will be lost for failover, but 
> the failover node is still in standby mode.
> 
> 3) Synchronous replication where the standby node has identical 
> transactions to the master node, and is queryable read-only.

> Any of these levels would be useful and allow a certain number of our 
> users to deploy PostgreSQL in an environment where it wasn't used 
> before.  So if we can only do (2) for 8.4, that's still very useful for 
> telecoms and banks.

The (2) mentioned here could be any of sync points #2-5 referred to
upthread. Different people have requested different levels of
robustness. Looking at DRBD and Oracle, they both subdivide (2) into at
least two further levels of option. So (2) is too broad a brush to paint
with.

I don't believe that (2) as stated is sufficient for banks, though is
reasonable for many telco applications. But #4 or #5 would be suitable
for banks, i.e. we must fsync to disk for very high value transactions.

The extra code to do this is minor, which is why I've asked Fujii-san to
include it now within the patch.

All of this is controllable by the parameter synchronous_replication,
which it is important to note can be set for each individual transaction
rather than just fixed for the whole server. This is identical to the
way we can mix synchronous commit and asynchronous commit transactions.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Josh Berkus


Simon,

> I've explained this twice now on different parts of this thread. Could I
> politely direct your attention to those posts?

Chill.  I was just explaining that the *goal* of sync standby was not 
complicated or really something to be argued about.  It's pretty clear.


--Josh



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Mon, 2008-12-15 at 13:43 -0800, Josh Berkus wrote:
> > Isn't the "queryable read-only" feature totally orthogonal with
> > how synchronous the replication is?
> 
> Yes.  However, it introduces specific difficult issues which an 
> unreadable synchronous slave does not have.

Don't think it's hugely difficult, but there are multiple ways of doing
this. But it is irrelevant until we have the basic ability to run
queries.

I've explained this twice now on different parts of this thread. Could I
politely direct your attention to those posts?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Josh Berkus




> Isn't the "queryable read-only" feature totally orthogonal with
> how synchronous the replication is?

Yes.  However, it introduces specific difficult issues which an 
unreadable synchronous slave does not have.


--Josh


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Ron Mayer


Josh Berkus wrote:
> Hmmm.  I thought this was pretty clear.  There's three levels of synch
> which are useful features:
>
> 1) "synchronous" standby which is really asynchronous, but only has a gap
> of < 100ms.
>
> 2) Synchronous standby which guarantees that all committed transactions
> are on the failover node and that no data will be lost for failover, but
> the failover node is still in standby mode.
>
> 3) Synchronous replication where the standby node has identical
> transactions to the master node, and is queryable read-only.
>
> Any of these levels would be useful

Isn't the "queryable read-only" feature totally orthogonal with
how synchronous the replication is?

For one reporting system I have, where new data is continually
being added every second, I'd love to have a read-only slave
even if that system has the "100ms" gap you mentioned in #1.

Heck, I don't care if the queries it runs even have a 100 *minute*
gap; but I sure would like it to be synchronous in the sense
that all the transactions survive a failure of the primary.




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Josh Berkus


Peter Eisentraut wrote:
> Simon Riggs wrote:
>> I am truly lost to understand why the *name* "synchronous replication"
>> causes so much discussion, yet nobody has discussed what they would
>> actually like the software to *do*
>
> It's the color of the bikeshed ...

Hmmm.  I thought this was pretty clear.  There's three levels of synch
which are useful features:

1) "synchronous" standby which is really asynchronous, but only has a gap
of < 100ms.

2) Synchronous standby which guarantees that all committed transactions
are on the failover node and that no data will be lost for failover, but
the failover node is still in standby mode.

3) Synchronous replication where the standby node has identical
transactions to the master node, and is queryable read-only.

Any of these levels would be useful and allow a certain number of our
users to deploy PostgreSQL in an environment where it wasn't used
before.  So if we can only do (2) for 8.4, that's still very useful for
telecoms and banks.


--Josh



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Jeff Davis

On Mon, 2008-12-15 at 09:19 -0500, Robert Haas wrote:
> I understand your point, but I think there's still a use case.   The
> idea is that declaring the secondary dead is a rare event, and there's
> some mechanism by which you're enabled to page your network staff, and
> they hightail it into the office to fix the problem.  It might not be
> the way that you want to run your system, but I don't think it's
> unreasonable for someone else to want it.
> 

Agreed: there's an analogy to RAID here. When a disk goes out, it still
allows writes, but moves to a degraded state. Hopefully your monitoring
system notifies you, and you fix it.

Also, let's say that the standby suffers catastrophic storage failure.
Now you only have your data on one server anyway (the primary).
Rejecting new transactions from committing doesn't save all the old
transactions in the event of a subsequent storage failure on the
primary.

I'm not advocating this option in particular, other than saying that it
seems like a reasonable option to me.

Regards,
Jeff Davis


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

Fujii-san,

Just repeating this in case you lost this comment:

On Mon, 2008-12-15 at 09:40 +, Simon Riggs wrote:

> Fujii-san, please can we incorporate those two options, rather than just
> one choice "synchronous_replication = on". They look like two commonly
> requested options.

I see the comment in line 230+ of walreceiver.c, so understand that you
have implemented option #3 from the following list.

So from my previous list

1. We sent the message to standby (A)
2. We received the message on standby
3. We wrote the WAL to the WAL file (B)
4. We fsync'd the WAL file (C)
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Please could you also add an option #4, i.e. add the *option* to fsync
the WAL to disk at commit time also. That requires us to add a third
option to synchronous_replication parameter.

That then means we will have robustness options that map directly to
DRBD algorithms A, B and C (shown in brackets in the above list). I
believe these map also to Data Guard options Maximum Performance and
Maximum Availability.
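
The list and its DRBD mapping can be written down as a small sketch. The
enum and function names are hypothetical, and it assumes each point is
reached only after all earlier ones, so a later point implies the earlier
guarantees:

```c
#include <stdbool.h>

/* The six points from the list above, in the order the standby reaches them. */
typedef enum {
    SYNC_SENT = 1,       /* 1. message sent to standby */
    SYNC_RECEIVED,       /* 2. message received on standby */
    SYNC_WRITTEN,        /* 3. WAL written to the WAL file */
    SYNC_FSYNCED,        /* 4. WAL file fsync'd */
    SYNC_CRC_CHECKED,    /* 5. WAL commit record CRC checked */
    SYNC_APPLIED         /* 6. WAL commit record applied */
} SyncPoint;

/* DRBD protocol letter for a point, or '-' if the list gives no equivalent. */
char drbd_protocol(SyncPoint p)
{
    switch (p) {
    case SYNC_SENT:    return 'A';
    case SYNC_WRITTEN: return 'B';
    case SYNC_FSYNCED: return 'C';
    default:           return '-';
    }
}

/* True once the standby has progressed at least as far as the commit wants. */
bool sync_point_satisfied(SyncPoint reached, SyncPoint wanted)
{
    return reached >= wanted;
}
```

For example, a commit that asked for point #3 is satisfied once the
standby reports #4, but not while it has only reached #2.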

AFAICS if we implement the additional items I've requested over the last
few days, then the architecture is now at a good point for 8.4 and we
can begin to look at low level implementation details. Or put another
way, I'm not expecting to come up with more architecture changes.

> #6 is an additional synchronization step in Hot Standby. I would say
> that people won't want that when they see how it performs (they probably
> won't want #4 either for that same reason, but that is for robustness).

We can jointly add option #6 once we have both sync rep and hot standby
committed, or at a late stage of hot standby development. There's not
much point looking at it before then.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Robert Haas

> So you'd want all commits to wait until the transaction is safely replicated
> in the standby. But if there's a network glitch, or the standby is
> restarted, you're happy to reply to the client that it's committed if it's
> only safely committed in the primary. Essentially, you wait for the reply as
> long as the standby responds within X seconds, but if it takes more than Y
> seconds, you don't wait. I know that people do that, but it seems
> counterintuitive to me. In that case, when the primary acks the transaction
> as committed, you only know that it's safely committed in the primary; it
> doesn't give any hard guarantee about the state in the standby.

I understand your point, but I think there's still a use case.  The
idea is that declaring the secondary dead is a rare event, and there's
some mechanism by which you can page your network staff, and
they hightail it into the office to fix the problem.  It might not be
the way that you want to run your system, but I don't think it's
unreasonable for someone else to want it.

> But when you consider the possibility to use the standby for queries, the
> synchronous mode makes sense too.
> I'm not opposed to providing all the options, but the synchronous mode where
> we can guarantee that if you query the standby, you will see the effects of
> all transactions committed in the primary, makes the synchronous mode much
> more interesting. If you don't need that property, you're most likely more
> happy with asynchronous mode anyway.

I agree that asynchronous mode will be the right solution for a very
large subset of our users.

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Aidan Van Dyk

* Robert Haas  [081215 07:32]:
> > In fact, waiting for reply from standby server before acknowledging a commit
> > to the client is a bit pointless otherwise. It puts you in a strange
> > situation, where you're waiting for the commits in normal operation, but if
> > there's a network glitch or the standby goes down, you're willing to go
> > ahead without it. You get a high guarantee that your data is up-to-date in
> > the standby, except when it isn't. Which isn't much of a guarantee.
> 
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration.  At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.

This was exactly my original point - I want the transaction durably on
the slave before the commit is acknowledged (to build as much local
redundancy as I can), but I certainly *don't* want to lose the ability
to use WAL archiving, because I ship my WAL off-site too...

The ability to have an extra local copy is good.  But I'm certainly not
going to want to give up my off-site backup/WAL for it...

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Heikki Linnakangas


Robert Haas wrote:
>> In fact, waiting for reply from standby server before acknowledging a commit
>> to the client is a bit pointless otherwise. It puts you in a strange
>> situation, where you're waiting for the commits in normal operation, but if
>> there's a network glitch or the standby goes down, you're willing to go
>> ahead without it. You get a high guarantee that your data is up-to-date in
>> the standby, except when it isn't. Which isn't much of a guarantee.
>
> It protects you against a catastrophic loss of the primary, which is a
> non-trivial consideration.  At the risk of being ghoulish, imagine
> that you are a large financial company headquartered in the world
> trade center.


So you'd want all commits to wait until the transaction is safely 
replicated in the standby. But if there's a network glitch, or the 
standby is restarted, you're happy to reply to the client that it's 
committed if it's only safely committed in the primary. Essentially, you 
wait for the reply as long as the standby responds within X seconds, but if
it takes more than Y seconds, you don't wait. I know that people do
that, but it seems counterintuitive to me. In that case, when the 
primary acks the transaction as committed, you only know that it's 
safely committed in the primary; it doesn't give any hard guarantee 
about the state in the standby.
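
The behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: wait_for_standby and the returned strings are hypothetical names, the local WAL flush is assumed to have happened already, and a threading.Event stands in for the standby's reply.

```python
import threading


def wait_for_standby(ack: threading.Event, timeout_s: float) -> str:
    """Called after the commit record is flushed locally (not modelled).

    Returns which guarantee the client actually gets."""
    if ack.wait(timeout_s):
        # The standby replied in time: the commit is on both nodes.
        return "primary+standby"
    # The standby did not answer in time: the primary goes ahead alone.
    return "primary-only"
```

The client sees a successful commit in both cases, which is exactly why the acknowledgement alone says nothing definite about the standby.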


But when you consider the possibility to use the standby for queries, 
the synchronous mode makes sense too.


I'm not opposed to providing all the options, but the synchronous mode 
where we can guarantee that if you query the standby, you will see the 
effects of all transactions committed in the primary, makes the 
synchronous mode much more interesting. If you don't need that property, 
you're most likely more happy with asynchronous mode anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Greg Stark

It's a real promise. The reason you're getting hand-wavy answers is  
because it's such a basic requirement that I'm trying to point out  
just how fundamental a requirement it is.


If you want to see the actual code which guarantees this take a look  
around the code for procarray - in particular the code for taking a  
snapshot. There are comments there about what locks are needed when  
committing and when taking a snapshot and why. But it's quite technical.


--
Greg


On 15 Dec 2008, at 02:03, Mark Mielke  wrote:


Greg Stark wrote:
When the database says the data is committed it has to mean the  
data is really committed. Imagine if you looked at a bank account  
balance after withdrawing all the money and saw a balance which  
didn't reflect the withdrawal and allowed you to withdraw more  
money again...


Within the same session - sure. From different sessions? PostgreSQL
MVCC lets you see an older snapshot, although it does prefer to
have the latest snapshot with each command.


For allowing to withdraw more money again, I would expect some sort  
of locking "SELECT ... FOR UPDATE;" to be used. This lock then  
forces the two transactions to become serialized and the second will  
either wait for the first to complete or fail. Any banking program  
that assumed that it could SELECT to confirm a balance and then  
UPDATE to withdraw the money as separate instructions would be a bad  
banking program. To exploit it, I would just have to start both  
operations at the same time - they both SELECT, they both see I have  
money, they both give me the money and UPDATE, and I get double the  
money (although my balance would show a big negative value - but I'm  
already gone...). Database 101.


When I asked for "does PostgreSQL guarantee this?" I didn't mean  
hand waving examples or hand waving expectations. I meant a pointer  
into the code that has some comment that says "we want to guarantee  
that a commit in one session will be immediately visible to other  
sessions, and that a later select issued in the other sessions will  
ALWAYS see the commit whether 1 nanosecond later or 200 seconds  
later" Robert's expectation and yours seem like taking this  
"guarantee" for granted rather than being justified with design  
intent and proof thus far. :-) Given my experiment to try and force  
it to fail, I can see why this would be taken for granted. Is this a  
real promise, though? Or just an unlikely scenario that never seems
to be hit?


To me, the question is relevant in terms of the expectations of a  
multi-replica solution. We know people have the expectation. We know  
it can be convenient. Is the expectation valid in the first place?


I've probably drawn this question out too long and should do my own  
research and report back... Sorry... :-)


Cheers,
mark

--
Mark Mielke 




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Robert Haas

> In fact, waiting for reply from standby server before acknowledging a commit
> to the client is a bit pointless otherwise. It puts you in a strange
> situation, where you're waiting for the commits in normal operation, but if
> there's a network glitch or the standby goes down, you're willing to go
> ahead without it. You get a high guarantee that your data is up-to-date in
> the standby, except when it isn't. Which isn't much of a guarantee.

It protects you against a catastrophic loss of the primary, which is a
non-trivial consideration.  At the risk of being ghoulish, imagine
that you are a large financial company headquartered in the world
trade center.

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Peter Eisentraut


Simon Riggs wrote:
> I am truly lost to understand why the *name* "synchronous replication"
> causes so much discussion, yet nobody has discussed what they would
> actually like the software to *do*

It's the color of the bikeshed ...

> We can make the reply to a commit message when any of the following
> events have occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record


In DRBD tradition, I suggest you implement all of them, or at least 
factor the code so that each of them can be a one line change.  (We can 
probably later drop one or two options.)



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Sun, 2008-12-14 at 21:41 -0500, Robert Haas wrote:
> > If this is right, #2, #3, #4, and #6 feel similar except
> > that they're protecting against failures of different (but
> > still all incomplete) subsets of the hardware on the slave, right?
> 
> Right.  Actually, the biggest difference with #6 has nothing to do
> with protecting against failures.  It has rather to do with the ease
> of writing applications in the context of hot standby.  You can close
> your connection, open a connection to a different server, and know
> that your transactions will be reflected there.  On the other hand,
> I'd be surprised if it didn't come with a substantial performance
> penalty, so it may not be too practical in real life even if it sounds
> good on paper.
> 
> #1 , #3, and #5 don't feel that useful to me. 

Yes, looks that way for me also.

Good analysis Ron. I agree with Robert that #6 is there for other
reasons.

#2 corresponds to DRBD algorithm B

#4 corresponds to DRBD algorithm C

Fujii-san, please can we incorporate those two options, rather than just
one choice "synchronous_replication = on". They look like two commonly
requested options.

#6 is an additional synchronization step in Hot Standby. I would say
that people won't want that when they see how it performs (they probably
won't want #4 either for that same reason, but that is for robustness).

Also, I would point out that the class of synch_rep is selected by the
user on the primary and can vary from transaction to transaction. That
is a very good thing, as far as I am concerned. We would need to enforce
#6 for all transactions (if we implemented synchronisation in this way).

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Simon Riggs

On Sun, 2008-12-14 at 12:57 -0500, Mark Mielke wrote:

> I'm curious about your suggestion to direct queries that need the
> latest 
> snapshot to the 'primary'. I might have misunderstood it - but it
> seems 
> that the expectation from some is that *all* sessions see the latest 
> snapshot, so would this not imply that all sessions would be redirect
> to 
> the 'primary'? I don't think it is reasonable myself, but I might be 
> misunderstanding something...

I said "a snapshot taken on the primary", but the query would run on the
standby.

Synchronising primary and standby so that they are identical from the
perspective of a query requires some synchronisation delay. I'm pointing
out that the synchronisation delay can occur 

* at the time we apply WAL - which will slow down commits (i.e. #6 on my
previous list of options)
* at the time we run a query that needs to see primary and standby
synchronised

So the same effect can be achieved in various ways.

The first way would require *all* transactions to be applied on standby,
i.e. option #6 for all transactions. That is a performance disaster and
I would not force that onto everybody.

The second way can be done by taking a snapshot on the primary, with an
associated LSN, then using that snapshot on the standby. That is
somewhat complex, but possible. I see the requirement for getting the
same answer on multiple nodes as a further extension of "transaction
isolation mode" and think that not all people will want this, so we
should allow that as an option.
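
As a rough sketch of that second way, in Python and purely for illustration: PrimarySnapshot and ToyStandby are hypothetical names, LSNs are modelled as plain integers, and how the snapshot would actually be shipped to the standby is not addressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimarySnapshot:
    lsn: int                  # WAL position on the primary when the snapshot was taken
    running_xids: frozenset   # transactions still in progress at that moment


class ToyStandby:
    def __init__(self) -> None:
        self.replayed_lsn = 0

    def replay_up_to(self, lsn: int) -> None:
        # WAL replay only ever moves forward.
        self.replayed_lsn = max(self.replayed_lsn, lsn)

    def can_use(self, snap: PrimarySnapshot) -> bool:
        # The query must wait until replay has reached the snapshot's LSN.
        return self.replayed_lsn >= snap.lsn
```

Here the synchronisation delay is paid only by the queries that ask for it, rather than by every commit.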

I'm not going to worry about this at the moment. Hot standby will be
useful without this and so I regard this as a secondary objective. Rome
wasn't built in a single release, or something like that.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-15 Thread Heikki Linnakangas

Mark Mielke wrote:
> Where does the expectation come from? I don't recall ever reading it in
> the documentation, and unless the session processes are contending over
> the integers (using some sort of synchronization primitive) in memory
> that represent the "latest visible commit" on every single select, I'm
> wondering how it is accomplished?

The "integers" you're imagining are the ProcArray. Every backend has an 
entry there, and among other things it contains the current XID the 
backend is running. When a backend takes a new snapshot (on every single 
select in read committed mode), it locks the ProcArray, scans all the 
entries and collects all the XIDs listed there in the snapshot. That
is the set of transactions that were running when the snapshot was
taken, and it is used in the visibility checks.
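
A toy model of that, in Python, might look like the following. It is a sketch for illustration and not the real data structure: ToyProcArray and its methods are hypothetical names, and a single mutex stands in for the ProcArray lock.

```python
import threading


class ToyProcArray:
    def __init__(self) -> None:
        self._lock = threading.Lock()   # stands in for the ProcArray lock
        self._xid_by_backend = {}       # backend id -> XID it is currently running

    def set_xid(self, backend_id: int, xid: int) -> None:
        with self._lock:
            self._xid_by_backend[backend_id] = xid

    def clear_xid(self, backend_id: int) -> None:
        with self._lock:
            self._xid_by_backend.pop(backend_id, None)

    def take_snapshot(self) -> frozenset:
        # Lock, scan every entry, and copy out the running XIDs.
        with self._lock:
            return frozenset(self._xid_by_backend.values())
```

A snapshot is a copy, so it keeps listing an XID as running even after that backend's entry has been cleared; only a later snapshot sees the change.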

> If they are contending over these
> integers, doesn't that represent a scaling limitation, in the sense that
> on a 32-core machine, they're going to be fighting with each other to
> get the latest version of these shared integers into the CPU for
> processing? Maybe it's such a small penalty that we don't care? :-)

The ProcArrayLock is indeed quite busy on systems with a lot of CPUs. 
It's held for such short times that it's not a problem usually, but it 
can become a bottleneck with a machine like that with all backends 
running small transactions.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Heikki Linnakangas


Mark Mielke wrote:
> When I asked for "does PostgreSQL guarantee this?" I didn't mean hand
> waving examples or hand waving expectations. I meant a pointer into the
> code that has some comment that says "we want to guarantee that a commit
> in one session will be immediately visible to other sessions, and that a
> later select issued in the other sessions will ALWAYS see the commit
> whether 1 nanosecond later or 200 seconds later" Robert's expectation
> and yours seem like taking this "guarantee" for granted rather than
> being justified with design intent and proof thus far. :-) Given my
> experiment to try and force it to fail, I can see why this would be
> taken for granted. Is this a real promise, though?


Yes.

In a nutshell, commit works like this:

1. Write and flush WAL record about the commit
2. Mark the transaction as committed in clog
3. Remove the xid from the shared memory ProcArray.
4. Release locks and other resources
5. Reply to client that the transaction has been committed.

After step 3, any backend taking a snapshot will see the transaction as 
committed. Since we only reply to the client at step 5, it is guaranteed 
that a transaction beginning after step 5, as well as an already open 
transaction taking a new snapshot (ie. running a new command in read 
committed mode) after that will see the transaction as committed.


The relevant code is in CommitTransaction() in xact.c.
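
The ordering argument can be checked with a small Python model. This is an illustration only and not a rendering of CommitTransaction(): ToyServer and its methods are hypothetical, the WAL and clog are plain Python containers, and lock release (step 4) is not modelled.

```python
class ToyServer:
    def __init__(self) -> None:
        self.wal = []             # flushed WAL records
        self.clog = {}            # xid -> commit status
        self.proc_array = set()   # XIDs of running transactions

    def begin(self, xid: int) -> None:
        self.proc_array.add(xid)

    def take_snapshot(self) -> frozenset:
        return frozenset(self.proc_array)

    def commit(self, xid: int) -> str:
        self.wal.append(("commit", xid))   # 1. write and flush the WAL record
        self.clog[xid] = "committed"       # 2. mark committed in clog
        self.proc_array.discard(xid)       # 3. remove the xid from the ProcArray
        # 4. release locks and other resources (not modelled)
        return "COMMIT"                    # 5. reply to the client

    def is_visible(self, xid: int, snapshot: frozenset) -> bool:
        # Visible only if committed and not running when the snapshot was taken.
        return xid not in snapshot and self.clog.get(xid) == "committed"
```

Because the reply in step 5 comes after step 3, any snapshot taken after the client has seen the reply cannot list the transaction as still running.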

> To me, the question is relevant in terms of the expectations of a
> multi-replica solution. We know people have the expectation.


Yeah, I think Robert is right. We should reserve the term "synchronous 
replication" for the mode where that guarantee holds for the slave as well.


In fact, waiting for reply from standby server before acknowledging a 
commit to the client is a bit pointless otherwise. It puts you in a 
strange situation, where you're waiting for the commits in normal 
operation, but if there's a network glitch or the standby goes down, 
you're willing to go ahead without it. You get a high guarantee that 
your data is up-to-date in the standby, except when it isn't. Which 
isn't much of a guarantee.


But with hot standby, it makes a lot of sense. The guarantee is that if 
the standby is accepting queries, it's up-to-date with the primary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke


Greg Stark wrote:
When the database says the data is committed it has to mean the data 
is really committed. Imagine if you looked at a bank account balance 
after withdrawing all the money and saw a balance which didn't reflect 
the withdrawal and allowed you to withdraw more money again...


Within the same session - sure. From different sessions? PostgreSQL MVCC
lets you see an older snapshot, although it does prefer to have the
latest snapshot with each command.


For allowing to withdraw more money again, I would expect some sort of 
locking "SELECT ... FOR UPDATE;" to be used. This lock then forces the 
two transactions to become serialized and the second will either wait 
for the first to complete or fail. Any banking program that assumed that 
it could SELECT to confirm a balance and then UPDATE to withdraw the 
money as separate instructions would be a bad banking program. To 
exploit it, I would just have to start both operations at the same time 
- they both SELECT, they both see I have money, they both give me the 
money and UPDATE, and I get double the money (although my balance would 
show a big negative value - but I'm already gone...). Database 101.
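
The double withdrawal can be replayed deterministically in Python. This is a simulation for illustration, not SQL and not real locking: both function names are hypothetical, and the row lock taken by SELECT ... FOR UPDATE is modelled simply by having each session read the balance only after the previous session has finished.

```python
def withdraw_without_lock(balance: int, amount: int):
    """Both sessions SELECT before either one UPDATEs."""
    seen_a = balance   # session A: SELECT balance
    seen_b = balance   # session B: SELECT balance (A has not updated yet)
    paid_out = 0
    for seen in (seen_a, seen_b):
        if seen >= amount:        # each checks its own stale value
            balance -= amount     # UPDATE ... SET balance = balance - amount
            paid_out += amount
    return balance, paid_out


def withdraw_with_row_lock(balance: int, amount: int):
    """The second session blocks on the row lock, then reads the new value."""
    paid_out = 0
    for _session in ("A", "B"):
        seen = balance            # SELECT ... FOR UPDATE, after the other commit
        if seen >= amount:
            balance -= amount
            paid_out += amount
    return balance, paid_out
```

Starting from a balance of 100 and two withdrawals of 100, the first version pays out 200 and leaves the balance at -100, while the second pays out 100 and leaves it at 0.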


When I asked for "does PostgreSQL guarantee this?" I didn't mean hand 
waving examples or hand waving expectations. I meant a pointer into the 
code that has some comment that says "we want to guarantee that a commit 
in one session will be immediately visible to other sessions, and that a 
later select issued in the other sessions will ALWAYS see the commit 
whether 1 nanosecond later or 200 seconds later" Robert's expectation 
and yours seem like taking this "guarantee" for granted rather than 
being justified with design intent and proof thus far. :-) Given my 
experiment to try and force it to fail, I can see why this would be 
taken for granted. Is this a real promise, though? Or just an unlikely
scenario that never seems to be hit?


To me, the question is relevant in terms of the expectations of a 
multi-replica solution. We know people have the expectation. We know it 
can be convenient. Is the expectation valid in the first place?


I've probably drawn this question out too long and should do my own 
research and report back... Sorry... :-)


Cheers,
mark

--
Mark Mielke 



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Robert Haas

> If this is right, #2, #3, #4, and #6 feel similar except
> that they're protecting against failures of different (but
> still all incomplete) subsets of the hardware on the slave, right?

Right.  Actually, the biggest difference with #6 has nothing to do
with protecting against failures.  It has rather to do with the ease
of writing applications in the context of hot standby.  You can close
your connection, open a connection to a different server, and know
that your transactions will be reflected there.  On the other hand,
I'd be surprised if it didn't come with a substantial performance
penalty, so it may not be too practical in real life even if it sounds
good on paper.

#1 , #3, and #5 don't feel that useful to me.  In the case of #1,
sending your WAL over the network and then not checking that it got
there is sort of silly: the likelihood of packet loss on the network
has got to be several orders of magnitude more likely than a failure
on the master.  #3 and #5 just don't seem to provide any real benefits
over their immediate predecessors.

Honestly, I think the most useful thing is probably going to be
asynchronous replication: in other words, when a commit is requested
on the master, we write WAL and return success.  In the background, we
stream the WAL to a secondary, which writes it and applies it.  This
will give us a secondary which is mostly up to date (and can run
queries, with hot standby) without killing performance.  The other
options are going to be for environments where losing a transaction is
really, really bad, or (in the case of #6) read-mostly environments
where it's useful to spread the query load out across several servers,
but the overhead associated with waiting for the rare write
transactions to apply everywhere is tolerable.
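
For what it's worth, the asynchronous case is easy to sketch in Python. The class and function names below are hypothetical and the "network" is just a queue; the only point is that commit returns before anything reaches the secondary.

```python
from collections import deque


class AsyncPrimary:
    def __init__(self) -> None:
        self.wal = []
        self.unsent = deque()

    def commit(self, record: str) -> str:
        self.wal.append(record)      # write WAL locally
        self.unsent.append(record)   # queue it for the background sender
        return "COMMIT"              # return success without waiting


class AsyncStandby:
    def __init__(self) -> None:
        self.applied = []


def stream_once(primary: AsyncPrimary, standby: AsyncStandby) -> int:
    """One pass of the background sender; returns the number of records shipped."""
    shipped = 0
    while primary.unsent:
        standby.applied.append(primary.unsent.popleft())
        shipped += 1
    return shipped
```

Between a commit and the next sender pass the secondary is behind by whatever sits in the queue, which is the "mostly up to date" window mentioned above.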

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Greg Stark

When the database says the data is committed it has to mean the data  
is really committed. Imagine if you looked at a bank account balance  
after withdrawing all the money and saw a balance which didn't reflect  
the withdrawal and allowed you to withdraw more money again...


--
Greg


On 14 Dec 2008, at 14:44, Mark Mielke  wrote:


Mark Mielke wrote:
Forget replication - even for the exact same server - I don't  
expect that if I commit from one session, I will be able to see the  
change immediately from my other session or a new session that I  
just opened. Perhaps this is often stable to rely on this, and it  
is useful for the database server to minimize the window during  
which the commit becomes visible to others, but I think it's a  
false expectation from the start that it absolutely will be  
immediately visible to another session. I'm thinking of situations  
where some part of the table is in cache. The only way the commit  
can communicate that the new transaction is available is by during  
communication between the processes or threads, or between the  
multiple CPUs on the machine. Do I want every commit to force each  
session to become fully in alignment before my commit completes?  
Does PostgreSQL make this guarantee today? I bet it doesn't if you  
look far enough into the guts. It might be very fast - I don't  
think it is infinitely fast.


FYI: I haven't been able to prove this. Multiple sessions running on  
my dual-core CPU seem to be able to see the latest commits before  
they begin executing. Am I wrong about this? Does PostgreSQL provide
an intentional guarantee that a commit from one session that
completes immediately followed by a query from another session will
always find the commit effect visible (provided the transaction
isolation level doesn't get in the way)? Or is the machine and  
algorithms just fast enough that by the time it executes the query  
(up to 1 ms later) the commit is always visible in practice?


Cheers,
mark

--
Mark Mielke 



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke


Heikki Linnakangas wrote:
> Mark Mielke wrote:
>> FYI: I haven't been able to prove this. Multiple sessions running on
>> my dual-core CPU seem to be able to see the latest commits before
>> they begin executing. Am I wrong about this? Does PostgreSQL provide
>> an intentional guarantee that a commit from one session that completes
>> immediately followed by a query from another session will always find
>> the commit effect visible (provided the transaction isolation level
>> doesn't get in the way)?
>
> Yes. PostgreSQL does guarantee that, and I would expect any other DBMS
> to do the same.


Where does the expectation come from? I don't recall ever reading it in 
the documentation, and unless the session processes are contending over 
the integers (using some sort of synchronization primitive) in memory 
that represent the "latest visible commit" on every single select, I'm 
wondering how it is accomplished? If they are contending over these 
integers, doesn't that represent a scaling limitation, in the sense that 
on a 32-core machine, they're going to be fighting with each other to 
get the latest version of these shared integers into the CPU for 
processing? Maybe it's such a small penalty that we don't care? :-)


I was never instilled with the logic that 'commit in one session 
guarantees visibility of the effects in another session'. But, as I say 
above, I wasn't able to make PostgreSQL "fail" in this regard. So maybe 
I have no clue what I am talking about? :-)


If you happen to know where the code or documentation makes this 
promise, feel free to point it out. I'd like to review the code. If you 
don't know - don't worry about it, I'll find it later...


Cheers,
mark

--
Mark Mielke 



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Ron Mayer


Robert Haas wrote:

We can make the reply to a commit message when any of the following
events have occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record


Perhaps it'd be useful if the failure modes these are trying to
protect against were described too.

If I understand right.

1. Protects all the transactions from the failure of the
   master; so long as neither the network nor the slave
   machine die soon?

2. Protects all the transactions from the failure of the
   master and the network between the slave and master,
   so long as the slave doesn't die soon?

3. Same as #2?

4. Protects against the failure of the master, the network,
   and parts of the slave; so long as the slave's disk
   survives the failure?

5. Protects against all of the above, and bit-errors in the
   memories of the slave machine (except the slave's disk
   controller?)?   Or are we reading-back the CRC from the
   slave's disk and comparing to the CRC computed on the
   master where it might protect from even more?

6. Same as 4?

If this is right, #2, #3, #4, and #6 feel similar except
that they're protecting against failures of different (but
still all incomplete) subsets of the hardware on the slave, right?


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Dimitri Fontaine



Hi,

On 14 Dec 08, at 16:48, Simon Riggs wrote:

I am truly lost to understand why the *name* "synchronous replication"
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do* (this being a software discussion
list...). AFAICS we can make the software behave like *any* of the
definitions discussed so far.


It seems that the easy parts are the ones that the most people will
participate in. Maybe it's that simple.



We can make the reply to a commit message when any of the following
events have occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record


Ok, so let's talk about this easy part: my understanding of  
"synchronous replication" is that it gives to its users the strong  
guarantee that at commit time the transaction is secured to the  
slave(s). That means you get the D of ACID on more than one server.


Why synchronous? Because you know the durability is ensured exactly  
when you receive the COMMIT ack.


So I'm with Simon on this, the term Synchronous Replication does  
describe accurately what's being implemented here, and on the other  
hand, as so many of us are saying, it's true that it tells very little  
about it. Those 6 options are all in the scope of the infamous naming,  
just different guarantee levels, from almost strong to very strong,
with some "almost, but not quite, entirely unlike the strong I want".  
Pick your naming here too.


At least, that's how I understand this. The bottom line of why I
bothered sending this email is that maybe it'll help some people
recover from sleep deprivation ;)


My 2¢,
--
dim



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Heikki Linnakangas


Mark Mielke wrote:
> Mark Mielke wrote:
>> Forget replication - even for the exact same server - I don't expect
>> that if I commit from one session, I will be able to see the change
>> immediately from my other session or a new session that I just opened.
>> Perhaps this is often stable enough to rely on, and it is useful for the
>> database server to minimize the window during which the commit becomes
>> visible to others, but I think it's a false expectation from the start
>> that it absolutely will be immediately visible to another session. I'm
>> thinking of situations where some part of the table is in cache. The
>> only way the commit can communicate that the new transaction is
>> available is by communication between the processes or threads,
>> or between the multiple CPUs on the machine. Do I want every commit to
>> force each session to become fully in alignment before my commit
>> completes? Does PostgreSQL make this guarantee today? I bet it doesn't
>> if you look far enough into the guts. It might be very fast - I don't
>> think it is infinitely fast.
>
> FYI: I haven't been able to prove this. Multiple sessions running on my
> dual-core CPU seem to be able to see the latest commits before they
> begin executing. Am I wrong about this? Does PostgreSQL provide an
> intentional guarantee that a commit from one session that completes
> immediately followed by a query from another session will always find
> the commit effect visible (provided the transaction isolation level
> doesn't get in the way)?


Yes. PostgreSQL does guarantee that, and I would expect any other DBMS 
to do the same.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke


Mark Mielke wrote:
> Forget replication - even for the exact same server - I don't expect
> that if I commit from one session, I will be able to see the change
> immediately from my other session or a new session that I just opened.
> Perhaps this is often stable enough to rely on, and it is useful for the
> database server to minimize the window during which the commit becomes
> visible to others, but I think it's a false expectation from the start
> that it absolutely will be immediately visible to another session. I'm
> thinking of situations where some part of the table is in cache. The
> only way the commit can communicate that the new transaction is
> available is by communication between the processes or threads,
> or between the multiple CPUs on the machine. Do I want every commit to
> force each session to become fully in alignment before my commit
> completes? Does PostgreSQL make this guarantee today? I bet it doesn't
> if you look far enough into the guts. It might be very fast - I don't
> think it is infinitely fast.


FYI: I haven't been able to prove this. Multiple sessions running on my
dual-core CPU seem to be able to see the latest commits before they
begin executing. Am I wrong about this? Does PostgreSQL provide an
intentional guarantee that a commit from one session that completes
immediately followed by a query from another session will always find
the commit effect visible (provided the transaction isolation level
doesn't get in the way)? Or are the machine and algorithms just fast
enough that by the time it executes the query (up to 1 ms later) the
commit is always visible in practice?


Cheers,
mark

--
Mark Mielke 



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Robert Haas

> We can reply to a commit message when any of the following
> events has occurred
>
> 1. We sent the message to standby
> 2. We received the message on standby
> 3. We wrote the WAL to the WAL file
> 4. We fsync'd the WAL file
> 5. We CRC checked the WAL commit record
> 6. We applied the WAL commit record

Also

0. The same time we would have done so if replication had not been
configured at all.

I think the basic problem here is that we can talk about "asynchronous
replication" and "synchronous replication" but there are n>2
possible/useful behaviors (I would guess principally 0, 2, 4, and 6,
but YMMV).  So we're going to need some way to clarify what we mean.
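
To make the numbering concrete, here is a minimal sketch of the reply points listed above, including option 0. The enum names and the may_reply helper are hypothetical and are not taken from the patch; the sketch also assumes that reaching a later point implies all earlier ones, which the thread does not state explicitly.

```python
from enum import IntEnum


class AckPoint(IntEnum):
    """Hypothetical names for the points at which the primary could reply to COMMIT."""
    LOCAL_ONLY = 0    # same time as if replication had not been configured
    SENT = 1          # message sent to standby
    RECEIVED = 2      # message received on standby
    WRITTEN = 3       # WAL written to the WAL file
    FSYNCED = 4       # WAL file fsync'd
    CRC_CHECKED = 5   # WAL commit record CRC checked
    APPLIED = 6       # WAL commit record applied


def may_reply(configured: AckPoint, reached: AckPoint) -> bool:
    """True once progress has reached at least the configured point.

    Assumption: the points are strictly ordered, so a later point
    implies every earlier one.
    """
    return reached >= configured
```

With a single ordered setting like this, "asynchronous" and "synchronous" become two ends of one scale rather than two separate features.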

BTW, in case my previous emails on this topic might have given someone
the contrary impression, I'm not really that worked up about this
either.  Interesting?  Yes.  Have opinions?  Yes.  Lie awake nights
worrying about it?  Nope.  :-)

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke


Simon Riggs wrote:
> I am truly at a loss to understand why the *name* "synchronous replication"
> causes so much discussion, yet nobody has discussed what they would
> actually like the software to *do* (this being a software discussion
> list...). AFAICS we can make the software behave like *any* of the
> definitions discussed so far.


I think people have talked about 'like' in the context of user
expectations. That is, there seems to exist a set of people (probably
those who've never worked with a multi-replica solution before) who
expect that once commit completes on one server, they can query any
other master or slave and be guaranteed visibility of the transaction
they just committed. These people may theoretically change their
decision to not use Postgres-R, or at least change their approach to how
they work with Postgres-R, if the name was in some way more intuitive to
them in terms of what is actually being provided.


"Synchronous replication" itself only describes replication; it
does not say anything about visibility, so to some degree, people are
focusing on the wrong term as the problem. Even if it says "asynchronous 
replication" - not sure that I care either way - this doesn't improve 
the understanding for the casual user of what is happening behind the 
scenes. Neither synchronous nor asynchronous guarantees that the change 
will be immediately visible from other nodes after I type 'commit;'. 
Asynchronous might err on the side of not immediately visible, where 
synchronous might (incorrectly) imply immediate visibility, but it's not 
an accurate guarantee to provide.


Synchronous does not guarantee visibility immediately after. Some
indefinite but usually short time must normally pass from when my
'commit;' completes until when the shared memory visible to my process
"sees" the transaction. Multiple replicas with network latency or
reliability issues increase the theoretical minimum size of this window
to something that would normally be encountered, as opposed to something
that is normally not encountered.


The only way to guarantee visibility is to ensure that the new 
transaction is guaranteed to be visible from a shared memory perspective 
on every machine in the pool, and every active backend process. If my 
'commit;' is going to wait for this to occur, first, I think this forces 
every commit to have numerous network round trips to each machine in the 
pool, it forces each machine in the pool to be network accessible and 
responsive, it forces all commits to be serialized in the sense of "the 
slowest machine in the pool determines the time for my commit to 
complete", and I think it implies some sort of inter-process signalling, 
or at the very least CPU level signalling about shared memory (in the 
case of multiple CPUs).
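
The "slowest machine in the pool" point can be put as a back-of-the-envelope sketch. The function and parameter names are hypothetical, and it assumes all replicas are contacted in parallel and that every round trip to a given replica costs the same:

```python
def commit_wait_ms(replica_round_trips_ms, round_trips_per_commit=1):
    """Time a commit blocks if it must hear back from every replica.

    Assumes replicas are contacted in parallel, so the slowest replica
    sets the pace; an empty pool means nothing to wait for.
    """
    if not replica_round_trips_ms:
        return 0.0
    return round_trips_per_commit * max(replica_round_trips_ms)
```

Under these assumptions, adding one replica behind a 150 ms link makes every commit take at least 150 ms, however fast the other replicas are.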


People such as myself think that a visibility guarantee is unreasonable 
and certain to cause scalability or reliability problems. So, my 'like' 
is an efficient multi-master solution where if I put 10 machines in the 
pool, I expect my normal query/commit loads to approach 10X as fast. My 
like prefers scalability over guarantees that may be difficult to 
provide, and probably are not provided today even in a single server 
scenario.



> It is certainly far too early to say what the final exact behaviour will
> be and there is no reason at all to pre-suppose that it need only be a
> single behaviour. I'm in favour of options, generally, but I would say
> that the distinction between some of these options is mostly very fine
> and strongly doubt whether people would use them if they existed. *But*
> I think we can add them at a later stage of development if requirements
> genuinely exist once all the benefits *and* costs are understood.
  


The above 'commit;' behaviour difference - whether it completes when the 
commit is permanent (it definitely will be applied for certain to all 
replicas - it just may take time to apply to all replicas), or when the 
commit has actually taken effect (two-phase commit on all replicas - and 
both phases have completed on all replicas - what happens if second 
phase commit fails on one or more servers?), or when the commit is 
guaranteed to be visible from all existing and new sessions (two-phase 
commit plus additional signalling required?) might be such an option.


I'm doubtful, though - as the difference in implementation between the 
first and second is pretty significant.


I'm curious about your suggestion to direct queries that need the latest 
snapshot to the 'primary'. I might have misunderstood it - but it seems 
that the expectation from some is that *all* sessions see the latest 
snapshot, so would this not imply that all sessions would be redirected to 
the 'primary'? I don't think it is reasonable myself, but I might be 
misunderstanding something...


Cheers,
mark

--
Mark Mielke 



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Simon Riggs

On Sun, 2008-12-14 at 13:31 +0900, Tatsuo Ishii wrote:
> > The point here is that synchronous replication, at least to some
> > people, is going to imply that the user-visible states of the two
> > copies are consistent.  To other people, it is going to imply that
> > committed transactions will never be lost even in the event of a
> > catastrophic loss of the primary 1 picosecond after the commit is
> > acknowledged.  We need to choose some word that implies that we are
> > guaranteeing the latter of these two things but not the former.
> > Otherwise, we will have confused users, and terminological confusion
> > when and if we ever implement the former as well.
> 
> Right. Before watching this thread, I had thought that the log
> shipping sync replication behaves like the former (and I had told so
> to people in Japan who are interested in 8.4 development. Of course
> this is my fault, though).
> 
> Now I understand the log shipping sync replication does not behave
> the same as other "sync replications" such as pgpool and PGCluster
> (there may be more, but I don't know)

GENERAL COMMENTS, not to anybody in particular:

'Tis but thy name that is my enemy.
...
What's in a name? That which we call a rose
By any other name would smell as sweet.
...

Juliet, from "Romeo and Juliet"

I am truly at a loss to understand why the *name* "synchronous replication"
causes so much discussion, yet nobody has discussed what they would
actually like the software to *do* (this being a software discussion
list...). AFAICS we can make the software behave like *any* of the
definitions discussed so far.

It is certainly far too early to say what the final exact behaviour will
be and there is no reason at all to pre-suppose that it need only be a
single behaviour. I'm in favour of options, generally, but I would say
that the distinction between some of these options is mostly very fine
and strongly doubt whether people would use them if they existed. *But*
I think we can add them at a later stage of development if requirements
genuinely exist once all the benefits *and* costs are understood.

I would also point out that the distinction made between various
meanings of synchronous is *only* important if Hot Standby is included
as well. And that is closely linked to the replication feature, which we
really need to complete first. We have much to do yet.

So let's please end the name debate there and think about software.

...

We can reply to a commit message when any of the following
events has occurred

1. We sent the message to standby
2. We received the message on standby
3. We wrote the WAL to the WAL file
4. We fsync'd the WAL file
5. We CRC checked the WAL commit record
6. We applied the WAL commit record

Now you might think from what people have said that having synchronised
contents on both primary and standby is the only way to achieve exactly
the same results to queries on both nodes. Another way is to utilise a
snapshot taken on the primary and simply wait until the standby catches
up with that snapshot's LSN. So there is more than one way of achieving
a particular result and it is not dependent upon the exact
synchronisation we employ at commit time.
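
A minimal sketch of the "wait until the standby catches up" idea follows. Everything here is hypothetical: get_replayed_lsn stands in for whatever mechanism would report the standby's replay position, and LSNs are treated as plain integers for simplicity.

```python
import time


def wait_for_replay(get_replayed_lsn, target_lsn, timeout_s=5.0, poll_s=0.01):
    """Block until the standby has replayed WAL up to target_lsn.

    get_replayed_lsn is a hypothetical callable returning the standby's
    current replay position as an integer. Returns True once the target
    is reached, or False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while get_replayed_lsn() < target_lsn:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A reader that needs the primary's view would note the LSN of its snapshot on the primary, call something like this on the standby, and only then run its query. The commit itself does not have to wait for the standby to apply anything.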

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Emmanuel Cecchet


Robert Haas wrote:
> The term of art for making sure that transactions committed on the
> primary are visible on the secondary seems to be "one-copy
> serializability" (see, for example, a Google Books search on that
> term).

Not exactly. 1-copy-serializability, which is the standard for
multi-master solutions, guarantees that transactions are executed in the
same serializable order at each replica (which means that transactions
can be executed in a different order and committed at different times on
different replicas as long as a consistent serializable view is presented
to the client).

There are a number of optimizations in that area, but in a multi-master
case, replicas rarely commit at the same time. There are interesting
papers on the subject (like Tashkent & Tashkent+ based on Postgres) for
those who want to understand these problems more thoroughly.


Hope this helps,
manu

--
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting

--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-14 Thread Mark Mielke


Robert Haas wrote:
> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane  wrote:
>> We won't call it anything, because we never will or can implement that.
>> See the theory of relativity: the notion of exactly simultaneous events
>
> OK, fine.  I'll be more precise.  I think we need to reserve the term
> "synchronous replication" for a system where transactions that begin
> on the standby after the transaction has committed on the master see
> the effects of the committed transaction.


Wouldn't this be serialized transactions?

I'd like to see proof of some sort that PostgreSQL guarantees that the 
instant a 'commit' returns, any transactions already open with the 
appropriate transaction isolation level, or any new sessions *will* see 
the results of the commit.


I know that most of the time this happens - but what process 
synchronization steps occur to *guarantee* that this happens?



> I just googled "synchronous replication" and read through the first
> page of hits.  Most of them do not address the question of whether
> synchronous replication can be said to have been completed when WAL has
> been received by the standby but not yet applied.  One of the ones
> that does is:
>
> http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign
>
> ...which refers to what we're proposing to call "Synchronous
> Replication" as "Semi-Synchronous Replication" (or 2-safe replication)
> specifically to distinguish it.  The other is:
>
> http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf
>
> ...which doesn't specifically examine the issue but seems to take the
> opposite position, namely that the server on which the transaction is
> executed needs to wait only for one server to apply the changes to the
> database (the others need only to know that they need to commit it;
> they don't actually need to have done it).  However, that same paper
> refers to two-phase commit as a synchronous replication algorithm, and
> Wikipedia's discussion of two-phase commit:
>
> http://en.wikipedia.org/wiki/Two-phase_commit_protocol
>
> ...clearly implies that the transaction must be applied everywhere
> before it can be said to have committed.
>
> The second page of Google results is mostly a further discussion of
> the MySQL solution, which is mostly described as "semi-synchronous
> replication".
>
> Simon Riggs said upthread that Oracle called this "synchronous redo
> transport".  That is obviously much closer to what we are doing than
> "synchronous replication".
  


Two phase commit doesn't imply that the transaction is guaranteed to be 
immediately visible. See my previous paragraph. Unless transactions are 
locked from starting until they are able to prove that they have the 
latest commit (a feat which I'm going to theorize as impossible - 
because the moment you wait for a commit, and you begin again, you 
really have no guarantee that another commit has not occurred in the 
mean time), I think it's clear that two phase commit guarantees that the 
commit has taken place, but does *not* guarantee anything about visibility.


It might be a good bet - but guarantee? There is no such guarantee.

Cheers,
mark

--
Mark Mielke

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas

> The point here is that synchronous replication, at least to some
> people, is going to imply that the user-visible states of the two
> copies are consistent.  To other people, it is going to imply that
> committed transactions will never be lost even in the event of a
> catastrophic loss of the primary 1 picosecond after the commit is
> acknowledged.  We need to choose some word that implies that we are
> guaranteeing the latter of these two things but not the former.
> Otherwise, we will have confused users, and terminological confusion
> when and if we ever implement the former as well.

With apologies for replying to my own post:

It's also important to understand that these two invariants are
completely separate and it is possible to guarantee either without the
other.  If you want (1), the standby needs to apply the WAL before
sending an acknowledgment to the primary but does not necessarily need
to write it to disk (of course, it will have to be written to disk
before the modified buffers are written to disk, but that's a separate
issue).  If you want (2), the standby needs to write the WAL to disk
before sending the acknowledgment but does not necessarily need to
apply it.  If you want both, then, you need to wait for both (and it's
worth noting that your performance will probably be nothing to write
home about).
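
The independence of the two invariants can be written down as a small predicate. This is only an illustrative sketch with hypothetical names: (1) is the visibility invariant and (2) is the no-data-loss invariant from the quoted paragraph above.

```python
def standby_may_ack(wal_applied, wal_fsynced, want_visibility, want_durability):
    """Decide whether the standby may acknowledge a commit record.

    want_visibility -> invariant (1): the commit is visible on the standby,
                       which requires the WAL to have been applied.
    want_durability -> invariant (2): the commit survives loss of the primary,
                       which requires the WAL to have been written to disk.
    """
    if want_visibility and not wal_applied:
        return False
    if want_durability and not wal_fsynced:
        return False
    return True
```

Asking for both simply means waiting for the slower of the two steps, which is where the performance cost comes from.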

I also did some research on terminology that has been used in the
literature.  As Jim Gray describes it:

1-safe replication.  Transaction is committed when it has been locally
WAL-logged to durable storage.
Group-safe replication.  Transaction is committed when WAL has been
received by all remote servers, but not necessarily written to durable
storage.
Group-safe & 1-safe replication.  Transaction is committed when it has
been locally WAL-logged to durable storage and WAL has been received
by all remote servers.
2-safe replication.  Transaction is committed when it has been written
to durable storage on both local and remote servers.
Very safe replication.  As 2-safe, but fails any read-write
transaction if the secondary is down.

(Actually, it appears that "Transaction Processing" by Jim Gray and
Andreas Reuter, 1993, uses 2-safe to refer to either 2-safe or
group-safe; the distinction between the two is a subsequent
development. See e.g. Advances in Database Technology-EDBT 2004
by Elisa Bertino)
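
As a rough illustration, the commit conditions in the list above can be expressed as a predicate over three flags. The class, field names and level strings are hypothetical. "Very safe" is left out because, as described above, it differs from 2-safe in how it handles a secondary that is down, not in the commit condition itself.

```python
from dataclasses import dataclass


@dataclass
class CommitState:
    local_durable: bool    # WAL-logged to durable storage locally
    remote_received: bool  # WAL received by all remote servers
    remote_durable: bool   # WAL written to durable storage on remote servers


def is_committed(level: str, s: CommitState) -> bool:
    """Apply the commit condition for the named safety level."""
    if level == "1-safe":
        return s.local_durable
    if level == "group-safe":
        return s.remote_received
    if level == "group-safe+1-safe":
        return s.local_durable and s.remote_received
    if level == "2-safe":
        return s.local_durable and s.remote_durable
    raise ValueError(f"unknown safety level: {level}")
```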

The term of art for making sure that transactions committed on the
primary are visible on the secondary seems to be "one-copy
serializability" (see, for example, a Google Books search on that
term).

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Tatsuo Ishii

> The point here is that synchronous replication, at least to some
> people, is going to imply that the user-visible states of the two
> copies are consistent.  To other people, it is going to imply that
> committed transactions will never be lost even in the event of a
> catastrophic loss of the primary 1 picosecond after the commit is
> acknowledged.  We need to choose some word that implies that we are
> guaranteeing the latter of these two things but not the former.
> Otherwise, we will have confused users, and terminological confusion
> when and if we ever implement the former as well.

Right. Before watching this thread, I had thought that the log
shipping sync replication behaves like the former (and I had told so
to people in Japan who are interested in 8.4 development. Of course
this is my fault, though).

Now I understand the log shipping sync replication does not behave
the same as other "sync replications" such as pgpool and PGCluster
(there may be more, but I don't know)
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Jeff Davis

On Sat, 2008-12-13 at 22:23 -0500, Robert Haas wrote:
> > If it's guaranteed to be visible on the standby after it's committed on
> > the master, and you don't have any way to make it actually simultaneous,
> > then that implies that it's visible on the slave for some brief period
> > of time before it's committed on the master.
> >
> > That situation is still asymmetric, so why is that a better use of the
> > term "synchronous"?
> 
> Because that happens anyway.  If I request a commit on a single,
> unreplicated server, the server makes the commit visible to new
> transactions and then sends me a message informing me that the commit
> has completed.  Since the message takes some finite time to reach me,
> there is a window of time after the commit has completed and before I
> know that the commit has been completed.
> 

Oh, I see the distinction now.

Thanks for the detailed reply.

Regards,
Jeff Davis



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas

> Might it not be true that anybody unfamiliar would be confused and that this
> is a bit of a straw man?
[...]
> If my application assumes that it can commit to one server, and then read
> back the commit from another server, and my application breaks as a result,
> it's because I didn't understand the problem. Even if PostgreSQL didn't use
> the word "synchronous replication", I could still be confused. I need to
> understand the problem no matter what words are used.

That is certainly true.  But there is value in choosing words which
elucidate the situation as much as possible.

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas

> If it's guaranteed to be visible on the standby after it's committed on
> the master, and you don't have any way to make it actually simultaneous,
> then that implies that it's visible on the slave for some brief period
> of time before it's committed on the master.
>
> That situation is still asymmetric, so why is that a better use of the
> term "synchronous"?

Because that happens anyway.  If I request a commit on a single,
unreplicated server, the server makes the commit visible to new
transactions and then sends me a message informing me that the commit
has completed.  Since the message takes some finite time to reach me,
there is a window of time after the commit has completed and before I
know that the commit has been completed.

Suppose for the sake of argument that the single, unreplicated server
did these two tasks in the opposite order - namely, first, it sent a
message to the process requesting the commit stating that the commit
had completed, and only then made the transaction visible.  This would
create a race condition: the process requesting the commit might
receive the commit and begin a new transaction before the previous
transaction had been made visible, and would therefore not be able to
see the results of its own previous actions.  I think it's fair to say
that this behavior would be judged totally intolerable.
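
The ordering argument can be shown with a toy model. This is a deliberately simplified sketch with hypothetical names, not a description of how PostgreSQL is implemented: commit() returning stands for the client receiving the acknowledgment.

```python
class Server:
    """Toy single server that either makes a commit visible before
    acknowledging it, or acknowledges first and makes it visible later."""

    def __init__(self, ack_before_visible):
        self.ack_before_visible = ack_before_visible
        self.visible = set()
        self.pending = []

    def commit(self, row):
        """Returning from this method models the client receiving the ack."""
        if self.ack_before_visible:
            self.pending.append(row)  # acked, but not yet visible
        else:
            self.visible.add(row)     # visible first, then acked

    def make_pending_visible(self):
        """Models the server catching up some time after the ack."""
        self.visible.update(self.pending)
        self.pending.clear()

    def can_see(self, row):
        return row in self.visible
```

With ack_before_visible=True, a client that commits and immediately starts a new transaction can fail to see its own row until make_pending_visible() has run, which is exactly the race described above.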

Therefore, there can't possibly be any applications out there which
are depending on the fact that commits don't become visible until they
are acknowledged, but there very well could be some applications which
depend on the fact that once commits are acknowledged, they are
visible.  If replication is synchronous in this sense, then I can open
a connection to the master, write some data, close the connection,
open a new connection to the master or the slave (not caring which),
and read back the data that I just wrote (assuming no one else has
modified it in the mean time).  If it isn't, then I can't.  Some
people will not care about this, but some will.

The point here is that synchronous replication, at least to some
people, is going to imply that the user-visible states of the two
copies are consistent.  To other people, it is going to imply that
committed transactions will never be lost even in the event of a
catastrophic loss of the primary 1 picosecond after the commit is
acknowledged.  We need to choose some word that implies that we are
guaranteeing the latter of these two things but not the former.
Otherwise, we will have confused users, and terminological confusion
when and if we ever implement the former as well.

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Jeff Davis

On Sat, 2008-12-13 at 21:35 -0500, Robert Haas wrote:
> On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane  wrote:
> > "Robert Haas"  writes:
> >> I think we need to reserve the term "synchronous replication" for a
> >> system where transactions that begin at the same time on the primary
> >> and standby see the same tuples.  Clearly that is "more" synchronous
> >
> > We won't call it anything, because we never will or can implement that.
> > See the theory of relativity: the notion of exactly simultaneous events
> 
> OK, fine.  I'll be more precise.  I think we need to reserve the term
> "synchronous replication" for a system where transactions that begin
> on the standby after the transaction has committed on the master see
> the effects of the committed transaction.
> 

If it's guaranteed to be visible on the standby after it's committed on
the master, and you don't have any way to make it actually simultaneous,
then that implies that it's visible on the slave for some brief period
of time before it's committed on the master.

That situation is still asymmetric, so why is that a better use of the
term "synchronous"?

Regards,
Jeff Davis





Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas

On Sat, Dec 13, 2008 at 1:29 PM, Tom Lane  wrote:
> "Robert Haas"  writes:
>> I think we need to reserve the term "synchronous replication" for a
>> system where transactions that begin at the same time on the primary
>> and standby see the same tuples.  Clearly that is "more" synchronous
>
> We won't call it anything, because we never will or can implement that.
> See the theory of relativity: the notion of exactly simultaneous events

OK, fine.  I'll be more precise.  I think we need to reserve the term
"synchronous replication" for a system where transactions that begin
on the standby after the transaction has committed on the master see
the effects of the committed transaction.

> at distinct locations isn't even well-defined, because observers at yet
> other locations will disagree about what is "simultaneous".  And I'm
> not just making a joke here --- speed-of-light delays in a WAN are
> meaningful compared to current computer speeds.  In practice, the
> slave and the master will never commit at exactly the same time.
>
> I agree with the point made upthread that we should use the term
> "synchronous replication" the way it's commonly used in the industry.
> Inventing our own terminology might be fun but it's not really going
> to result in less confusion.

I just googled "synchronous replication" and read through the first
page of hits.  Most of them do not address the question of whether
synchronous replication can be said to have been completed when WAL has
been received by the standby but not yet applied.  One of the ones
that does is:

http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign

...which refers to what we're proposing to call "Synchronous
Replication" as "Semi-Synchronous Replication" (or 2-safe replication)
specifically to distinguish it.  The other is:

http://www.cnds.jhu.edu/pub/papers/cnds-2002-4.pdf

...which doesn't specifically examine the issue but seems to take the
opposite position, namely that the server on which the transaction is
executed needs to wait only for one server to apply the changes to the
database (the others need only to know that they need to commit it;
they don't actually need to have done it).  However, that same paper
refers to two-phase commit as a synchronous replication algorithm, and
Wikipedia's discussion of two-phase commit:

http://en.wikipedia.org/wiki/Two-phase_commit_protocol

...clearly implies that the transaction must be applied everywhere
before it can be said to have committed.

The second page of Google results is mostly a further discussion of
the MySQL solution, which is mostly described as "semi-synchronous
replication".

Simon Riggs said upthread that Oracle called this "synchronous redo
transport".  That is obviously much closer to what we are doing than
"synchronous replication".

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Mark Mielke


Markus Wanner wrote:
>> I don't think synchronous replication guarantees that it will be
>> immediately visible. Even if it did push the change to the other
>> machine, and the other machine had committed it, that doesn't guarantee
>> that any reader sees it any more than if I commit to the same machine
>> (no replication), I am guaranteed to see the change from another
>> session.
>
> AFAIK every snapshot taken after a transaction has acknowledged its
> commit is guaranteed to see changes from that transaction. Isn't that a
> pretty frequent and obvious user expectation?


Yes - but that's only really true while the session continues. From 
another session? I've never assumed that I could reconnect and be 
guaranteed to get the latest snapshot that includes absolutely 
everything that has been committed.


Any system that guaranteed this even when involving multiple machines 
would be guaranteed to be inefficient and difficult to scale in my 
opinion. How could any system promise to have reasonable commit times 
while also guaranteeing that once a commit completes, any session to any 
other server will be able to see the commit? I think this forces some 
sort of serialization between multiple machines and defeats the purpose 
of having multiple machines. Where before it was indeterminate when the
commit would take effect at each replica, it is now indeterminate when
my commit will succeed. That is, my commit cannot succeed until every
single server acknowledges that it has fully received and committed my
transaction. What happens if there are network
problems, or what happens if I am replicating over a slower link? What 
if I am committing to 100 servers? Is it reasonable to expect 100 server 
negotiations to complete in full before my own commit will return?



> > Synchronous replication only means that I can be assured that
> > my change has been saved permanently by the time my commit completes. It
> > doesn't mean anybody else can see my change or is guaranteed to see my
> > change if they query from another session.
>
> So you wouldn't be surprised if a transaction from two hours ago isn't
> visible on another node, just because that node happens to be rather
> busy with lots of other readers and maintenance tasks?

Any system that is two hours behind should drop out of the pool used to
satisfy reads. So, if there was a surprise, it would be this. I
don't believe ACID requires that a commit on one server is immediately
visible on another server. Any work I do on the "behind" server would
still be safe from a transaction and referential integrity perspective.
However, if I executed 'commit' on this "behind" server, I would expect
the commit to wait until it catches up, or in the case of a server 2
hours behind, I would expect the commit to fail. Look at the alternative - all
commits to any server in the pool would be locked up waiting for this
one machine to catch up on 2 hours of transactions. This emphasizes that
the problem is that a server two hours out of date is still in the pool,
rather than the problem being keeping things up-to-date.
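
As a rough illustration of dropping a lagging server from the read pool, here is a minimal sketch. The `Replica` type, the `lag_seconds` field and the `readable_pool` function are all hypothetical names invented for this example; nothing here comes from the patch under discussion or from any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_seconds: float  # hypothetical measure of how far this server is behind


def readable_pool(replicas, max_lag_seconds):
    """Keep only the servers that are fresh enough to satisfy reads."""
    return [r for r in replicas if r.lag_seconds <= max_lag_seconds]


servers = [Replica("a", 0.5), Replica("b", 7200.0)]
# With a 30 second limit, the server that is two hours behind is excluded.
pool = readable_pool(servers, max_lag_seconds=30.0)
```

How the lag would be measured, and what limit is acceptable, are left open here; the point is only that the pool membership check is separate from the commit path.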




> > If my application assumes that it can commit to one server, and then
> > read back the commit from another server, and my application breaks as a
> > result, it's because I didn't understand the problem.
>
> Well, yeah, depends on user expectations. I'm surprised to hear that you
> have that understanding of synchronous replication.

I've seen people face it in the past. Most recently we had a 
presentation from the developer of digg.com, and he described how he had 
this problem with MySQL and that he had to work around it.


On a smaller scale and slightly unrelated, I had this problem frequently 
between memcache and PostgreSQL. That is, memcache would always be 
latest, but PostgreSQL might not be latest, because the commit had not 
occurred.


It seems like a standard enough problem to me. I don't expect Postgres-R
to do the impossible. As with my previous paragraph, I don't expect
Postgres-R to wait 2 hours to commit just because one server is falling
behind.



> > Even if PostgreSQL
> > didn't use the word "synchronous replication", I could still be
> > confused. I need to understand the problem no matter what words are used.
>
> As said, it depends on what the common understanding of "synchronous
> replication" is. I've so far been under the impression, that these
> potential lags are unexpected and confusing. Several people pointed me
> at that problem and I've thus "relabeled" Postgres-R as not being
> synchronous. I'm at least surprised to suddenly get pushed into the
> other direction. :-)
>
> However, I absolutely agree that it's not that important how we name it.
> What is important, is that users and developers understand the difference.

I agree they are unexpected and confusing. I don't agree that they are 
unexpected or confusing to those knowledgeable in the domain. So, the 
question becomes - whose expectation is wrong? Should the user learn 
more? Or should we push for a c

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Mark Mielke wrote:
> Might it not be true that anybody unfamiliar would be confused and that
> this is a bit of a straw man?

Might be. I've neglected the issue myself for a while.

> I don't think synchronous replication guarantees that it will be
> immediately visible. Even if it did push the change to the other
> machine, and the other machine had committed it, that doesn't guarantee
> that any reader sees it any more than if I commit to the same machine
> (no replication), I am guaranteed to see the change from another
> session.

AFAIK every snapshot taken after a transaction has acknowledged its
commit is guaranteed to see changes from that transaction. Isn't that a
pretty frequent and obvious user expectation?

> Synchronous replication only means that I can be assured that
> my change has been saved permanently by the time my commit completes. It
> doesn't mean anybody else can see my change or is guaranteed to see my
> change if they query from another session.

So you wouldn't be surprised if a transaction from two hours ago isn't
visible on another node, just because that node happens to be rather
busy with lots of other readers and maintenance tasks?

> If my application assumes that it can commit to one server, and then
> read back the commit from another server, and my application breaks as a
> result, it's because I didn't understand the problem.

Well, yeah, depends on user expectations. I'm surprised to hear that you
have that understanding of synchronous replication.

> Even if PostgreSQL
> didn't use the word "synchronous replication", I could still be
> confused. I need to understand the problem no matter what words are used.

As said, it depends on what the common understanding of "synchronous
replication" is. I've so far been under the impression, that these
potential lags are unexpected and confusing. Several people pointed me
at that problem and I've thus "relabeled" Postgres-R as not being
synchronous. I'm at least surprised to suddenly get pushed into the
other direction. :-)

However, I absolutely agree that it's not that important how we name it.
What is important, is that users and developers understand the difference.

Regards

Markus Wanner


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Aidan Van Dyk wrote:
> Well, I think the PG MVCC (which wal-streaming just ships across
> somewhere else) will save that.  So with hot-standby, another client
> could see the result *after* the COMMIT has been requested, but
> *before* the COMMIT returns...  But we have this situation in a single
> current PG instance anyway, so it's nothing new.

AFAIU the proposed algorithm only waits until WAL is written on the
slave before acknowledging COMMIT. Application of the changes may be
deferred, so it's not necessarily immediately visible on the slave.
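
To make that gap concrete, here is a toy sketch of a standby where writing WAL and applying it are separate steps. The `Standby` class, its methods and the `commit` function are hypothetical and invented for illustration only; they are not taken from the patch.

```python
class Standby:
    """Toy standby: WAL is written first and applied some time later."""

    def __init__(self):
        self.wal = []         # commit records written to the standby's disk
        self.applied = 0      # how many of those records have been applied
        self.visible = set()  # transaction ids that readers here can see

    def write(self, xid):
        """Persist a commit record and return its position in the WAL."""
        self.wal.append(xid)
        return len(self.wal)

    def apply_next(self):
        """Apply one pending record, making its transaction visible."""
        if self.applied < len(self.wal):
            self.visible.add(self.wal[self.applied])
            self.applied += 1


def commit(standby, xid):
    """Master-side COMMIT: returns as soon as the standby has written."""
    return standby.write(xid)
```

Between `commit()` returning and the next `apply_next()` call, the transaction is durable on the standby but absent from `visible`, which is the window an application reading from the slave could notice.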

> But with hot-standby, I could also see that it could be done such that
> the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but
> because of a current running query, application of it is delayed...  But
> this is hot-standby's problem of describing itself, not sync-rep.

I'm thinking of the overall system and don't care much if it's
hot-standby's or sync-rep's problem. But it's certainly the master which
needs to await certain acknowledgments of the slaves. That has so far
been discussed within this sync-rep thread.

> IMHO, sync-rep is about getting the change "durably to a slave" before
> acknowledging the COMMIT.  That slave could be any number of things:
> - A "WAL archive" type system having the ability to be used for
>   recovery
> - A PG with special "recovery mode" that reads the stream and applies it
> - A full hot-standby recovery
>
> I could see any and all of those (and probably others) being useful and
> used.

I certainly agree to that.

Regards

Markus Wanner


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Mark Mielke


Markus Wanner wrote:

> Tom Lane wrote:
> > We won't call it anything, because we never will or can implement that.
> > See the theory of relativity: the notion of exactly simultaneous events
> > at distinct locations isn't even well-defined
>
> That has never been the point of the discussion. It's rather about the
> question if changes from transactions are guaranteed to be visible on
> remote nodes immediately after commit acknowledgment. Whether or not
> this is guaranteed, in both cases the term "synchronous replication" is
> commonly used, which is causing confusion.

Might it not be true that anybody unfamiliar would be confused and that 
this is a bit of a straw man?


I don't think synchronous replication guarantees that it will be 
immediately visible. Even if it did push the change to the other 
machine, and the other machine had committed it, that doesn't guarantee 
that any reader sees it any more than if I commit to the same machine 
(no replication), I am guaranteed to see the change from another 
session. Synchronous replication only means that I can be assured that 
my change has been saved permanently by the time my commit completes. It 
doesn't mean anybody else can see my change or is guaranteed to see my
change if they query from another session.


If my application assumes that it can commit to one server, and then 
read back the commit from another server, and my application breaks as a 
result, it's because I didn't understand the problem. Even if PostgreSQL 
didn't use the word "synchronous replication", I could still be 
confused. I need to understand the problem no matter what words are used.


Cheers,
mark

--
Mark Mielke

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Aidan Van Dyk

* Markus Wanner  [081213 16:33]:
> Hi,
> 
> Hannu Krosing wrote:
> > You can have a variant of sync rep + hot standby where the master does
> > not return committed before the slave has both synced the data and
> > replayed the transaction so that it is visible on slave, but in that case
> > you may have a use case where it is actually visible on slave _before_
> > it is visible on master.
> 
> As long as it's not visible *before* the client requests a COMMIT, that
> certainly doesn't matter (because the application cannot check that).

Well, I think the PG MVCC (which wal-streaming just ships across
somewhere else) will save that.  So with hot-standby, another client
could see the result *after* the COMMIT has been requested, but
*before* the COMMIT returns...  But we have this situation in a single
current PG instance anyway, so it's nothing new.

But with hot-standby, I could also see that it could be done such that
the wal-stream is fsynced to disk (i.e. xlog) and acknowledged, but
because of a currently running query, application of it is delayed...  But
this is hot-standby's problem of describing itself, not sync-rep.

IMHO, sync-rep is about getting the change "durably to a slave" before
acknowledging the COMMIT.  That slave could be any number of things:
- A "WAL archive" type system having the ability to be used for
  recovery
- A PG with special "recovery mode" that reads the stream and applies it
- A full hot-standby recovery

I could see any and all of those (and probably others) being useful and
used.

But the current patch focuses on the streaming (sending), and
a receiver "recovery" mode that can accept/apply them, again,
without worrying about actually running queries (yet) ...

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Hannu Krosing wrote:
> You can have a variant of sync rep + hot standby where the master does
> not return committed before the slave has both synced the data and
> replayed the transaction so that it is visible on slave, but in that case
> you may have a use case where it is actually visible on slave _before_
> it is visible on master.

As long as it's not visible *before* the client requests a COMMIT, that
certainly doesn't matter (because the application cannot check that).

What matters is, that an application might expect a node to show the
changes of a transaction which has previously (seen from the application
itself) been committed and acknowledged by another node.

AFAICT the common understanding of synchronous replication is, that all
nodes confirm to have committed the changes of a transaction *before*
acknowledging COMMIT to the application (and obviously only *after* the
application requested to COMMIT the transaction, so the guarantee is
that all nodes commit *sometime* within that time frame, which is
certainly possible to guarantee, see 2PC approaches).

This guarantee is not provided by the Postgres-R algorithm, nor by the
approach presented. Both only guarantee, that the transaction *will* get
committed (and thus get visible) on all nodes *sometime* *after* the
application requested to commit it (even in case of various failures,
that is) [1]. As cited before, that has been enough of a reason for Jan
Wieck to call Postgres-R asynchronous, and I certainly see his point.

Note that the amount of time that passes between the commit
acknowledgment and the actual commit on remote nodes may theoretically
be infinitely long. And in practice certainly long enough for an
application to notice the difference. However, it still is a practical
optimization, because most applications should cope with it just fine.
But not all...

Do you consider the proposed log shipping approach to be synchronous?
How about the Postgres-R algorithm?

Regards

Markus Wanner

[1]: of course these approaches also guarantee that the transaction is
committed on the local node *before* acknowledging commit, so that
subsequent (seen from the application) queries are guaranteed to see the
changes. But that guarantee only holds true for the local node.


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Simon Riggs wrote:
>> Hot Standby (although the latter
>> seems to have stalled a bit...)
> 
> It's just being worked on asynchronously. ;-)

LOL, thanks for bringing humor into this discussion :-)

Regards

Markus Wanner


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Tom Lane wrote:
> We won't call it anything, because we never will or can implement that.
> See the theory of relativity: the notion of exactly simultaneous events
> at distinct locations isn't even well-defined

That has never been the point of the discussion. It's rather about the
question if changes from transactions are guaranteed to be visible on
remote nodes immediately after commit acknowledgment. Whether or not
this is guaranteed, in both cases the term "synchronous replication" is
commonly used, which is causing confusion.

Regards

Markus Wanner


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Hannu Krosing

On Sat, 2008-12-13 at 21:35 +0200, Hannu Krosing wrote:

> We still could call Sync Rep as a feature "synchronous replication" on
> basis that "WAL Streaming - Synchronous Write" is the highest security
> level achievable using the feature.
> 
> And maybe have Sync Hot Standby as a feature on top of that which
> provides "WAL Streaming - Synchronous Apply"

Or maybe better call it Serializable Hot Standby, as the actual
guarantee that can be achieved is that when one client does something on
master and after committing on master starts another transaction on
slave, then the effects of query on master are visible on slave.
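
One way to picture that guarantee is a client that remembers how far its own commits reach and only starts work on a slave that has applied at least that much. This is a sketch under assumptions: the `Session` class and its methods are hypothetical, and plain integers stand in for WAL positions.

```python
class Session:
    """Client session that remembers the position of its last commit."""

    def __init__(self):
        self.last_commit = 0  # integer stand-in for a WAL position

    def note_commit(self, position):
        """Record the position acknowledged by a COMMIT on the master."""
        self.last_commit = max(self.last_commit, position)

    def may_start_on(self, slave_applied):
        """True if a transaction started on the slave would see our commits."""
        return slave_applied >= self.last_commit
```

A client would call `note_commit()` after each COMMIT on the master and check `may_start_on()` before starting a transaction on a slave, waiting or picking another node when it returns False.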


-- 
--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Hannu Krosing

On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote:
> > I certainly agree to using such terms. Unfortunately, in my experience,
> > synchronous replication is commonly used to mean that transactions are
> > guaranteed to be immediately visible on remote nodes after the client
> > got commit acknowledgment. That's the cause for confusion I'm envisioning.
> 
> I think that's a very important point.  It's very possible that 8.4
> may support both this feature and Hot Standby (although the latter
> seems to have stalled a bit...).  That makes me think "oh, great, I
> can offload any subset of my read-only queries to the standby".  Not
> so fast.
> 
> I think we need to reserve the term "synchronous replication" for a
> system where transactions that begin at the same time on the primary
> and standby see the same tuples.

Define "same time". 

You can have a variant of sync rep + hot standby where the master does
not return committed before the slave has both synced the data and
replayed the transaction so that it is visible on slave, but in that case
you may have a use case where it is actually visible on slave _before_
it is visible on master.

Actually you can't have that "same time" guarantee even on a single
system; that is, if you start two transactions on two connections "at
the same time", you still can't be sure there is not a third transaction
which has committed between those two and which makes the visible data
on those two different.


>  Clearly that is "more" synchronous
> than what is being proposed here; if we call this "synchronous
> replication", what will we call that?  "Really Synchronous, Honest, No
> Kidding"?   Admittedly, we may never implement that feature, but that
> seems irrelevant.
> 
> It would be useful to have names for all the different possibilities.
>  Random ideas:
> 
> Log Shipping.  After each log switch, the previous WAL log is copied
> to the standby in its entirety.
> 
> WAL Streaming - Asynchronous.  The WAL log is streamed from master to
> standby as it is written, but transactions on the master never wait.
> 
> WAL Streaming - Synchronous Receive.  The WAL log is streamed from
> master to standby as it is written, and transactions on the master
> wait until the standby acknowledges receipt of the WAL.
> 
> WAL Streaming - Synchronous Write.  The WAL log is streamed from
> master to standby as it is written, and transactions on the master
> wait until the standby acknowledges that the WAL has been written to
> disk.
> 
> WAL Streaming - Synchronous Apply.  The WAL log is streamed from
> master to standby as it is written, and transactions on the master
> wait until the standby acknowledges that WAL has been written to disk
> and applied.

We still could call Sync Rep as a feature "synchronous replication" on
basis that "WAL Streaming - Synchronous Write" is the highest security
level achievable using the feature.

And maybe have Sync Hot Standby as a feature on top of that which
provides "WAL Streaming - Synchronous Apply"



--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Simon Riggs


On Sat, 2008-12-13 at 13:05 -0500, Robert Haas wrote:

> Hot Standby (although the latter
> seems to have stalled a bit...)

It's just being worked on asynchronously. ;-)

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Aidan Van Dyk

Synchronous replication, "sync rep", is *not* interested in the "slave's
visibility of the commit", because PostgreSQL doesn't "serve" requests
when in recovery (wal receiving) mode *now*.

This sync rep patch/proposal/discussion is *strictly* (at this point;
hot standby may eventually or hopefully soon change that) the means to
get the data "safely in 2 separate places", before the COMMIT returns,
by means of wal streaming.  That "safely in 2 places" can have various
implementation options (like received, on disk, or applied), and
Fujii-san explained some of the options as to what to consider "safe"
and their trade-offs at his presentation last year.

Once both sync-rep (the wal-streaming get changes in two places) and
hot-standby (run queries while WAL is being applied) are available in
PostgreSQL, at that point we might need to start worrying about "other
client visibility", but even then, we still don't need to worry about
multi-master options...

a.


* Markus Wanner  [081213 12:17]:
> Hi,
> 
> Simon Riggs wrote:
> > On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:
> >> Speaking of a "synchronous commit"
> >> is utterly misleading, because the commit itself is exactly the thing
> >> that's *not* synchronous.
> > 
> > Not really sure where you're going here.
> 
> I'm pointing to a potential misunderstanding, trying to help to prevent
> you from running into the same issues and discussions as I did.
> 
> I've learned the hard way, that the Postgres-R algorithm is not fully
> synchronous (in the strict sense). This caused confusion for people who
> take the word "synchronous" by its original meaning. The algorithm
> proposed here seems similar enough to potentially cause the same confusion.
> 
> As I see it now, I think it's well worth pointing out the difference,
> from both the technical and the marketing perspective. The
> former for better understanding, the latter to prevent users from
> thinking it must be slow by definition. Arguing that your approach is
> not fully synchronous definitely helps to address that concern.
> 
> However, I'm just now realizing, that the difference is only relevant as
> soon as you begin to allow read-only access on the slave. AFAIK that's
> among the goals of this effort, no?
> 
> > "synchronous replication" is
> > used exactly as described in the Wikipedia entry here:
> > http://en.wikipedia.org/wiki/Database_replication
> 
> That article describes pretty much all variants of replication, what
> exactly are you referring to?
> 
> Under "Database Replication > Multi-Master replication" it describes
> eager vs lazy variants, which is IMO a more appropriate and useful
> distinction than sync vs async. (But that's admittedly a sentence I've
> contributed myself, IIRC).
> 
> Under "Storage Replication > Synchronous Replication" one can read:
> "Write is not considered complete until acknowledgement by both local
> and remote storage." For the proposed approach this might hold true for
> WAL writing. However, the user certainly doesn't care how synchronously
> the log is shipped or written, as long as she doesn't see the
> changes on the slave.
> 
> That's the difference between fully synchronous and eager (or virtually
> or approximately synchronous) algorithms. You seem to refer to both as
> "synchronous". Phrases like "synchronous commit" or "synchronous data
> transfer" do not help me to understand what exactly you are talking about.
> 
> Explaining that the slave commits (and therefore makes the transactions
> visible) asynchronously would help. And it would prevent disappointment
> for users who expect changes to be immediately visible on the slave.
> 
> > No two word phrase is going to accurately sum up the complexity and
> > potential for data loss in these situations. DRBD saw that too and just
> > called them A, B and C and then describe them more accurately.
> 
> Agreed. I've chosen lazy, eager and sync, so far. I'm open for better
> terms, and I leave it up to you to call your variants whatever you like.
> But to understand what you are talking about, I'd prefer to get to know
> these distinctions crisp and clear.
> 
> > But I don't think we should say "PostgreSQL just implemented algorithm
> > B" which is just unhelpful. I don't think its "marketing" to refer to it
> > by the phrase most commonly used for the technology we are building.
> 
> I certainly agree to using such terms. Unfortunately, in my experience,
> synchronous replication is commonly used to mean that transactions are
> guaranteed to be immediately visible on remote nodes after the client
> got commit acknowledgment. That's the cause for confusion I'm envisioning.
> 
> 
> I'm hoping to be somewhat helpful to this effort of getting a log
> shipping replication variant into Postgres. It can only be beneficial
> for Postgres-R in that we gain field experience with ..uhm.. this
> special kind of replication, however we name it.
> 
> I'm already on xmas vacation, so I won't bother you any fu

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Tom Lane

"Robert Haas"  writes:
> I think we need to reserve the term "synchronous replication" for a
> system where transactions that begin at the same time on the primary
> and standby see the same tuples.  Clearly that is "more" synchronous
> than what is being proposed here; if we call this "synchronous
> replication", what will we call that?  "Really Synchronous, Honest, No
> Kidding"?   Admittedly, we may never implement that feature, but that
> seems irrelevant.

We won't call it anything, because we never will or can implement that.
See the theory of relativity: the notion of exactly simultaneous events
at distinct locations isn't even well-defined, because observers at yet
other locations will disagree about what is "simultaneous".  And I'm
not just making a joke here --- speed-of-light delays in a WAN are
meaningful compared to current computer speeds.  In practice, the
slave and the master will never commit at exactly the same time.

I agree with the point made upthread that we should use the term
"synchronous replication" the way it's commonly used in the industry.
Inventing our own terminology might be fun but it's not really going
to result in less confusion.

regards, tom lane


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Robert Haas

> I certainly agree to using such terms. Unfortunately, in my experience,
> synchronous replication is commonly used to mean that transactions are
> guaranteed to be immediately visible on remote nodes after the client
> got commit acknowledgment. That's the cause for confusion I'm envisioning.

I think that's a very important point.  It's very possible that 8.4
may support both this feature and Hot Standby (although the latter
seems to have stalled a bit...).  That makes me think "oh, great, I
can offload any subset of my read-only queries to the standby".  Not
so fast.

I think we need to reserve the term "synchronous replication" for a
system where transactions that begin at the same time on the primary
and standby see the same tuples.  Clearly that is "more" synchronous
than what is being proposed here; if we call this "synchronous
replication", what will we call that?  "Really Synchronous, Honest, No
Kidding"?   Admittedly, we may never implement that feature, but that
seems irrelevant.

It would be useful to have names for all the different possibilities.
 Random ideas:

Log Shipping.  After each log switch, the previous WAL log is copied
to the standby in its entirety.

WAL Streaming - Asynchronous.  The WAL log is streamed from master to
standby as it is written, but transactions on the master never wait.

WAL Streaming - Synchronous Receive.  The WAL log is streamed from
master to standby as it is written, and transactions on the master
wait until the standby acknowledges receipt of the WAL.

WAL Streaming - Synchronous Write.  The WAL log is streamed from
master to standby as it is written, and transactions on the master
wait until the standby acknowledges that the WAL has been written to
disk.

WAL Streaming - Synchronous Apply.  The WAL log is streamed from
master to standby as it is written, and transactions on the master
wait until the standby acknowledges that WAL has been written to disk
and applied.
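
The names above differ only in which standby milestone, if any, the master waits for before a commit returns. A minimal sketch of that idea follows; the `Mode` enum, the `WAITS_FOR` table and `can_acknowledge` are hypothetical names made up for this illustration, and treating plain log shipping as "never waits" is an assumption rather than something stated above.

```python
from enum import Enum


class Mode(Enum):
    LOG_SHIPPING = "log shipping"
    ASYNC = "WAL streaming - asynchronous"
    SYNC_RECEIVE = "WAL streaming - synchronous receive"
    SYNC_WRITE = "WAL streaming - synchronous write"
    SYNC_APPLY = "WAL streaming - synchronous apply"


# Standby-side milestones, in the order a WAL record passes through them.
STAGES = ("received", "written", "applied")

# Milestone the master waits for before acknowledging COMMIT.
# None means the master never waits for the standby (assumed for
# LOG_SHIPPING; stated in the list above for ASYNC).
WAITS_FOR = {
    Mode.LOG_SHIPPING: None,
    Mode.ASYNC: None,
    Mode.SYNC_RECEIVE: "received",
    Mode.SYNC_WRITE: "written",
    Mode.SYNC_APPLY: "applied",
}


def can_acknowledge(mode, standby_reached):
    """Return True if the master may acknowledge COMMIT.

    standby_reached is the latest milestone the standby has reported
    for the commit record, or None if it has reported nothing yet.
    """
    required = WAITS_FOR[mode]
    if required is None:
        return True
    if standby_reached is None:
        return False
    return STAGES.index(standby_reached) >= STAGES.index(required)
```

For example, `can_acknowledge(Mode.SYNC_WRITE, "received")` is False, while `can_acknowledge(Mode.SYNC_WRITE, "written")` is True.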

...Robert


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Simon Riggs wrote:
> On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:
>> Speaking of a "synchronous commit"
>> is utterly misleading, because the commit itself is exactly the thing
>> that's *not* synchronous.
> 
> Not really sure where you're going here.

I'm pointing to a potential misunderstanding, trying to help to prevent
you from running into the same issues and discussions as I did.

I've learned the hard way, that the Postgres-R algorithm is not fully
synchronous (in the strict sense). This caused confusion for people who
take the word "synchronous" by its original meaning. The algorithm
proposed here seems similar enough to potentially cause the same confusion.

As I see it now, I think it's well worth pointing out the difference,
from both the technical and the marketing perspective. The
former for better understanding, the latter to prevent users from
thinking it must be slow by definition. Arguing that your approach is
not fully synchronous definitely helps to address that concern.

However, I'm just now realizing, that the difference is only relevant as
soon as you begin to allow read-only access on the slave. AFAIK that's
among the goals of this effort, no?

> "synchronous replication" is
> used exactly as described in the Wikipedia entry here:
> http://en.wikipedia.org/wiki/Database_replication

That article describes pretty much all variants of replication, what
exactly are you referring to?

Under "Database Replication > Multi-Master replication" it describes
eager vs lazy variants, which is IMO a more appropriate and useful
distinction than sync vs async. (But that's admittedly a sentence I've
contributed myself, IIRC).

Under "Storage Replication > Synchronous Replication" one can read:
"Write is not considered complete until acknowledgement by both local
and remote storage." For the proposed approach this might hold true for
WAL writing. However, the user certainly doesn't care how synchronously
the log is shipped or written, as long as she doesn't see the
changes on the slave.

That's the difference between fully synchronous and eager (or virtually
or approximately synchronous) algorithms. You seem to refer to both as
"synchronous". Phrases like "synchronous commit" or "synchronous data
transfer" do not help me to understand what exactly you are talking about.

Explaining that the slave commits (and therefore makes the transactions
visible) asynchronously would help. And it would prevent disappointment
for users who expect changes to be immediately visible on the slave.

> No two word phrase is going to accurately sum up the complexity and
> potential for data loss in these situations. DRBD saw that too and just
> called them A, B and C and then described them more accurately.

Agreed. I've chosen lazy, eager and sync so far. I'm open to better
terms, and I leave it up to you to call your variants whatever you like.
But to understand what you are talking about, I'd prefer these
distinctions to be crisp and clear.
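
To make the three terms concrete, here is a minimal sketch of what each one promises at the moment the client receives its commit acknowledgment. The names (Mode, guarantees_at_commit_ack, the two dictionary keys) are hypothetical and chosen only for illustration; the "lazy" row reflects the usual meaning of asynchronous replication, which this thread does not spell out.

```python
from enum import Enum


class Mode(Enum):
    LAZY = "lazy"    # commit is acknowledged before changes reach the remote node
    EAGER = "eager"  # remote node can apply the changes later, but has not yet
    SYNC = "sync"    # changes are already visible on the remote node


def guarantees_at_commit_ack(mode):
    """Return what the remote node guarantees when the client sees COMMIT."""
    table = {
        Mode.LAZY: {"durable_on_remote": False, "visible_on_remote": False},
        Mode.EAGER: {"durable_on_remote": True, "visible_on_remote": False},
        Mode.SYNC: {"durable_on_remote": True, "visible_on_remote": True},
    }
    return table[mode]


eager = guarantees_at_commit_ack(Mode.EAGER)
sync = guarantees_at_commit_ack(Mode.SYNC)
```

Under this reading, the proposed log-shipping approach would fall into the "eager" row: the commit is safe on the standby, but a read-only query there may not see it yet.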

> But I don't think we should say "PostgreSQL just implemented algorithm
> B" which is just unhelpful. I don't think it's "marketing" to refer to it
> by the phrase most commonly used for the technology we are building.

I certainly agree to using such terms. Unfortunately, in my experience,
synchronous replication is commonly used to mean that transactions are
guaranteed to be immediately visible on remote nodes after the client
got commit acknowledgment. That's the cause for confusion I'm envisioning.

I'm hoping to be somewhat helpful to this effort of getting a log
shipping replication variant into Postgres. It can only be beneficial
for Postgres-R in that we gain field experience with ..uhm.. this
special kind of replication, however we name it.

I'm already on xmas vacation, so I won't bother you any further on this
issue. Have fun coding and make sure to enjoy this time of the year.

All the best.

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Simon Riggs

On Sat, 2008-12-13 at 14:07 +0100, Markus Wanner wrote:

> Speaking of a "synchronous commit"
> is utterly misleading, because the commit itself is exactly the thing
> that's *not* synchronous.

Not really sure where you're going here. "synchronous replication" is
used exactly as described in the Wikipedia entry here:
http://en.wikipedia.org/wiki/Database_replication

No two word phrase is going to accurately sum up the complexity and
potential for data loss in these situations. DRBD saw that too and just
called them A, B and C and then described them more accurately.

But I don't think we should say "PostgreSQL just implemented algorithm
B" which is just unhelpful. I don't think it's "marketing" to refer to it
by the phrase most commonly used for the technology we are building.
Nobody suggested we call it "wizrep" or suchlike...

The docs can contain the exact description of data loss and timing
windows.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Grzegorz Jaskiewicz



On 2008-12-13, at 13:07, Markus Wanner wrote:

> However, that is a marketing decision [1], which should not be mixed
> with the technical discussion here. Speaking of a "synchronous commit"
> is utterly misleading, because the commit itself is exactly the thing
> that's *not* synchronous.
>
> [1]: Some people like the term "virtually synchronous" for marketing
> purposes. That's at least half-ways technically correct.


Marketing people are virtually trustworthy, from my life experience.
If you ask me, this is just preposterous.




Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Markus Wanner

Hi,

Simon Riggs wrote:
> You're right that neither the data transfer nor data availability is
> entirely synchronous, but data transfer is synchronous at time of
> *commit*: it is recorded on multiple nodes at the same time.

I'm unsure what you mean by a "data transfer being synchronous". With what
other process or state should the data transfer be synchronous?

> The term "synchronous replication" is already well used in the industry
> to mean synchronous commit, so I don't think we should change the name
> now. The project here is also known to everybody as "synch rep".

I understand very well, that you don't want to change the name. I've
been hesitant to "relabel" Postgres-R from synchronous to asynchronous
to eager.

However, that is a marketing decision [1], which should not be mixed
with the technical discussion here. Speaking of a "synchronous commit"
is utterly misleading, because the commit itself is exactly the thing
that's *not* synchronous.

It *is* an optimization to fully synchronous replication to defer commit
on the "slave" and only make sure that the transaction *can* be applied
at some time in the future.

However, this *does* have the drawback of transactions not being
immediately visible on the slave. Often enough, this is acceptable. But
it certainly matters to some application developers.

> What is confusing is that "replication" itself is a much abused term and
> is used to describe technologies for HA, DR and data movement.

I absolutely agree with that. I'd thus recommend at least being
consistent and honest with the term "synchronous" and pointing out that WAL
writing is synchronous for the log shipping approach here (AFAIK), but
that the commit is asynchronous for performance reasons. In other words:
this approach is certainly (and hopefully, for performance reasons)
different from a fully synchronous approach. Even for marketing reasons,
it might make sense to point out that difference (.. "no, we are faster
than fully sync rep.").

Regards

Markus Wanner

[1]: Some people like the term "virtually synchronous" for marketing
purposes. That's at least half-ways technically correct.


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-13 Thread Simon Riggs

On Sat, 2008-12-13 at 00:00 +0100, Markus Wanner wrote:
> Hi,
> 
> Fujii Masao wrote:
> > I'd like to define the meaning of "synch rep" again. "synch rep" means:
> > 
> > (1) Transaction commit waits for WAL records to be replicated to the standby
> >   before the command returns a "success" indication to the client.
> > 
> > (2) The standby has (can read) all WAL files indispensable for recovery.
> 
> Let me point out that - very much like the original Postgres-R algorithm
> - this guarantees committed transactions to be durable and consistent
> (no late aborts of conflicting transactions), but it does not guarantee
> that a transaction committed on one node is immediately visible on the
> other node. In that sense, it is not synchronous as commonly understood,
> because it does not "operate with all their parts in synchrony" [1], as
> implied by the term "synchronous". This might (and often has in the
> past) lead to confusion.

You're right that neither the data transfer nor data availability is
entirely synchronous, but data transfer is synchronous at time of
*commit*: it is recorded on multiple nodes at the same time.

The term "synchronous replication" is already well used in the industry
to mean synchronous commit, so I don't think we should change the name
now. The project here is also known to everybody as "synch rep".

* Oracle Data Guard calls it "synchronous redo transport"
* MS Exchange calls it "synchronous replication"
* MS SQL Server has "Database Mirroring", "Log Shipping" and
"Replication". "Database Mirroring" provides a synchronous mechanism, with
"Replication" meaning data transfer to other databases,
publish&subscribe.
* DB2 HADR provides "synchronous replication"
* MySQL calls it "synchronous replication"

What is confusing is that "replication" itself is a much abused term and
is used to describe technologies for HA, DR and data movement.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-12 Thread Markus Wanner

Hi,

Fujii Masao wrote:
> I'd like to define the meaning of "synch rep" again. "synch rep" means:
> 
> (1) Transaction commit waits for WAL records to be replicated to the standby
>   before the command returns a "success" indication to the client.
> 
> (2) The standby has (can read) all WAL files indispensable for recovery.

Let me point out that - very much like the original Postgres-R algorithm
- this guarantees committed transactions to be durable and consistent
(no late aborts of conflicting transactions), but it does not guarantee
that a transaction committed on one node is immediately visible on the
other node. In that sense, it is not synchronous as commonly understood,
because it does not "operate with all their parts in synchrony" [1], as
implied by the term "synchronous". This might (and often has in the
past) lead to confusion.
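
The distinction can be illustrated with a toy model. This is only a sketch under stated assumptions: the Primary and Standby classes and their methods are hypothetical and are not PostgreSQL APIs. The commit waits until the standby has received the WAL record (point (1) above), but the standby replays it later, so a read on the standby may not see the change yet.

```python
class Standby:
    def __init__(self):
        self.wal = []        # WAL records received from the primary
        self.replayed = 0    # number of records already applied
        self.visible = {}    # state that read-only queries can see

    def receive(self, record):
        # Store the record and acknowledge it; nothing is applied yet.
        self.wal.append(record)
        return True

    def replay(self):
        # Apply all received-but-unapplied records, making them visible.
        for key, value in self.wal[self.replayed:]:
            self.visible[key] = value
        self.replayed = len(self.wal)


class Primary:
    def __init__(self, standby):
        self.standby = standby
        self.visible = {}

    def commit(self, key, value):
        # Wait for the standby's acknowledgment before reporting success.
        if not self.standby.receive((key, value)):
            raise RuntimeError("standby did not acknowledge WAL")
        self.visible[key] = value
        return "COMMIT"


standby = Standby()
primary = Primary(standby)
status = primary.commit("x", 1)
seen_before_replay = standby.visible.get("x")  # standby has the WAL, not the row
standby.replay()
seen_after_replay = standby.visible.get("x")
```

In this model the transaction is safe on both nodes as soon as commit() returns, yet it only becomes visible on the standby after replay() runs.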

It's certainly enough of a reason for me to rather use the term "eager
replication". See [2] for a more in-depth explanation. I might also
point out that Jan Wieck called this very same approach "an asynchronous
replication system by all means" [3].

Regards

Markus Wanner


[1]: Wikipedia on Synchronization
http://en.wikipedia.org/wiki/Synchronization

[2]: Postgres-R general mailing list, by Markus Wanner, subject:
terms for database replication: synchronous vs eager
http://lists.pgfoundry.org/pipermail/postgres-r-general/2008-September/14.html

[3]: Postgres General mailing list, by Jan Wieck, subject:
terms for database replication: synchronous vs eager
http://archives.postgresql.org/pgsql-hackers/2007-09/msg00631.php


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-12 Thread Jeff Davis

On Fri, 2008-12-12 at 14:23 -0500, Aidan Van Dyk wrote:
> So when would I have to call that function? Before begin, after begin,
> before commit, or all, to guarantee that I know that my application is
> supposed to "delay" calling commit until sync-mode is actually
> synchronous? And then afterwards, I have to call it again to make sure
> it didn't fall "out of" mode between my previous call and the commit
> actually working?

I'm not suggesting that applications call the function. It's a way for a
monitoring system to know that you're in a degraded state and notify
you.
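
A minimal sketch of that monitoring idea, assuming some status call exists that reports whether the server is currently in sync rep mode. The names make_monitor, in_sync_mode and notify are hypothetical; the thread does not name the actual function.

```python
def make_monitor(in_sync_mode, notify):
    """Return a poll function that calls notify() on each mode change.

    in_sync_mode: hypothetical callable, True while sync rep is active.
    notify: hypothetical callable that takes a message string.
    """
    last = {"ok": True}  # assumption: monitoring starts while in sync

    def poll():
        ok = in_sync_mode()
        if ok != last["ok"]:
            notify("sync rep restored" if ok
                   else "DEGRADED: running without sync rep")
            last["ok"] = ok
        return ok

    return poll


# Canned states stand in for a real status query against the server.
states = iter([True, False, False, True])
alerts = []
poll = make_monitor(lambda: next(states), alerts.append)
results = [poll() for _ in range(4)]
```

The application never calls the status function itself; only the monitor does, and it alerts once per transition rather than on every poll.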

I'm not sure I entirely understand the use case you're advocating:

Let's say the standby has a major failure. Now you have a single point
of failure (the primary), so _all_ of your transactions are in jeopardy
anyway -- at least until you get back into sync rep. Rejecting new
transactions won't save your old ones.

The only time it helps is when the failure is temporary, i.e. you didn't
really lose the storage on the standby. But you would need to rely on
some guarantee that the storage is still intact on the standby system
even though the standby is unresponsive.

Is that the use case?

Regards,
Jeff Davis

