Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-17 Thread Fujii Masao
Hi,

On Thu, Jul 16, 2009 at 6:00 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 The archive should not normally contain partial XLOG files; that happens
 only if you manually copy one there after the primary has crashed. So I
 don't think that's something we need to support.

You are right. And if the last valid record ends in the middle of the
restored file (e.g. because of an XLOG_SWITCH record), begin should
indicate the head of the next file.

 Hmm. You only need the timeline history file if the base backup was
 taken in an earlier timeline. That situation would only arise if you
 (manually) take a base backup, restore to a server (which creates a new
 timeline), and then create a slave against that server. At least in the
 1st phase, I think we can assume that the standby has access to the same
 archive, and will find the history file from there. If not, throw an
 error. We can add more bells and whistles later.

Okay, I'll set the history-file problem aside for later consideration.

 As the patch stands, new walsender connections are refused when one is
 active already. What if the walsender connection is in a zombie state?
 For example, it's trying to send WAL to the slave, but the network
 connection is down, and the packets are going to a black hole. It will
 take a while for the TCP layer to declare the connection dead, and close
 the socket. During that time, you can't connect a new slave to the
 master, or the same slave using a better network connection.

 The most robust way to fix that is to support multiple walsenders. The
 zombie walsender can take its time to die, while the new walsender
 serves the new connection. You could tweak SO_TIMEOUTs and stuff, but
 even then the standby process could be in some weird hung state.

 And of course, when we get around to adding support for multiple slaves,
 we'll have to do that anyway. Better get it right to begin with.

Thanks for the detailed description! I was thinking that a new GUC
replication_timeout and some keepalive parameters would be enough to
help with such trouble. But I agree that supporting multiple walsenders
is the better solution, so I'll work on that.

 Even in synchronous replication, a backend should only have to wait when
 it commits. You would only see the difference with very large
 transactions that write more WAL than fits in wal_buffers, though, like
 data loading.

That's right.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-17 Thread Fujii Masao
Hi,

On Fri, Jul 17, 2009 at 2:09 AM, Greg Stark <gsst...@mit.edu> wrote:
 Only the sysadmin is actually going to know which makes more sense.
 Unless we start tying WAL parameters to the database size or
 something like that.

Agreed. And if a user doesn't want to take a new base backup because of
a large database, s/he can manually copy the archived WAL files to the
standby before starting it, and have it use them for its recovery.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Heikki Linnakangas
Dimitri Fontaine wrote:
 On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
 2. The primary should have no business reading back from the archive.
 The standby can read from the archive, as it can today.
 
 Sorry to insist, but I'm not sold on your consensus here, yet:
   http://archives.postgresql.org/pgsql-hackers/2009-07/msg00486.php
 
 There's a true need for the solution to be simple to install, and
 providing a side channel for the standby to go read the archives itself
 isn't it.

I think a better way to address that need is to provide a built-in
mechanism for the standby to request a base backup and have it sent over
the wire. That makes the initial setup very easy.

 Furthermore, the counter-argument against having the primary
 able to send data from the archives to some standby is that it should
 still work when the primary's dead; but as this is only done in the setup
 phase, I don't see that being able to continue preparing a not-yet-ready
 standby against a dead primary is buying us anything.

The situation arises also when the standby falls badly behind. A simple
solution to that is to add a switch in the master to specify "always
keep X MB of WAL in pg_xlog". The standby will then still find it in
pg_xlog, making it harder for a standby to fall so much behind that it
can't find the WAL it needs in the primary anymore. Tom suggested that
we can just give up and re-sync with a new base backup, but that really
requires built-in base backup capability, and is only practical for
small databases.
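
For concreteness, a rough sketch of how such a switch could plug into the
recycling logic (the GUC name replication_keep_mb and the function are made
up; the segment arithmetic assumes the 8.4-style split XLogRecPtr with
XLogSegSize and XLogSegsPerFile from xlog_internal.h):

/* Hypothetical GUC: megabytes of finished WAL to keep around in pg_xlog. */
int		replication_keep_mb = 0;

/*
 * Given the current insert position, compute the oldest log file/segment
 * that must be kept in pg_xlog for lagging standbys.  Anything older can
 * be recycled by the checkpoint as it is today.
 */
static void
KeepLogSegsSketch(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg)
{
	uint32		log = recptr.xlogid;
	uint32		seg = recptr.xrecoff / XLogSegSize;
	uint32		keep = replication_keep_mb / (XLogSegSize / (1024 * 1024));

	/* Walk backwards 'keep' segments, clamping at the very first one. */
	while (keep-- > 0)
	{
		if (seg > 0)
			seg--;
		else if (log > 0)
		{
			log--;
			seg = XLogSegsPerFile - 1;
		}
		else
			break;
	}
	*logId = log;
	*logSeg = seg;
}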

I think we should definitely have both those features, but it's not
urgent. The replication works without them, although it requires that you
set up traditional archiving as well.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Dimitri Fontaine
Hi,

Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes:
 I think a better way to address that need is to provide a built-in
 mechanism for the standby to request a base backup and have it sent over
 the wire. That makes the initial setup very easy.

Great idea :) 

So I'll reproduce the sketch I did in this other mail, adding the 'base'
state where the prerequisite base backup is handled, which will help
clarify the next points:

 0. base: slave asks the master for a base backup; at the end of this it
    reaches the base-lsn

 1. init: slave asks the master for the current LSN and starts streaming
    WAL

 2. setup: slave asks the master for the missing WALs from its base-lsn
    to the LSN it just got, and applies them all to reach the init LSN
    (this happens in parallel with 1.)

 3. catchup: slave has replayed the missing WALs and is now replaying the
    stream it received in parallel, which applies from the init LSN
    (just reached)

 4. sync: slave is applying the stream as it gets it, either as part of
    the master transaction or not, depending on the GUC settings
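
Transcribed into C for clarity, the states above could look like this (the
enum and its names are purely illustrative, nothing here exists in the
patch):

typedef enum StandbyState
{
	STANDBY_BASE,				/* 0. receiving the base backup, ends at base-lsn */
	STANDBY_INIT,				/* 1. asked master for current LSN, stream started */
	STANDBY_SETUP,				/* 2. replaying WAL from base-lsn up to the init LSN */
	STANDBY_CATCHUP,			/* 3. replaying the stream buffered during setup */
	STANDBY_SYNC				/* 4. applying the stream as it arrives */
} StandbyState;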

 The situation arises also when the standby falls badly behind. A simple
 solution to that is to add a switch in the master to specify "always
 keep X MB of WAL in pg_xlog". The standby will then still find it in
 pg_xlog, making it harder for a standby to fall so much behind that it
 can't find the WAL it needs in the primary anymore. Tom suggested that
 we can just give up and re-sync with a new base backup, but that really
 requires built-in base backup capability, and is only practical for
 small databases.

I think that when the standby is back in business after a connection
glitch (or any other transient error), its current internal state is
still 'sync', and walreceiver asks for the next LSN (RedoPTR?). Now, two
cases are possible:

 a. primary still has it handy, so the standby is still in sync but
    lagging behind (and the primary knows by how much)

 b. primary is not able to provide the requested WAL entry, so the slave
    is back to the 'setup' state, with base-lsn the point reached just
    before losing sync (the one walreceiver just asked for).

Now, a standby in 'setup' state isn't ready (yet), and for example
synchronous replication won't be possible in this state: we can't ask
the primary to refuse to COMMIT any transaction (by holding it, e.g.)
while a standby hasn't reached the 'sync' state.

The way you're talking about the issue makes me think there's a mix-up
between how to handle a lagging standby and an out-of-sync standby. For
clarity, I think we should have very distinct states and responses. And
yes, as Tom and you keep saying, a synced standby by definition should
not need any access to its primary's archives. So if it does, it's no
longer in sync.

 I think we should definitely have both those features, but it's not
 urgent. The replication works without them, although it requires that you
 set up traditional archiving as well.

Agreed, it's not essential for the feature as far as hackers are
concerned.

Regards,
-- 
dim



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Fujii Masao
Hi,

On Thu, Jul 16, 2009 at 6:03 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 I don't think there's much point assigning more reviewers to Synch Rep
 at this point. I believe we have consensus on four major changes:

Thanks for clarifying the issues! Okay, I'll rework the patch.

 1. Change the way synchronization is done when standby connects to
 primary. After authentication, standby should send a message to primary,
 stating the begin point (where begin is an XLogRecPtr, not a WAL
 segment name). Primary starts streaming WAL starting from that point,
 and keeps streaming forever. pg_read_xlogfile() needs to be removed.

I assume that begin should indicate the location of the last valid record.
In other words, at first the standby tries to recover by using only the
XLOG files which exist in its archive or pg_xlog. When it has reached the
last valid record, it requests from the primary the XLOG records which
follow begin. Is my understanding OK?
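
In code, the handshake would boil down to something like this hedged
sketch (the message name and layout are invented; XLogRecPtr as in the
current tree):

/*
 * After authentication, the standby sends the primary the position of
 * its last valid record; the primary then streams all WAL following it.
 */
typedef struct StartStreamingRequest
{
	XLogRecPtr	begin;			/* last valid record replayed locally */
} StartStreamingRequest;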

http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
As I described before, the XLOG file which the standby creates should be
recoverable. So, when begin indicates the middle of an XLOG file, the
primary should start sending the records from the head of the file that
includes begin. Is this OK?

Or should the primary start from begin? In this case, since we can
expect that the incomplete file including begin also exists on the
standby, the records following begin need to be appended to it.
And, if that incomplete file is one restored from the archive, it would
need to be renamed from its temporary name before being appended to.

A timeline/backup history file is also required for recovery, but it's not
found in the standby. So they need to be shipped from the primary, and
this capability is provided by pg_read_xlogfile(). If we remove the
function, how should we transfer those history files? Is a function
similar to pg_read_xlogfile(), where the filename needs to be specified,
still necessary?

 2. The primary should have no business reading back from the archive.
 The standby can read from the archive, as it can today.

In this case, a backup history file should be stored in pg_xlog for a
while, because it might be requested by the standby. So far,
pg_start_backup() has removed the previous backup history file right
away. Should we introduce a new GUC parameter to determine how many
backup history files should be kept in pg_xlog?

CHECKPOINT should not recycle the XLOG files following the file which
is being requested by the standby at that moment. So we need to tweak
the recycling policy.

 3. Need to support multiple WALSenders. While multiple slave support
 isn't 1st priority right now, it's not acceptable that a new WALSender
 can't connect while one is active already. That can cause trouble in
 case of network problems etc.

Sorry, I didn't get your point. You say that multiple slave support isn't
the 1st priority, so why is a multiple-walsender mechanism necessary?
Can you describe the problem cases in more detail?

 4. It is not acceptable that normal backends have to wait for walsender
 to send data.

Umm... this is true in the asynchronous replication case, and also while
the standby is catching up with the primary. But after those servers get
into synchronization, the backend should wait for walsender to send the
data (and also for walreceiver to write/fsync it) before returning
success of COMMIT to the client. Is my understanding right?

In the current Synch Rep, the backend basically doesn't wait for
walsender in asynchronous mode. Only when wal_buffers is filled with
unsent data does the backend wait for walsender to send data, because
there is no room to insert new data. Are you suggesting that only this
problem case should be solved?

 That means that connecting a standby behind a slow
 connection to the primary can grind the primary to a halt.

This is the fate of *synchronous* replication, isn't it? If a user wants
to get around such a problem, asynchronous mode should be chosen, I think.

 walsender
 needs to be able to read data from disk, not just from shared memory. (I
 raised this back in December
 http://archives.postgresql.org/message-id/495106fa.1050...@enterprisedb.com)

OK, I'll try it.

 As a hint, I think you'll find it a lot easier if you implement only
 asynchronous replication at first. That reduces the amount of
 inter-process communication a lot. You can then add synchronous
 capability in a later commitfest. I would also suggest that for point 4,
 you implement WAL sender so that it *only* reads from disk at first, and
 only add the capability to send from wal_buffers later on, and only if
 performance testing shows that it's needed.

Sounds good. I'll advance development in stages as you suggested.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Thu, Jul 16, 2009 at 6:03 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 1. Change the way synchronization is done when standby connects to
 primary. After authentication, standby should send a message to primary,
 stating the begin point (where begin is an XLogRecPtr, not a WAL
 segment name). Primary starts streaming WAL starting from that point,
 and keeps streaming forever. pg_read_xlogfile() needs to be removed.
 
 I assume that begin should indicate the location of the last valid record.
 In other words, at first the standby tries to recover by using only the
 XLOG files which exist in its archive or pg_xlog. When it has reached the
 last valid record, it requests from the primary the XLOG records which
 follow begin. Is my understanding OK?

Yes.

 http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
 As I described before, the XLOG file which the standby creates should be
 recoverable. So, when begin indicates the middle of an XLOG file, the
 primary should start sending the records from the head of the file that
 includes begin. Is this OK?
 
 Or should the primary start from begin? In this case, since we can
 expect that the incomplete file including begin also exists on the
 standby, the records following begin need to be appended to it.

I would expect the standby to append to the partial XLOG file.

 And, if that incomplete file is one restored from the archive, it would
 need to be renamed from its temporary name before being appended to.

The archive should not normally contain partial XLOG files; that happens
only if you manually copy one there after the primary has crashed. So I
don't think that's something we need to support.

 A timeline/backup history file is also required for recovery, but it's not
 found in the standby. So they need to be shipped from the primary, and
 this capability is provided by pg_read_xlogfile(). If we remove the
 function, how should we transfer those history files? Is a function
 similar to pg_read_xlogfile(), where the filename needs to be specified,
 still necessary?

Hmm. You only need the timeline history file if the base backup was
taken in an earlier timeline. That situation would only arise if you
(manually) take a base backup, restore to a server (which creates a new
timeline), and then create a slave against that server. At least in the
1st phase, I think we can assume that the standby has access to the same
archive, and will find the history file from there. If not, throw an
error. We can add more bells and whistles later.

 CHECKPOINT should not recycle the XLOG files following the file which
 is being requested by the standby at that moment. So we need to tweak
 the recycling policy.

Yep.

 3. Need to support multiple WALSenders. While multiple slave support
 isn't 1st priority right now, it's not acceptable that a new WALSender
 can't connect while one is active already. That can cause trouble in
 case of network problems etc.
 
 Sorry, I didn't get your point. You say that multiple slave support isn't
 the 1st priority, so why is a multiple-walsender mechanism necessary?
 Can you describe the problem cases in more detail?

As the patch stands, new walsender connections are refused when one is
active already. What if the walsender connection is in a zombie state?
For example, it's trying to send WAL to the slave, but the network
connection is down, and the packets are going to a black hole. It will
take a while for the TCP layer to declare the connection dead, and close
the socket. During that time, you can't connect a new slave to the
master, or the same slave using a better network connection.

The most robust way to fix that is to support multiple walsenders. The
zombie walsender can take its time to die, while the new walsender
serves the new connection. You could tweak SO_TIMEOUTs and stuff, but
even then the standby process could be in some weird hung state.

And of course, when we get around to adding support for multiple slaves,
we'll have to do that anyway. Better get it right to begin with.
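
A minimal sketch of what multiple walsenders could look like, assuming a
small shared-memory slot array (every name below is hypothetical, and
locking is omitted for brevity):

#include <sys/types.h>			/* pid_t */

#define MAX_WAL_SENDERS 4

typedef struct WalSndSlot
{
	pid_t		pid;			/* 0 = slot free */
	XLogRecPtr	sentPtr;		/* how far this walsender has streamed */
} WalSndSlot;

typedef struct WalSndCtlData
{
	WalSndSlot	slots[MAX_WAL_SENDERS];
} WalSndCtlData;

extern WalSndCtlData *WalSndCtl;	/* lives in shared memory */

/*
 * Claim any free slot on connect.  A zombie walsender keeps its slot
 * until TCP declares its connection dead, but it no longer blocks a
 * new standby from connecting through another slot.
 */
static WalSndSlot *
WalSndClaimSlot(pid_t mypid)
{
	int			i;

	for (i = 0; i < MAX_WAL_SENDERS; i++)
	{
		if (WalSndCtl->slots[i].pid == 0)
		{
			WalSndCtl->slots[i].pid = mypid;
			return &WalSndCtl->slots[i];
		}
	}
	return NULL;				/* all slots busy: refuse this connection */
}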

 4. It is not acceptable that normal backends have to wait for walsender
 to send data.
 
 Umm... this is true in the asynchronous replication case, and also while
 the standby is catching up with the primary. But after those servers get
 into synchronization, the backend should wait for walsender to send the
 data (and also for walreceiver to write/fsync it) before returning
 success of COMMIT to the client. Is my understanding right?

Even in synchronous replication, a backend should only have to wait when
it commits. You would only see the difference with very large
transactions that write more WAL than fits in wal_buffers, though, like
data loading.
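
In code, that rule could reduce to something like the following sketch
(XLogFlush is the existing routine; sync_replication_enabled and
WaitForWALReplication are invented names for whatever primitive the patch
ends up with):

extern bool sync_replication_enabled;			/* hypothetical flag */
extern void WaitForWALReplication(XLogRecPtr ptr);	/* hypothetical */

static void
CommitRecordFlushSketch(XLogRecPtr commitRecPtr)
{
	/* Local durability first, exactly as today. */
	XLogFlush(commitRecPtr);

	/*
	 * Only at commit, and only in synchronous mode, does the backend
	 * block until the standby has received (and, depending on the mode,
	 * written or fsynced) the commit record.
	 */
	if (sync_replication_enabled)
		WaitForWALReplication(commitRecPtr);
}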

 In the current Synch Rep, the backend basically doesn't wait for
 walsender in asynchronous mode. Only when wal_buffers is filled with
 unsent data does the backend wait for walsender to send data, because
 there is no room to insert new data. Are you suggesting that only this
 problem case should be solved?

Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Rick Gigger

On Jul 16, 2009, at 12:07 AM, Heikki Linnakangas wrote:

 Dimitri Fontaine wrote:
 On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
 Furthermore, the counter-argument against having the primary
 able to send data from the archives to some standby is that it should
 still work when the primary's dead; but as this is only done in the setup
 phase, I don't see that being able to continue preparing a not-yet-ready
 standby against a dead primary is buying us anything.

 The situation arises also when the standby falls badly behind. A simple
 solution to that is to add a switch in the master to specify "always
 keep X MB of WAL in pg_xlog". The standby will then still find it in
 pg_xlog, making it harder for a standby to fall so much behind that it
 can't find the WAL it needs in the primary anymore. Tom suggested that
 we can just give up and re-sync with a new base backup, but that really
 requires built-in base backup capability, and is only practical for
 small databases.

If you use an rsync-like algorithm for doing the base backups, wouldn't
that increase the size of the database for which it would still be
practical to just re-sync? Couldn't you in fact sync a very large
database if the amount of actual change in the files was a small
percentage of the total size?



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Heikki Linnakangas
Rick Gigger wrote:
 If you use an rsync-like algorithm for doing the base backups, wouldn't
 that increase the size of the database for which it would still be
 practical to just re-sync? Couldn't you in fact sync a very large
 database if the amount of actual change in the files was a small
 percentage of the total size?

It would certainly help to reduce the network traffic, though you'd
still have to scan all the data to see what has changed.
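
A toy illustration of that trade-off, in plain C (nothing
PostgreSQL-specific; rsync proper uses a rolling checksum plus MD4, the
checksum here is just a stand-in):

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

static uint32_t
block_checksum(const unsigned char *buf, size_t len)
{
	uint32_t	sum = 0;
	size_t		i;

	for (i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

/*
 * Compare a local file against per-block checksums received from the
 * other side; only differing blocks would be shipped.  Note that every
 * block is still read and checksummed locally: that is the "scan all
 * the data" cost, even when almost nothing has changed.
 */
static long
count_changed_blocks(FILE *fp, const uint32_t *remote_sums, long nblocks)
{
	unsigned char buf[BLCKSZ];
	long		blkno, changed = 0;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		size_t		n = fread(buf, 1, BLCKSZ, fp);

		if (n == 0)
			break;
		if (block_checksum(buf, n) != remote_sums[blkno])
			changed++;
	}
	return changed;
}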

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Greg Stark
On Thu, Jul 16, 2009 at 4:41 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 Rick Gigger wrote:
 If you use an rsync-like algorithm for doing the base backups, wouldn't
 that increase the size of the database for which it would still be
 practical to just re-sync? Couldn't you in fact sync a very large
 database if the amount of actual change in the files was a small
 percentage of the total size?

 It would certainly help to reduce the network traffic, though you'd
 still have to scan all the data to see what has changed.

The fundamental problem with pushing users to start over with a new
base backup is that there's no relationship between the size of the
WAL and the size of the database.

You can plausibly have a system with extremely high transaction rate
generating WAL very quickly, but where the whole database fits in a
few hundred megabytes. In that case you could be behind by only a few
minutes and have it be faster to take a new base backup.

Or you could have a petabyte database which is rarely updated. In
which case it might be faster to apply weeks' worth of logs than to
try to take a base backup.

Only the sysadmin is actually going to know which makes more sense.
Unless we start tying WAL parameters to the database size or
something like that.

-- 
greg
http://mit.edu/~gsstark/resume.pdf



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Rick Gigger

On Jul 16, 2009, at 11:09 AM, Greg Stark wrote:

 On Thu, Jul 16, 2009 at 4:41 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 Rick Gigger wrote:
 If you use an rsync-like algorithm for doing the base backups, wouldn't
 that increase the size of the database for which it would still be
 practical to just re-sync? Couldn't you in fact sync a very large
 database if the amount of actual change in the files was a small
 percentage of the total size?

 It would certainly help to reduce the network traffic, though you'd
 still have to scan all the data to see what has changed.

 The fundamental problem with pushing users to start over with a new
 base backup is that there's no relationship between the size of the
 WAL and the size of the database.

 You can plausibly have a system with extremely high transaction rate
 generating WAL very quickly, but where the whole database fits in a
 few hundred megabytes. In that case you could be behind by only a few
 minutes and have it be faster to take a new base backup.

 Or you could have a petabyte database which is rarely updated. In
 which case it might be faster to apply weeks' worth of logs than to
 try to take a base backup.

 Only the sysadmin is actually going to know which makes more sense.
 Unless we start tying WAL parameters to the database size or
 something like that.

Once again, wouldn't an rsync-like algorithm help here? Couldn't you
have the default be to just create a new base backup for them, but
then allow you to specify an existing base backup if you've already
got one?




Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-16 Thread Robert Haas
On Thu, Jul 16, 2009 at 1:09 PM, Greg Stark <gsst...@mit.edu> wrote:
 On Thu, Jul 16, 2009 at 4:41 PM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 Rick Gigger wrote:
 If you use an rsync-like algorithm for doing the base backups, wouldn't
 that increase the size of the database for which it would still be
 practical to just re-sync? Couldn't you in fact sync a very large
 database if the amount of actual change in the files was a small
 percentage of the total size?

 It would certainly help to reduce the network traffic, though you'd
 still have to scan all the data to see what has changed.

 The fundamental problem with pushing users to start over with a new
 base backup is that there's no relationship between the size of the
 WAL and the size of the database.

 You can plausibly have a system with extremely high transaction rate
 generating WAL very quickly, but where the whole database fits in a
 few hundred megabytes. In that case you could be behind by only a few
 minutes and have it be faster to take a new base backup.

 Or you could have a petabyte database which is rarely updated. In
 which case it might be faster to apply weeks' worth of logs than to
 try to take a base backup.

 Only the sysadmin is actually going to know which makes more sense.
 Unless we start tying WAL parameters to the database size or
 something like that.

I think we need a way for the master to know who its slaves are and to
keep any given bit of WAL available until all slaves have successfully
read it, just as we keep each WAL file until we have successfully copied
it to the archive.  Otherwise, there's no way to be sure that a
connection break won't result in the need for a new base backup.  (In
a way, a slave is very similar to an additional archive.)
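
A sketch of that bookkeeping (the per-slave registry is hypothetical;
XLByteLT is the usual XLogRecPtr comparison macro):

typedef struct SlaveProgress
{
	bool		active;
	XLogRecPtr	confirmedUpto;	/* slave has safely read WAL up to here */
} SlaveProgress;

extern SlaveProgress slaveProgress[];	/* hypothetical per-slave registry */
extern int	numSlaves;

/*
 * WAL strictly older than the returned pointer is safe to recycle;
 * start from the position the archiver has already secured.
 */
static XLogRecPtr
OldestNeededLSN(XLogRecPtr archivedUpto)
{
	XLogRecPtr	oldest = archivedUpto;
	int			i;

	for (i = 0; i < numSlaves; i++)
	{
		if (slaveProgress[i].active &&
			XLByteLT(slaveProgress[i].confirmedUpto, oldest))
			oldest = slaveProgress[i].confirmedUpto;
	}
	return oldest;
}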

...Robert



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-15 Thread Robert Haas
On Wed, Jul 15, 2009 at 12:32 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
 If the above is OK, should I update the patch ASAP? or
 suspend that update until many other comments arrive?
 I'm concerned that frequent small updating interferes with
 a review.

I decided (perhaps foolishly) to assign reviewers for the two smaller
patches that you extracted from this one first, and to hold off on
assigning a reviewer for the main patch until those reviews were
completed:

http://archives.postgresql.org/message-id/3f0b79eb0907022341m1d36a841x19c3e2a5a6906...@mail.gmail.com
http://archives.postgresql.org/message-id/3f0b79eb0907030037g515f3337o9092279c6234...@mail.gmail.com

So I think you should update ASAP in this case.  As soon as we get
some reviewers freed up from the initial reviewing round, I will
assign one or more reviewers to the main Sync Rep patch.

...Robert



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-15 Thread Heikki Linnakangas
Fujii Masao wrote:
 Hi,
 
 On Wed, Jul 15, 2009 at 8:15 PM, Robert Haas <robertmh...@gmail.com> wrote:
 So I think you should update ASAP in this case.
 
 I updated the patch as described in
 http://archives.postgresql.org/pgsql-hackers/2009-07/msg00865.php
 
 All the other parts are still the same.
 
  As soon as we get
 some reviewers freed up from the initial reviewing round, I will
 assign one or more reviewers to the main Sync Rep patch.
 
 Thanks!

I don't think there's much point assigning more reviewers to Synch Rep
at this point. I believe we have consensus on four major changes:

1. Change the way synchronization is done when standby connects to
primary. After authentication, standby should send a message to primary,
stating the begin point (where begin is an XLogRecPtr, not a WAL
segment name). Primary starts streaming WAL starting from that point,
and keeps streaming forever. pg_read_xlogfile() needs to be removed.

2. The primary should have no business reading back from the archive.
The standby can read from the archive, as it can today.

3. Need to support multiple WALSenders. While multiple slave support
isn't 1st priority right now, it's not acceptable that a new WALSender
can't connect while one is active already. That can cause trouble in
case of network problems etc.

4. It is not acceptable that normal backends have to wait for walsender
to send data. That means that connecting a standby behind a slow
connection to the primary can grind the primary to a halt. walsender
needs to be able to read data from disk, not just from shared memory. (I
raised this back in December
http://archives.postgresql.org/message-id/495106fa.1050...@enterprisedb.com)

Those 4 things are big enough changes that I don't think there's much
left to review that won't be affected by those changes.

As a hint, I think you'll find it a lot easier if you implement only
asynchronous replication at first. That reduces the amount of
inter-process communication a lot. You can then add synchronous
capability in a later commitfest. I would also suggest that for point 4,
you implement WAL sender so that it *only* reads from disk at first, and
only add the capability to send from wal_buffers later on, and only if
performance testing shows that it's needed.

I'll move this to the "returned with feedback" section, but if you get
those things done quickly we can still give it another round of review
in this commitfest.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-15 Thread Dimitri Fontaine

On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:

2. The primary should have no business reading back from the archive.
The standby can read from the archive, as it can today.


Sorry to insist, but I'm not sold on your consensus here, yet:
  http://archives.postgresql.org/pgsql-hackers/2009-07/msg00486.php

There's a true need for the solution to be simple to install, and
providing a side channel for the standby to go read the archives
itself isn't it. Furthermore, the counter-argument against having the
primary able to send data from the archives to some standby is that it
should still work when the primary's dead; but as this is only done in
the setup phase, I don't see that being able to continue preparing a
not-yet-ready standby against a dead primary is buying us anything.


Now, I tried proposing to implement an archive server as a postmaster
child, to have a reference implementation of an archive command for
basic cases and to provide the ability to give data from the archive
to slave(s). But this is getting too much into the implementation
details for my current understanding of them :)


Regards,
--
dim


Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-14 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Fri, Jul 3, 2009 at 1:32 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
 This patch no longer applies cleanly.  Can you rebase and resubmit it
 for the upcoming CommitFest?  It might also be good to go through and
 clean up the various places where you have trailing whitespace and/or
 spaces preceding tabs.
 Sure. I'll resubmit the patch after fixing some bugs and finishing
 the documents.
 
 Here is the updated version of the Synch Rep patch. I adjusted the patch
 against CVS HEAD, fixed some bugs and updated the documents.
 
 The attached tarball contains some patches which were split up to be
 reviewed easily. A description of each patch, a brief procedure to
 set up Synch Rep and a functional overview of it are in the wiki.
 http://wiki.postgresql.org/wiki/NTT's_Development_Projects
 
 If you notice anything, please feel free to comment!

Here's one little thing in addition to all the stuff already discussed:

The only caller that doesn't pass XLogSyncReplication as the new 'mode'
argument to XLogFlush is this one in CreateCheckPoint:

***
*** 6569,6575 
XLOG_CHECKPOINT_ONLINE,
rdata);

!   XLogFlush(recptr);

/*
 * We mustn't write any new WAL after a shutdown checkpoint, or it will
--- 7667,7677 
XLOG_CHECKPOINT_ONLINE,
rdata);

!   /*
!* Don't shutdown until all outstanding xlog records are replicated and
!* fsynced on the standby, regardless of synchronization mode.
!*/
!   XLogFlush(recptr, shutdown ? REPLICATION_MODE_FSYNC :
XLogSyncReplication);

/*
 * We mustn't write any new WAL after a shutdown checkpoint, or it will

If that's the only such caller, let's introduce a new function for that
and keep the XLogFlush() API unchanged.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Synch Rep for CommitFest 2009-07

2009-07-14 Thread Fujii Masao
Hi,

On Wed, Jul 15, 2009 at 3:56 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
 Here's one little thing in addition to all the stuff already discussed:

Thanks for the comment!

 If that's the only such caller, let's introduce a new function for that
 and keep the XLogFlush() api unchanged.

OK. How about the following function?

--
/*
 * Ensure that shutdown-related XLOG data through the given position is
 * flushed to the local disk, and also flushed to disk on the standby
 * if replication is in progress.
 */
void
XLogShutdownFlush(XLogRecPtr record)
{
  int save_mode = XLogSyncReplication;

  XLogSyncReplication = REPLICATION_MODE_FSYNC;
  XLogFlush(record);

  XLogSyncReplication = save_mode;
}
--

In the shutdown checkpoint case, CreateCheckPoint calls
XLogShutdownFlush; otherwise it calls XLogFlush. And XLogFlush uses
XLogSyncReplication directly instead of the obsolete 'mode' argument.
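
The CreateCheckPoint call site would then reduce to something like this
(a sketch of the shape only):

	if (shutdown)
		XLogShutdownFlush(recptr);	/* forces REPLICATION_MODE_FSYNC */
	else
		XLogFlush(recptr);			/* honours XLogSyncReplication */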

If the above is OK, should I update the patch ASAP? Or should I
suspend that update until more comments arrive? I'm concerned that
frequent small updates would interfere with review.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
