Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-08 Thread Fujii Masao
On Thu, Jul 8, 2010 at 7:55 AM, Robert Haas robertmh...@gmail.com wrote:
 What was the final decision on behavior if fsync=off?

 I'm not sure we made any decision, per se, but if you use fsync=off in
 combination with SR and experience an unexpected crash-and-reboot on
 the master, you will be sad.

True. But even without SR, an unexpected crash-and-reboot on the master
with fsync=off would make you sad ;) So I'm not sure whether we really
need to take action for the case of SR + fsync=off.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 If it's unsafe to send written but unflushed WAL to the standby, then
 for the same reasons we can't send unwritten WAL either.
[...]
 Having said that, I do think we urgently need some high-level design
 discussion on how sync rep is actually going to handle this issue

Stop me if I'm all wrong already, but I thought we said that we should
handle this case by decoupling what we can send to the standby and what
it can apply. We could do this by sending the current WAL fsync'ed
position on the master in the WAL sender protocol, either in the WAL
itself or as out-of-band messages, I guess.

Now, this can be made safe; how to make it fast (low-latency) is yet to
be addressed.

Regards,
-- 
dim



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread Robert Haas
On Wed, Jul 7, 2010 at 4:40 AM, Dimitri Fontaine dfonta...@hi-media.com wrote:
 Stop me if I'm all wrong already, but I thought we said that we should
 handle this case by decoupling what we can send to the standby and what
 it can apply. We could do this by sending the current WAL fsync'ed
 position on the master in the WAL sender protocol, either in the WAL
 itself or as out-of-band messages, I guess.

 Now, this can be made safe; how to make it fast (low-latency) is yet to
 be addressed.

Yeah, that's the trick, isn't it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread Tom Lane
Dimitri Fontaine dfonta...@hi-media.com writes:
 Stop me if I'm all wrong already, but I thought we said that we should
 handle this case by decoupling what we can send to the standby and what
 it can apply.

What's the point of that?  It won't make the standby apply any faster.
What it will do is make the protocol more complicated, hence slower
(more messages) and more at risk of bugs.

regards, tom lane



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Dimitri Fontaine dfonta...@hi-media.com writes:
 Stop me if I'm all wrong already, but I thought we said that we should
 handle this case by decoupling what we can send to the standby and what
 it can apply.

 What's the point of that?  It won't make the standby apply any faster.

True, but it allows sending the WAL content before its fsync has been
acknowledged.

Regards.
-- 
dim



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread Josh Berkus
On 7/6/10 4:44 PM, Robert Haas wrote:
 To recap the previous discussion on this thread, we ended up changing
 the behavior of 9.0 so that it only sends WAL which has been written
 to the OS *and flushed*, because sending unflushed WAL to the standby
 is unsafe.  The standby can get ahead of the master while still
 believing that the databases are in sync, due to the fact that after
 an SR reconnect we rewind to the start of the current WAL segment.
 This results in a silently corrupt standby database.

What was the final decision on behavior if fsync=off?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread Robert Haas
On Wed, Jul 7, 2010 at 6:44 PM, Josh Berkus j...@agliodbs.com wrote:
 On 7/6/10 4:44 PM, Robert Haas wrote:
 To recap the previous discussion on this thread, we ended up changing
 the behavior of 9.0 so that it only sends WAL which has been written
 to the OS *and flushed*, because sending unflushed WAL to the standby
 is unsafe.  The standby can get ahead of the master while still
 believing that the databases are in sync, due to the fact that after
 an SR reconnect we rewind to the start of the current WAL segment.
 This results in a silently corrupt standby database.

 What was the final decision on behavior if fsync=off?

I'm not sure we made any decision, per se, but if you use fsync=off in
combination with SR and experience an unexpected crash-and-reboot on
the master, you will be sad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-07 Thread marcin mank
 Having said that, I do think we urgently need some high-level design
 discussion on how sync rep is actually going to handle this issue
 (perhaps on a new thread).  If we can't resolve this issue, sync rep
 is going to be really slow; but there are no easy solutions to this
 problem in sight, so if we want to have sync rep for 9.1 we'd better
 agree on one of the difficult solutions soon so that work can begin.


When standbys reconnect after a crash, they could send the
ahead-of-the-master WAL to the master. This is an alternative to
choosing the most-ahead standby as the new master, as suggested
elsewhere.

Greetings
Marcin Mańk



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-06 Thread Robert Haas
On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 In 9.0, walsender reads WAL always from the disk and sends it to the standby.
 That is, we cannot send WAL until it has been written (and flushed) to the disk.
 This degrades the performance of synchronous replication very much since a
 transaction commit must wait for the WAL write time *plus* the replication time.

 The attached patch enables walsender to read data from WAL buffers in addition
 to the disk. Since we can write and send WAL simultaneously, in synchronous
 replication, a transaction commit has only to wait for either of them. So the
 performance would significantly increase.

To recap the previous discussion on this thread, we ended up changing
the behavior of 9.0 so that it only sends WAL which has been written
to the OS *and flushed*, because sending unflushed WAL to the standby
is unsafe.  The standby can get ahead of the master while still
believing that the databases are in sync, due to the fact that after
an SR reconnect we rewind to the start of the current WAL segment.
This results in a silently corrupt standby database.
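
To make the rewind concrete, here is a minimal sketch in Python (not
PostgreSQL code). It treats an LSN as a single byte offset and assumes
the default 16MB segment size; the function names and the positions used
are made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16MB)


def segment_start(lsn):
    """Start of the WAL segment that contains byte position lsn."""
    return lsn - (lsn % WAL_SEGMENT_SIZE)


def standby_is_ahead(master_flushed_lsn, standby_received_lsn):
    """True if the standby holds WAL that the master could lose in an OS crash."""
    return standby_received_lsn > master_flushed_lsn


# Hypothetical positions: the master fsync'd 1000 bytes into the second
# segment, but had already streamed 5000 bytes of it to the standby.
master_flushed = WAL_SEGMENT_SIZE + 1000
standby_received = WAL_SEGMENT_SIZE + 5000

# After a reconnect, streaming restarts at the beginning of the segment.
restart_point = segment_start(standby_received)
print(standby_is_ahead(master_flushed, standby_received))
print(restart_point)
```

In this example the standby restarts streaming at the segment boundary,
yet it has already seen 4000 bytes beyond the master's flush position,
and those bytes may differ from what the master writes there after its
own recovery.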

If it's unsafe to send written but unflushed WAL to the standby, then
for the same reasons we can't send unwritten WAL either.  Therefore, I
believe that this entire patch in its current form is a nonstarter and
we should mark it Rejected in the CF app so that reviewers don't
unnecessarily spend time on it.

Having said that, I do think we urgently need some high-level design
discussion on how sync rep is actually going to handle this issue
(perhaps on a new thread).  If we can't resolve this issue, sync rep
is going to be really slow; but there are no easy solutions to this
problem in sight, so if we want to have sync rep for 9.1 we'd better
agree on one of the difficult solutions soon so that work can begin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-07-01 Thread Greg Stark
On Wed, Jun 30, 2010 at 12:37 PM, Robert Haas robertmh...@gmail.com wrote:
 One thought that occurred to me is that if the master and standby were
 more tightly coupled, you could recover after a crash by making the
 one with the further-advanced WAL position the master, and the other
 one the standby.  That would get around this problem, though at the
 cost of considerable additional complexity.  But then if one of the
 servers comes up and can't talk to the other, you need some mechanism
 for preventing split-brain syndrome.

Users should be free to build infrastructure to allow that. But we
can't just switch ourselves -- we don't know what other pieces of
their systems need to be updated when the master changes.

We also need to stop thinking in terms of one master and one slave.
They could have dozens of slaves and in case of failover would want to
pick the slave with the most recent WAL position. The way I picture
that happening they're monitoring all their slaves in some monitoring
tool and use that data to pick the new master. Some external tool
picks the new master and tells that host, all the other slaves, and
all the rest of their infrastructure where to find the new master
and does whatever is necessary to restart or reload configurations.
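
As a rough illustration of what such an external tool might do, here is
a Python sketch that picks the most advanced standby. It assumes the
tool has already collected each standby's WAL position as text in the
usual "hi/lo" hexadecimal form; the host names and positions are
invented, and how the positions are collected is left out entirely.

```python
def parse_lsn(text):
    """Turn a 'hi/lo' hexadecimal WAL position into a comparable integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_new_master(positions):
    """Return the host whose reported WAL position is furthest along."""
    return max(positions, key=lambda host: parse_lsn(positions[host]))


# Hypothetical data gathered by a monitoring tool.
positions = {
    "standby-a": "0/2000060",
    "standby-b": "0/3000000",
    "standby-c": "0/2FFFFFF",
}
print(pick_new_master(positions))
```

Comparing the parsed integers rather than the raw strings matters,
because the text form does not sort correctly as a string.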

The question I think is what interfaces do we need in Postgres to make
this easy. The monitoring tool needs a way to find the current WAL
position from the slaves even when the master is down. That means
potentially needing to start up the slaves in read-only mode with no
master at all. It also means making it easy for an external tool to
switch a node from slave to primary and change a slave's master. And
it also means a slave should be able to change master and pick up
where it left off easily. I'm not sure what the recommended interfaces
for these operations would be currently for an external tool.


-- 
greg



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-30 Thread Fujii Masao
On Wed, Jun 30, 2010 at 11:26 AM, Robert Haas robertmh...@gmail.com wrote:
 Maybe.  As Heikki pointed out upthread, the standby can't even write
 the WAL back to the OS until it's been fsync'd on the master
 without risking the problem under discussion.

If we change the startup process so that it doesn't go ahead of the
master's fsync location even after the walreceiver is terminated,
we would have no need to worry about that risk. For further robustness,
the walreceiver might be able to zero the WAL records which have not
been fsync'd on the master yet, when being terminated.

But, if the standby crashes after the master crashes, restart of the
standby might replay that non-fsync'd WAL wrongly because it cannot
remember the master's fsync location. In this case, if we promote the
standby to the master, we still don't have to worry about that risk.
But instead of performing a failover, if we restart the master and
make the standby connect to the master again, the database on the standby
would get corrupted.

For now, I don't have a good idea for avoiding the database corruption
caused by the double failure (crash of both master and standby)...

 So we can stream the
 WAL from master to standby as long as the standby just buffers it in
 memory (or somewhere other than the usual location in pg_xlog).

Yeah, I was just thinking the same thing. But the problem is that the
buffer size might become too big (might be bigger than 16MB). For
example, synchronous_commit = off and wal_writer_delay = 1ms on
the master would delay the fsync significantly and increase the buffer
size on the standby.

 Before we get too busy frobnicating this gonkulator, I'd like to see a
 little more discussion of what kind of performance people are
 expecting from sync rep.  Sounds to me like the best we can expect
 here is, on every commit: (a) wait for master fsync to complete, (b)
 send message to standby, (c) wait for reply from standby
 indicating that fsync is complete on standby.  Even assuming that the
 network overhead is minimal, that halves the commit rate.  Are the
 people who want sync rep OK with that?  Is there any way to do better?

(c) would depend on the synchronization mode the user chooses:

  #1 Wait for WAL to be received by the standby
  #2 Wait for WAL to be received and flushed by the standby
  #3 Wait for WAL to be received, flushed and replayed by the standby

(a) would depend on synchronous_commit. Personally I'm interested in
disabling synchronous_commit on the master and choosing #1 as the sync
mode. Though this may be a very optimistic configuration :)

The point for performance of sync rep is to parallelize (a) and (b)+(c),
I think. If they are performed in a serial manner, the performance
overhead on the master would become high.
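
The difference between the serial and the parallel case can be put into
numbers with a small Python sketch. All timings below are invented and
only show the shape of the arithmetic; the mode numbers follow the list
above, and the function names are hypothetical.

```python
# Made-up timings in milliseconds.
MASTER_FSYNC_MS = 5          # (a) local fsync on the master
SEND_MS = 1                  # (b) send WAL to the standby
STANDBY_WAIT_MS = {          # (c) wait for the standby's reply, per mode
    1: 1,   # #1 received by the standby
    2: 6,   # #2 received and flushed
    3: 9,   # #3 received, flushed and replayed
}


def serial_latency(mode):
    """Commit waits for (a), then (b), then (c)."""
    return MASTER_FSYNC_MS + SEND_MS + STANDBY_WAIT_MS[mode]


def parallel_latency(mode):
    """(a) overlaps with (b)+(c); commit waits for the slower path."""
    return max(MASTER_FSYNC_MS, SEND_MS + STANDBY_WAIT_MS[mode])


for mode in (1, 2, 3):
    print(mode, serial_latency(mode), parallel_latency(mode))
```

This only models latency. Whether (a) and (b)+(c) can safely overlap is
the open question of this thread.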

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-30 Thread Robert Haas
On Wed, Jun 30, 2010 at 5:36 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Before we get too busy frobnicating this gonkulator, I'd like to see a
 little more discussion of what kind of performance people are
 expecting from sync rep.  Sounds to me like the best we can expect
 here is, on every commit: (a) wait for master fsync to complete, (b)
 send message to standby, (c) wait for reply from standby
 indicating that fsync is complete on standby.  Even assuming that the
 network overhead is minimal, that halves the commit rate.  Are the
 people who want sync rep OK with that?  Is there any way to do better?

 (c) would depend on the synchronization mode the user chooses:

  #1 Wait for WAL to be received by the standby
  #2 Wait for WAL to be received and flushed by the standby
  #3 Wait for WAL to be received, flushed and replayed by the standby

 (a) would depend on synchronous_commit. Personally I'm interested in
 disabling synchronous_commit on the master and choosing #1 as the sync
 mode. Though this may be very optimistic configuration :)

 The point for performance of sync rep is to parallelize (a) and (b)+(c),
 I think. If they are performed in a serial manner, the performance
 overhead on the master would become high.

Right.  So we need to try to come up with a design that permits that, which
must be robust in the face of any number of crashes on the two
machines, in any order.  Until we have that, we're just going around
in circles.

One thought that occurred to me is that if the master and standby were
more tightly coupled, you could recover after a crash by making the
one with the further-advanced WAL position the master, and the other
one the standby.  That would get around this problem, though at the
cost of considerable additional complexity.  But then if one of the
servers comes up and can't talk to the other, you need some mechanism
for preventing split-brain syndrome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-29 Thread Bruce Momjian
Simon Riggs wrote:
 On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:
 
  The problem is not that the master streams non-fsync'd WAL, but that the
  standby can replay that. So I'm thinking that we can send non-fsync'd WAL
  safely if the standby makes the recovery wait until the master has fsync'd
  WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
  location to walreceiver, and the standby applies only the WAL which the
  master has already fsync'd. Thought?
 
 Yes, good thought. The patch just applied seems too much.
 
 I had the same thought, though it would mean you'd need to send two xlog
 end locations, one for write, one for fsync. Though not really clear why
 we send the current end of WAL on the server anyway, so maybe we can
 just alter that.

Is this a TODO?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-29 Thread Robert Haas
On Tue, Jun 29, 2010 at 10:06 PM, Bruce Momjian br...@momjian.us wrote:
 Simon Riggs wrote:
 On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:

  The problem is not that the master streams non-fsync'd WAL, but that the
  standby can replay that. So I'm thinking that we can send non-fsync'd WAL
  safely if the standby makes the recovery wait until the master has fsync'd
  WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
  location to walreceiver, and the standby applies only the WAL which the
  master has already fsync'd. Thought?

 Yes, good thought. The patch just applied seems too much.

 I had the same thought, though it would mean you'd need to send two xlog
 end locations, one for write, one for fsync. Though not really clear why
 we send the current end of WAL on the server anyway, so maybe we can
 just alter that.

 Is this a TODO?

Maybe.  As Heikki pointed out upthread, the standby can't even write
the WAL back to the OS until it's been fsync'd on the master
without risking the problem under discussion.  So we can stream the
WAL from master to standby as long as the standby just buffers it in
memory (or somewhere other than the usual location in pg_xlog).

Before we get too busy frobnicating this gonkulator, I'd like to see a
little more discussion of what kind of performance people are
expecting from sync rep.  Sounds to me like the best we can expect
here is, on every commit: (a) wait for master fsync to complete, (b)
send message to standby, (c) wait for reply from standby
indicating that fsync is complete on standby.  Even assuming that the
network overhead is minimal, that halves the commit rate.  Are the
people who want sync rep OK with that?  Is there any way to do better?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-21 Thread Fujii Masao
On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus j...@agliodbs.com wrote:
 I wonder if it would be possible to jigger things so that we send the
 WAL to the standby as soon as it is generated, but somehow arrange
 things so that the standby knows the last location that the master has
 fsync'd and never applies beyond that point.

 I can't think of any way which would not require major engineering.  And
 you'd be slowing down replication *in general* to deal with a fairly
 unlikely corner case.

 I think the panic is the way to go.

 I have yet to convince myself of how likely this is to occur.  I tried
 to reproduce this issue by crashing the database, but I think in 9.0
 you need an actual operating system crash to cause this problem, and I
 haven't yet set up an environment in which I can repeatedly crash the
 OS.  I believe, though, that in 9.1, we're going to want to stream
 from WAL buffers as proposed in the patch that started out this
 thread, and then I think this issue can be triggered with just a
 database crash.

 In 9.0, I think we can fix this problem by (1) only streaming WAL that
 has been fsync'd and (2) PANIC-ing if the problem occurs anyway.  But
 in 9.1, with sync rep and the performance demands that entails, I
 think that we're going to need to rethink it.

The problem is not that the master streams non-fsync'd WAL, but that the
standby can replay that. So I'm thinking that we can send non-fsync'd WAL
safely if the standby makes the recovery wait until the master has fsync'd
WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
location to walreceiver, and the standby applies only the WAL which the
master has already fsync'd. Thought?
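
A minimal sketch of this idea, in Python rather than PostgreSQL's C and
purely for illustration. The class and method names are hypothetical,
LSNs are plain integers, and records are assumed to arrive in LSN order;
this is not the walreceiver code.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyModel:
    """Toy standby: buffers streamed WAL, replays only what the master fsync'd."""
    received: list = field(default_factory=list)  # (end_lsn, record), in order
    master_flush_lsn: int = 0                     # last flush location reported
    applied_lsn: int = 0                          # how far recovery has replayed

    def receive(self, end_lsn, record, master_flush_lsn):
        # Each message carries WAL data plus the master's flush location.
        self.received.append((end_lsn, record))
        self.update_flush(master_flush_lsn)

    def update_flush(self, master_flush_lsn):
        # The flush location may also arrive without any new WAL data.
        self.master_flush_lsn = max(self.master_flush_lsn, master_flush_lsn)

    def replay(self):
        # Apply records not yet applied and already fsync'd on the master.
        done = []
        for lsn, record in self.received:
            if self.applied_lsn < lsn <= self.master_flush_lsn:
                done.append(record)
                self.applied_lsn = lsn
        return done


standby = StandbyModel()
standby.receive(100, "insert", master_flush_lsn=0)
standby.receive(200, "commit", master_flush_lsn=100)
print(standby.replay())   # the record ending at LSN 200 is held back
```

The sketch keeps unflushed WAL in memory only. Where the standby may
store such WAL safely is a separate question.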

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-21 Thread Heikki Linnakangas

On 21/06/10 12:08, Fujii Masao wrote:

On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas robertmh...@gmail.com wrote:

In 9.0, I think we can fix this problem by (1) only streaming WAL that
has been fsync'd and (2) PANIC-ing if the problem occurs anyway.  But
in 9.1, with sync rep and the performance demands that entails, I
think that we're going to need to rethink it.


The problem is not that the master streams non-fsync'd WAL, but that the
standby can replay that. So I'm thinking that we can send non-fsync'd WAL
safely if the standby makes the recovery wait until the master has fsync'd
WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
location to walreceiver, and the standby applies only the WAL which the
master has already fsync'd. Thought?


I guess, but you have to be very careful to correctly refrain from 
applying the WAL. For example, a naive implementation might write the 
WAL to disk in walreceiver immediately, but refrain from telling the 
startup process about it. If walreceiver is then killed because the 
connection is broken (and it will be because the master just crashed), 
the startup process will read the streamed WAL from the file in pg_xlog, 
and go ahead to apply it anyway.


So maybe there's some room for optimization there, but given the 
round-trip required for the acknowledgment anyway it might not buy you 
much, and the implementation is not very straightforward. This is 
clearly 9.1 material, if worth optimizing at all.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-21 Thread Greg Stark
On Mon, Jun 21, 2010 at 10:40 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I guess, but you have to be very careful to correctly refrain from applying
 the WAL. For example, a naive implementation might write the WAL to disk in
 walreceiver immediately, but refrain from telling the startup process about
 it. If walreceiver is then killed because the connection is broken (and it
 will be because the master just crashed), the startup process will read the
 streamed WAL from the file in pg_xlog, and go ahead to apply it anyway.

So the goal is that when you *do* failover to the standby it replays
these additional records. So whether the startup process obeys this
limit would have to be conditional on whether it's still in standby
mode.
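
That condition could be sketched as follows, in Python and with
hypothetical names (LSNs as plain integers). It only shows the rule, not
how the startup process would learn the master's flush location.

```python
def replay_limit(received_lsn, master_flush_lsn, in_standby_mode):
    """How far the startup process may replay under the rule above."""
    if in_standby_mode:
        # While following the master, never go past what it has fsync'd.
        return min(received_lsn, master_flush_lsn)
    # On failover, replay everything that was received.
    return received_lsn


print(replay_limit(500, 300, in_standby_mode=True))
print(replay_limit(500, 300, in_standby_mode=False))
```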

 So maybe there's some room for optimization there, but given the round-trip
 required for the acknowledgment anyway it might not buy you much, and the
 implementation is not very straightforward. This is clearly 9.1 material, if
 worth optimizing at all.

I don't see any need for a round-trip acknowledgement -- no more than
currently. The master just includes the flush location in every
response. It might have to send additional responses, though, when
fsyncs happen, to update the flush location even if no additional
records are sent. Otherwise a hot standby might spend a long time with
out-dated data even if on failover it would be up to date; that seems
non-ideal for hot standby users.

I think this would be a good improvement for databases processing
large batch updates so the standby doesn't have an increased risk of
losing a large amount of data if there's a crash after processing such
a large query. I agree it's 9.1 material.

Earlier we made a change to the WAL streaming protocol on the basis
that we wanted to get the protocol right even if we don't use the
change right away. I'm not sure I understand that -- it's not like
we're going to stream WAL from 9.0 to 9.1. But if that was true then
perhaps we need to add the WAL flush location to the protocol now even
if we're not going to use it yet?

-- 
greg



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-21 Thread Simon Riggs
On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:

 The problem is not that the master streams non-fsync'd WAL, but that the
 standby can replay that. So I'm thinking that we can send non-fsync'd WAL
 safely if the standby makes the recovery wait until the master has fsync'd
 WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
 location to walreceiver, and the standby applies only the WAL which the
 master has already fsync'd. Thought?

Yes, good thought. The patch just applied seems too much.

I had the same thought, though it would mean you'd need to send two xlog
end locations, one for write, one for fsync. Though not really clear why
we send the current end of WAL on the server anyway, so maybe we can
just alter that.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Fujii Masao
On Tue, Jun 15, 2010 at 2:16 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 15/06/10 07:47, Fujii Masao wrote:

 On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Fujii Masao masao.fu...@gmail.com writes:

 Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
 xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.

 Wrong.  LogwrtResult.Write tracks how far we've written out data,
 but it is only (known to be) fsync'd as far as LogwrtResult.Flush.

 Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position
 we've written. But in the current XLogWrite() code, it's updated after
 XLogWrite() calls issue_xlog_fsync(). No?

 issue_xlog_fsync() is only called if the caller requested a flush by
 advancing WriteRqst.Flush.

True. The scenario that I'm concerned about is:

1. A transaction commit causes XLogFlush() to write *and* fsync WAL up to
   the commit record.
2. XLogFlush() calls XLogWrite(), and xlogctl->LogwrtResult.Write is
   updated to an LSN greater than or equal to that of the commit
   record after XLogWrite() calls issue_xlog_fsync().
3. Then walsender can send WAL up to the commit record.

A transaction commit would need to wait for local fsync and replication
in a serial manner, in synchronous replication. IOW, walsender cannot
send the commit record until it's fsync'd in XLogWrite().

Can this scenario not happen? Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Robert Haas
On Tue, Jun 15, 2010 at 12:46 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas robertmh...@gmail.com wrote:
 Maybe.  That sounds like a pretty enormous foot-gun to me, considering
 that we have no way of recovering from the situation where the standby
 gets ahead of the master.

 No, we can do that by reconstructing the standby from the backup.

 And, that situation is not a problem for users including me who prefer to
 perform a failover when the master goes down.

 You don't get to pick - if a backend crashes on the master, it will
 restart right away and come up, but the slave will now be hosed...

 You are concerned about the case where postmaster automatically restarts
 the crash recovery, in particular? Yes, this case is more problematic.
 If the standby is ahead of the master, the standby might find an invalid
 record and run into the infinite retry loop, or keep working without
 noticing the inconsistency between the database and the WAL.

 I'm thinking that walreceiver should throw a PANIC when it receives a
 record whose LSN is older than the last WAL receive location, except
 at the beginning of streaming (because the standby always requests
 streaming from the start of the WAL file at first, even if some records
 have already been received previously). Thoughts?

Yeah, that seems like it would be a good safety check.
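
One way to read that check, sketched in Python. The exception class, the
function name and the `at_stream_start` flag are all hypothetical, and
how long "the beginning of streaming" lasts is left to the caller; this
is an interpretation of the proposal, not walreceiver code.

```python
class WalStreamPanic(Exception):
    """Stand-in for a PANIC raised by walreceiver."""


def check_received_lsn(record_lsn, last_received_lsn, at_stream_start):
    """Raise if the master sends WAL older than what was already received."""
    if at_stream_start:
        # Streaming restarts at the beginning of the WAL file, so WAL that
        # was received before is expected to arrive again here.
        return
    if record_lsn < last_received_lsn:
        raise WalStreamPanic(
            f"received LSN {record_lsn} is older than "
            f"last receive location {last_received_lsn}"
        )


check_received_lsn(50, 100, at_stream_start=True)    # accepted: initial resend
check_received_lsn(150, 100, at_stream_start=False)  # accepted: new WAL
```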

I wonder if it would be possible to jigger things so that we send the
WAL to the standby as soon as it is generated, but somehow arrange
things so that the standby knows the last location that the master has
fsync'd and never applies beyond that point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Florian Pflug
On Jun 15, 2010, at 10:45 , Fujii Masao wrote:
 A transaction commit would need to wait for local fsync and replication
 in a serial manner, in synchronous replication. IOW, walsender cannot
 send the commit record until it's fsync'd in XLogWrite().

Hm, but since 9.0 won't do synchronous replication anyway, the right thing to 
do for 9.0 is still to send only fsync'ed WAL, no? Without synchronous 
replication the overhead seems negligible.

For synchronous replication (and hence for 9.1) I think there are two basic 
options:

a) Stream only fsync'ed WAL, like in the asynchronous case. Depending on 
policy, additionally wait for one or more slaves to fsync before reporting 
success.

b) Stream non-fsync'ed WAL. On COMMIT, wait for at least one node (not 
necessarily the master, exact count depends on policy) to fsync before 
reporting success. During recovery of the master, recover up to the latest LSN 
found on any one of the nodes.

Option (b) requires some additional thought, though. Controlled removal of 
slave nodes and concurrent crashes of more than one node are the most difficult 
areas to handle gracefully, it seems.
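
As a rough illustration of option (b), the sketch below models the two decisions involved: when a COMMIT may be reported as successful, and which LSN the master should recover to. The function names, the per-node array of fsync'd positions, and the "required" policy parameter are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical: byte position in the WAL stream */

/*
 * COMMIT may report success once at least `required` nodes (master or
 * slaves) have fsync'd WAL up to commit_lsn.
 */
int commit_is_durable(const Lsn *flushed, size_t nnodes,
                      Lsn commit_lsn, size_t required)
{
    size_t have = 0;

    assert(required >= 1 && required <= nnodes);
    for (size_t i = 0; i < nnodes; i++)
        if (flushed[i] >= commit_lsn)
            have++;
    return have >= required;
}

/* Recovery target for option (b): the latest fsync'd LSN on any node. */
Lsn recovery_target(const Lsn *flushed, size_t nnodes)
{
    Lsn latest = 0;

    assert(nnodes > 0);
    for (size_t i = 0; i < nnodes; i++)
        if (flushed[i] > latest)
            latest = flushed[i];
    return latest;
}
```

The hard cases mentioned above (removing a slave, or several nodes crashing at once) show up here as the question of which entries in the array can still be trusted; the sketch does not try to answer that.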

best regards,
Florian Pflug




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Josh Berkus

 I wonder if it would be possible to jigger things so that we send the
 WAL to the standby as soon as it is generated, but somehow arrange
 things so that the standby knows the last location that the master has
 fsync'd and never applies beyond that point.

I can't think of any way which would not require major engineering.  And
you'd be slowing down replication *in general* to deal with a fairly
unlikely corner case.

I think the panic is the way to go.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Robert Haas
On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus j...@agliodbs.com wrote:
 I wonder if it would be possible to jigger things so that we send the
 WAL to the standby as soon as it is generated, but somehow arrange
 things so that the standby knows the last location that the master has
 fsync'd and never applies beyond that point.

 I can't think of any way which would not require major engineering.  And
 you'd be slowing down replication *in general* to deal with a fairly
 unlikely corner case.

 I think the panic is the way to go.

I have yet to convince myself of how likely this is to occur.  I tried
to reproduce this issue by crashing the database, but I think in 9.0
you need an actual operating system crash to cause this problem, and I
haven't yet set up an environment in which I can repeatedly crash the
OS.  I believe, though, that in 9.1, we're going to want to stream
from WAL buffers as proposed in the patch that started out this
thread, and then I think this issue can be triggered with just a
database crash.

In 9.0, I think we can fix this problem by (1) only streaming WAL that
has been fsync'd and (2) PANIC-ing if the problem occurs anyway.  But
in 9.1, with sync rep and the performance demands that entails, I
think that we're going to need to rethink it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Josh Berkus

 I have yet to convince myself of how likely this is to occur.  I tried
 to reproduce this issue by crashing the database, but I think in 9.0
 you need an actual operating system crash to cause this problem, and I
 haven't yet set up an environment in which I can repeatedly crash the
 OS.  I believe, though, that in 9.1, we're going to want to stream
 from WAL buffers as proposed in the patch that started out this
 thread, and then I think this issue can be triggered with just a
 database crash.

Yes, but it still requires:

a) the master must crash with at least one transaction transmitted to
the slave and not yet fsync'd
b) the slave must not crash as well
c) the master must come back up without the slave ever having been
promoted to master

Note that (a) is fairly improbable to begin with due to both our
batching transactions into bundles for transmission, and network latency
vs. disk latency.

So, is it possible?  Yes.  Will it happen anywhere but the
highest-txn-rate sites one in 10,000 times?  No.

This means that we should look for a solution which does not penalize
the common case in order to close a very improbable hole, if such a
solution exists.

 In 9.0, I think we can fix this problem by (1) only streaming WAL that
 has been fsync'd and 

I don't think this is the best solution; it would be a noticeable
performance penalty on replication.  It also would potentially result in
data loss for the user; if the user fails over to the slave in the
corner case, they can rescue the in-flight transaction.  At the least,
this would need to become Yet Another Configuration Option.

(2) PANIC-ing if the problem occurs anyway.  

The question is, is detecting out-of-order WAL records *sufficient* to
detect a failure?  I'm thinking there are possible sequences where there
would be no out-of-sequence, but the slave would still have a
transaction the master doesn't, which the user wouldn't know until a
page update corrupts their data.

 But
 in 9.1, with sync rep and the performance demands that entails, I
 think that we're going to need to rethink it.

All the more reason to avoid dealing with it now, if we can.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Josh Berkus
On 6/15/10 5:09 PM, Josh Berkus wrote:
  In 9.0, I think we can fix this problem by (1) only streaming WAL that
  has been fsync'd and 
 
 I don't think this is the best solution; it would be a noticeable
 performance penalty on replication. 

Actually, there's an even bigger reason not to mandate waiting for
fsync: what if the user turns fsync off?

One can certainly imagine users choosing to rely on their replication
slaves for crash recovery instead of fsync.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-15 Thread Robert Haas
On Tue, Jun 15, 2010 at 8:09 PM, Josh Berkus j...@agliodbs.com wrote:

 I have yet to convince myself of how likely this is to occur.  I tried
 to reproduce this issue by crashing the database, but I think in 9.0
 you need an actual operating system crash to cause this problem, and I
 haven't yet set up an environment in which I can repeatedly crash the
 OS.  I believe, though, that in 9.1, we're going to want to stream
 from WAL buffers as proposed in the patch that started out this
 thread, and then I think this issue can be triggered with just a
 database crash.

 Yes, but it still requires:

 a) the master must crash with at least one transaction transmitted to
 the slave and not yet fsync'd

Bt.  Stop right there.  It only requires the master to crash with
at least one *WAL record* written but not transmitted, not one
transaction.  And most WAL record types are not fsync'd immediately.
So in theory I think that, for example, an OS crash in the middle of a
big bulk insert operation should be sufficient to trigger this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Fujii Masao
On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas robertmh...@gmail.com wrote:
 I think the failover case might be OK.  But if the master crashes and
 restarts, the slave might be left thinking its xlog position is ahead
 of the xlog position on the master.

Right. Unless we perform a failover in this case, the standby might go down
because of WAL inconsistency after the master restarts. To avoid this
problem, walsender must wait for WAL to be not only written but also *fsynced*
on the master before sending it, as 9.0 does. Though this would degrade
performance, it might be useful in some cases. Should we provide a knob
to specify whether the standby is allowed to get ahead of the master?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Fujii Masao
On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 hmm not sure that is what fujii tried to say - I think his point was
 that in the original case we would have serialized all the operations
 (first write+sync on the master, network afterwards and write+sync on
 the slave) and now we could try parallelizing by sending the wal before
 we have synced locally.

 Well, we're already not waiting for fsync, which is the slowest part.

No, currently walsender waits for fsync.

Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
As a result, walsender cannot send WAL that has not been fsynced yet. Should
we update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync
for 9.0?

But that change would cause the problem that Robert pointed out.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

 If there's a performance problem, it may be because FADVISE_DONTNEED
 disables kernel buffering so that we're forced to actually read the data
 back from disk before sending it on down the wire.

Currently, if max_wal_senders > 0, POSIX_FADV_DONTNEED is not used for
WAL files at all.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Fujii Masao
On Sat, Jun 12, 2010 at 12:15 AM, Stefan Kaltenbrunner
ste...@kaltenbrunner.cc wrote:
 hmm ok - but assuming sync rep we would end up with something like the
 following (hypothetically assuming each operation takes 1 time unit):

 originally:

 write 1
 sync 1
 network 1
 write 1
 sync 1

 total: 5

 whereas in the new case we would basically have the write+sync compete with
 network+write+sync in parallel(total 3 units) and we would only have to wait
 for the slower of those two sets of operations instead of the total time of
 both or am I missing something.

Yeah, this is what I'd like to say. Thanks!
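
The arithmetic can be written down as a small sketch. The function names are made up for illustration, and the model assumes (as in the example above) that the standby's write and sync cost the same as the master's.

```c
/*
 * Serial case: master write+sync, then network transfer, then standby
 * write+sync, all one after another.
 */
unsigned serial_commit_time(unsigned write, unsigned sync, unsigned network)
{
    return (write + sync) + network + (write + sync);
}

/*
 * Overlapped case: the master's local write+sync runs concurrently with
 * network transfer plus the standby's write+sync.  The commit waits only
 * for the slower of the two paths.
 */
unsigned overlapped_commit_time(unsigned write, unsigned sync, unsigned network)
{
    unsigned local = write + sync;
    unsigned remote = network + write + sync;

    return local > remote ? local : remote;
}
```

With every operation costing 1 unit, the serial path takes 5 units and the overlapped path takes 3, which matches the numbers in the quoted example.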

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 4:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas robertmh...@gmail.com wrote:
 I think the failover case might be OK.  But if the master crashes and
 restarts, the slave might be left thinking its xlog position is ahead
 of the xlog position on the master.

 Right. Unless we perform a failover in this case, the standby might go down
 because of inconsistency of WAL after restarting the master. To avoid this
 problem, walsender must wait for WAL to be not only written but also *fsynced*
 on the master before sending it as 9.0 does. Though this would degrade the
 performance, this might be useful for some cases. We should provide the knob
 to specify whether to allow the standby to go ahead of the master or not?

Maybe.  That sounds like a pretty enormous foot-gun to me, considering
that we have no way of recovering from the situation where the standby
gets ahead of the master.  Right now, I believe we're still in the
situation where the standby goes into an infinite CPU-chewing,
log-spewing loop, but even after we fix that it's not going to be good
enough to really handle that case sensibly, which we probably need to
do if we want to make this change.

Come to think of it, can this happen already?  Can the master stream
WAL to the standby after it's written but before it's fsync'd?

We should get the open item fixed for 9.0 here before we start
worrying about 9.1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Simon Riggs
On Mon, 2010-06-14 at 17:39 +0900, Fujii Masao wrote:
 On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
  hmm not sure that is what fujii tried to say - I think his point was
  that in the original case we would have serialized all the operations
  (first write+sync on the master, network afterwards and write+sync on
  the slave) and now we could try parallelizing by sending the wal before
  we have synced locally.
 
  Well, we're already not waiting for fsync, which is the slowest part.
 
 No, currently walsender waits for fsync.
 
 Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
 xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
 As the result, walsender cannot send WAL not fsynced yet. We should
 update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync
 for 9.0?
 
 But that change would cause the problem that Robert pointed out.
 http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

ISTM you just defined some clear objectives for next work.

Copying the data from WAL buffers is mostly irrelevant. The majority of
time is lost waiting for fsync. The biggest issue is about how to allow
WAL write and WALSender to act concurrently and have backend wait for
both.

Sure, copying data from wal_buffers will be faster still, but it will
cause you to address some subtle data structure locking operations that
we could solve at a later time. And it still gives the problem of how
the master resets itself if the standby really is ahead.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Simon Riggs
On Mon, 2010-06-14 at 17:39 +0900, Fujii Masao wrote:
 No, currently walsender waits for fsync.
 ...

 But that change would cause the problem that Robert pointed out.
 http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

Presumably this means that if synchronous_commit = off on the primary,
SR in 9.0 will no longer work correctly if the primary crashes?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Fujii Masao
On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas robertmh...@gmail.com wrote:
 Maybe.  That sounds like a pretty enormous foot-gun to me, considering
 that we have no way of recovering from the situation where the standby
 gets ahead of the master.

No, we can do that by reconstructing the standby from the backup.

And, that situation is not a problem for users including me who prefer to
perform a failover when the master goes down. Of course, we can just restart
the master in that case, but that is likely to take longer than a failover
because the cause of the crash would have to be dealt with first. For example,
if the master goes down because of a media crash, it would never start up
unless PITR is performed. So I'm not sure how many users prefer a restart to
a failover.

 We should get the open item fixed for 9.0 here before we start
 worrying about 9.1.

Yep, that's why I've been submitting some patches these days :)

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Robert Haas
On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas robertmh...@gmail.com wrote:
 Maybe.  That sounds like a pretty enormous foot-gun to me, considering
 that we have no way of recovering from the situation where the standby
 gets ahead of the master.

 No, we can do that by reconstructing the standby from the backup.

 And, that situation is not a problem for users including me who prefer to
 perform a failover when the master goes down.

You don't get to pick - if a backend crashes on the master, it will
restart right away and come up, but the slave will now be hosed...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, we're already not waiting for fsync, which is the slowest part.

 No, currently walsender waits for fsync.

No, you're mistaken.

 Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
 xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.

Wrong.  LogwrtResult.Write tracks how far we've written out data,
but it is only (known to be) fsync'd as far as LogwrtResult.Flush.

 But that change would cause the problem that Robert pointed out.
 http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

Yes.  Possibly walsender should only be allowed to send as far as
LogwrtResult.Flush.
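
A minimal sketch of the distinction, using a simplified stand-in for the real structure: the struct below keeps only the two positions discussed here, and walsender_send_limit() with its flushed_only switch is a hypothetical helper, not an existing function.

```c
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical: byte position in the WAL stream */

/* Simplified stand-in for LogwrtResult: just the two positions. */
typedef struct
{
    Lsn Write;  /* how far WAL has been written out to the kernel */
    Lsn Flush;  /* how far WAL is known to be fsync'd */
} LogwrtResultSketch;

/*
 * How far walsender may send.  With flushed_only set, it stops at the
 * fsync'd position, so the standby can never get ahead of what the
 * master would still have after a crash.
 */
Lsn walsender_send_limit(const LogwrtResultSketch *result, int flushed_only)
{
    return flushed_only ? result->Flush : result->Write;
}
```

If Write is 200 and Flush is 150, the conservative limit is 150; the 50 bytes in between are exactly the WAL that could be lost by an OS crash on the master.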

regards, tom lane



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Fujii Masao
On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas robertmh...@gmail.com wrote:
 Maybe.  That sounds like a pretty enormous foot-gun to me, considering
 that we have no way of recovering from the situation where the standby
 gets ahead of the master.

 No, we can do that by reconstructing the standby from the backup.

 And, that situation is not a problem for users including me who prefer to
 perform a failover when the master goes down.

 You don't get to pick - if a backend crashes on the master, it will
 restart right away and come up, but the slave will now be hosed...

Are you concerned in particular about the case where postmaster automatically
restarts and performs crash recovery? Yes, this case is more problematic.
If the standby is ahead of the master, the standby might find an invalid
record and run into an infinite retry loop, or keep working without
noticing the inconsistency between the database and the WAL.

I'm thinking that walreceiver should throw a PANIC when it receives a
record whose LSN is older than the last WAL receive location,
except at the beginning of streaming (because the standby always first
requests streaming from the start of a WAL file, even if some of its records
have already been received). Thoughts?
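
A minimal sketch of that check, assuming walreceiver tracks the last receive location and knows whether the incoming chunk is the first one since (re)connecting. The function name and parameters are hypothetical, and the caller would be the one to raise the PANIC.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical: byte position in the WAL stream */

/*
 * Return true if a received chunk starting at chunk_start is acceptable.
 *
 * The first chunk after connecting is exempt, because streaming is
 * requested from the start of a WAL file and may legitimately resend
 * records that were already received.
 */
bool chunk_start_is_sane(Lsn chunk_start, Lsn last_received, bool first_chunk)
{
    if (first_chunk)
        return true;
    return chunk_start >= last_received;
}
```

A false result would mean the master is sending WAL from a position behind what the standby already holds, which is the symptom of the standby having got ahead of the master.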

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Fujii Masao
On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, we're already not waiting for fsync, which is the slowest part.

 No, currently walsender waits for fsync.

 No, you're mistaken.

 Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
 xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.

 Wrong.  LogwrtResult.Write tracks how far we've written out data,
 but it is only (known to be) fsync'd as far as LogwrtResult.Flush.

Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position
we've written. But in the current XLogWrite() code, it's updated after
XLogWrite() calls issue_xlog_fsync(). No?

Of course, the backend-local LogwrtResult.Write is updated before
issue_xlog_fsync(), but it's not visible to walsender.

Am I missing something?

 But that change would cause the problem that Robert pointed out.
 http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

 Yes.  Possibly walsender should only be allowed to send as far as
 LogwrtResult.Flush.

Yes, in order to avoid that problem, walsender should wait for WAL
to be fsync'd before sending it.

But I'm worried that this would slow down the performance on the master
significantly because WAL flush and WAL streaming are not performed
concurrently and the backend must wait for both in a serial manner.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-14 Thread Heikki Linnakangas

On 15/06/10 07:47, Fujii Masao wrote:

On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:

Fujii Masao masao.fu...@gmail.com writes:

Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.


Wrong.  LogwrtResult.Write tracks how far we've written out data,
but it is only (known to be) fsync'd as far as LogwrtResult.Flush.


Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position
we've written. But in the current XLogWrite() code, it's updated after
XLogWrite() calls issue_xlog_fsync(). No?


issue_xlog_fsync() is only called if the caller requested a flush by 
advancing WriteRqst.Flush.
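
A toy model of that behaviour, for illustration only: it is not the real XLogWrite(), and the struct and function names are made up. It shows that a write-only request advances Write and leaves Flush behind, while a request that advances the flush position triggers the fsync step.

```c
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical: byte position in the WAL stream */

typedef struct
{
    Lsn Write;  /* written out to the kernel */
    Lsn Flush;  /* known to be fsync'd */
} WalPositions;

/* Toy model of one write call; `request` plays the role of WriteRqst. */
void toy_xlog_write(WalPositions *done, WalPositions request)
{
    if (request.Write > done->Write)
        done->Write = request.Write;    /* data handed to the kernel */

    /* fsync happens only if the caller asked for a flush beyond Flush. */
    if (request.Flush > done->Flush && done->Flush < done->Write)
    {
        /* the real code would call issue_xlog_fsync() at this point */
        done->Flush = done->Write;      /* fsync covers everything written */
    }
}
```

After a request of {Write = 100, Flush = 0} the model holds Write = 100 and Flush = 0; this gap between the two positions is the written-but-unflushed WAL the thread is worried about.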


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-13 Thread Greg Smith

Florian Pflug wrote:

glibc defines O_DSYNC as an alias for O_SYNC and warrants that with
Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which 
require all metadata updates of a write to be on disk on returning to userspace, but only 
the O_DSYNC semantics, which require only actual file data and metadata necessary to 
retrieve it to be on disk by the time the system call returns.

If that is true, I believe we should default to open_sync, not fdatasync if 
open_datasync isn't available, at least on linux.
  


That's not the case: in practice, O_SYNC on Linux has never worked 
reliably on ext3.  See 
http://archives.postgresql.org/pgsql-hackers/2007-10/msg01310.php for 
an example of how terrible the situation would be if O_SYNC were the 
default on Linux.


We just got a report that a better O_DSYNC is now properly exposed 
starting on kernel 2.6.33+glibc 2.12:  
http://archives.postgresql.org/message-id/201006041539.03868.cousinm...@gmail.com 
and it's possible they may have finally fixed it so it works like it's 
supposed to.  PostgreSQL versions compiled against the right 
prerequisites will default to O_DSYNC by themselves.  Whether or not 
this is a good thing has yet to be determined.  The last thing we'd want 
to do at this point is make the old and usually broken O_SYNC behavior 
suddenly preferred, when the new and possibly fixed O_DSYNC one will be 
automatically selected when available without any code changes on the 
database side.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-12 Thread Florian Pflug
On Jun 12, 2010, at 3:10 , Josh Berkus wrote:
 Hm, but then Robert's failure case is real, and streaming replication might 
 break due to an OS-level crash of the master. Or am I missing something?
 
 1) Master goes out
 2) floating transaction applied to standby.
 3) Standby goes out
 4) Power back on
 5) master comes up
 6) standby comes up
 
 It seems like, in that sequence, the standby would have one transaction
 which the master doesn't have, yet the standby thinks it can continue
 getting WAL from the master.  Or did I miss something which makes this
 impossible?


I did indeed miss something - with wal_sync_method set to either open_datasync 
or open_sync, all written WAL is also synced. Since open_datasync is the 
preferred setting according to 
http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD,
 systems supporting open_datasync should be safe.

My Ubuntu 10.04 box running postgres 8.4.4 doesn't support open_datasync 
though, and hence defaults to fdatasync. Probably because of this fragment in 
xlogdefs.h
#if O_DSYNC != BARE_OPEN_SYNC_FLAG
#define OPEN_DATASYNC_FLAG  (O_DSYNC | PG_O_DIRECT)
#endif

glibc defines O_DSYNC as an alias for O_SYNC and justifies that with:
"Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, 
which require all metadata updates of a write to be on disk on returning to 
userspace, but only the O_DSYNC semantics, which require only actual file data 
and metadata necessary to retrieve it to be on disk by the time the system call 
returns."

If that is true, I believe we should default to open_sync, not fdatasync, if 
open_datasync isn't available, at least on Linux.

best regards,
Florian Pflug




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-12 Thread Heikki Linnakangas

On 12/06/10 01:16, Josh Berkus wrote:



Well, we're already not waiting for fsync, which is the slowest part.
If there's a performance problem, it may be because FADVISE_DONTNEED
disables kernel buffering so that we're forced to actually read the data
back from disk before sending it on down the wire.


Well, that's fairly direct to solve, no?  Just disable FADVISE_DONTNEED
if walsenders > 0.


We already do that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Fujii Masao
Hi,

In 9.0, walsender reads WAL always from the disk and sends it to the standby.
That is, we cannot send WAL until it has been written (and flushed) to the disk.
This degrades the performance of synchronous replication very much since a
transaction commit must wait for the WAL write time *plus* the replication time.

The attached patch enables walsender to read data from WAL buffers in addition
to the disk. Since we can write and send WAL simultaneously, in synchronous
replication, a transaction commit has only to wait for either of them. So the
performance would significantly increase.

Now three hackers (Zoltan, Simon and me) are planning to develop the synchronous
replication feature. I'm not sure whose patch will be committed in the end. But
since the attached patch provides just an infrastructure to optimize SR, it
should work fine together with any of them and have a good effect.

I'll add the patch into the next CF. AFAIK the ReviewFest will start Jun 15.
During that, if you are interested in the patch, please feel free to review it.
Also you can get the code change from my git repository:

git://git.postgresql.org/git/users/fujii/postgres.git
branch: read-wal-buffers

From here on I describe the details of the change. At first, walsender reads WAL
from the disk. If it has reached the current write location (i.e., there is no
unsent WAL on the disk), then it attempts to read from WAL buffers. This buffer
reading continues until the WAL to send has been purged from WAL buffers. IOW,
if WAL buffers are large enough and walsender keeps up with the insertion
of WAL, it can read WAL from the buffers forever.

Then, if the WAL to send has been purged from the buffers, walsender backs off
and tries to read it from the disk. If it finds no WAL to send on the disk,
walsender attempts to read WAL from the buffers again. Walsender repeats these
operations.

The location of the oldest record in the buffers is saved in the shared memory.
This location is used to calculate whether the particular WAL is in the buffers
or not.
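
As an illustration of that calculation, here is a minimal sketch. The function name and the insert_upto parameter (the end of the WAL inserted into the buffers so far) are assumptions for the example; the patch itself may compute this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical: byte position in the WAL stream */

/*
 * Return true if the WAL range [start, end) is still entirely held in the
 * WAL buffers, given the oldest location kept there and the current end
 * of inserted WAL.
 */
bool wal_is_in_buffers(Lsn start, Lsn end, Lsn oldest_in_buffers, Lsn insert_upto)
{
    assert(start <= end);
    return start >= oldest_in_buffers && end <= insert_upto;
}
```

When this returns false because start is older than oldest_in_buffers, the range has been purged and must be read from the disk instead.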

To avoid lock contention, walsender reads WAL buffers and XLogCtl->xlblocks
without holding either WALInsertLock or WALWriteLock. Of course, they might be
changed because of buffer replacement while being read. So after reading them,
we check that what we read was valid by comparing the location of the read WAL
with the location of the oldest record in the buffers. This logic is similar to
what XLogRead() does at the end.
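
The read-then-validate pattern can be sketched as follows. This is a simplified, single-threaded model with a byte-addressed ring and made-up names (WalRing, ring_read); the real WAL buffers are page-based, and a truly concurrent version would also need appropriate memory barriers, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical: byte position in the WAL stream */

#define RING_SIZE 64u   /* illustrative size only */

typedef struct
{
    unsigned char data[RING_SIZE];
    volatile Lsn  oldest;       /* oldest location still held in the ring */
    Lsn           insert_upto;  /* end of WAL inserted so far */
} WalRing;

/*
 * Copy len bytes starting at `start` without taking any lock, then verify
 * that the region was not recycled while it was being copied.
 * Returns false if the caller must fall back to reading from disk.
 */
bool ring_read(const WalRing *ring, Lsn start, size_t len, unsigned char *out)
{
    if (len > RING_SIZE || start < ring->oldest ||
        start + len > ring->insert_upto)
        return false;

    for (size_t i = 0; i < len; i++)
        out[i] = ring->data[(start + i) % RING_SIZE];

    /* Re-check: if `oldest` moved past start, the copy may be garbage. */
    return start >= ring->oldest;
}
```

The design choice is to pay for an occasional wasted copy instead of making every insert and write contend with walsender for a lock.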

This feature is required for preventing the performance of synchronous
replication from dropping significantly. It can cut the time that a transaction
committed on the master takes to become visible on the standby. So, it's also
useful for asynchronous replication.

Thought? Comment? Objection?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


read_wal_buffers_v1.patch
Description: Binary data



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Robert Haas
On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Thought? Comment? Objection?

What happens if the WAL is streamed to the standby and then the master
crashes without writing that WAL to disk?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Fujii Masao
On Fri, Jun 11, 2010 at 10:22 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Thought? Comment? Objection?

 What happens if the WAL is streamed to the standby and then the master
 crashes without writing that WAL to disk?

What are you concerned about?

I think that the situation would be the same as 9.0 from users' perspective.
After failover, the transaction which a client regards as aborted (because
of the crash) might be visible or invisible on new master (i.e., original
standby). For now, we cannot control that.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Robert Haas
On Fri, Jun 11, 2010 at 9:57 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Jun 11, 2010 at 10:22 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 11, 2010 at 9:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 Thought? Comment? Objection?

 What happens if the WAL is streamed to the standby and then the master
 crashes without writing that WAL to disk?

 What are you concerned about?

 I think that the situation would be the same as 9.0 from users' perspective.
 After failover, the transaction which a client regards as aborted (because
 of the crash) might be visible or invisible on new master (i.e., original
 standby). For now, we cannot control that.

I think the failover case might be OK.  But if the master crashes and
restarts, the slave might be left thinking its xlog position is ahead
of the xlog position on the master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 In 9.0, walsender reads WAL always from the disk and sends it to the standby.
 That is, we cannot send WAL until it has been written (and flushed) to the 
 disk.

I believe the above statement to be incorrect: walsender does *not* wait
for an fsync to occur.

I agree with the idea of trying to read from WAL buffers instead of the
file system, but the main reason why is that the current behavior makes
FADVISE_DONTNEED for WAL pretty dubious.  It'd be a good idea to still
(artificially) limit replication to not read ahead of the written-out
data.
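
A sketch of such a limit, assuming flat 64-bit WAL positions and hypothetical
helper names (nothing here is existing walsender code): the send boundary is
capped at whatever has already been written out, even if more WAL is available
in the buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for XLogRecPtr: a flat 64-bit byte position. */
typedef uint64_t WalPos;

/*
 * Hypothetical helper: how far walsender may send in this cycle.
 * inserted_upto is the end of WAL present in the WAL buffers;
 * written_upto is the end of WAL already written out to the file system.
 */
static WalPos
send_limit(WalPos inserted_upto, WalPos written_upto)
{
    return inserted_upto < written_upto ? inserted_upto : written_upto;
}

/* Number of bytes that may be sent starting at sent_upto. */
static WalPos
sendable_bytes(WalPos sent_upto, WalPos inserted_upto, WalPos written_upto)
{
    WalPos limit = send_limit(inserted_upto, written_upto);

    return limit > sent_upto ? limit - sent_upto : 0;
}

/* Usage: 900 bytes buffered, only 700 written out, 600 already sent. */
static void
send_limit_example(void)
{
    assert(send_limit(900, 700) == 700);
    assert(sendable_bytes(600, 900, 700) == 100);
}
```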

 ... Since we can write and send WAL simultaneously, in synchronous
 replication, a transaction commit has only to wait for either of them. So the
 performance would significantly increase.

That performance claim, frankly, is ludicrous.  There is no way that
round trip network delay plus write+fsync on the slave is faster than
local write+fsync.  Furthermore, I would say that you are thinking
exactly backwards about the requirements for synchronous replication:
what that would mean is that transaction commit waits for *both*,
not whichever one finishes first.

regards, tom lane



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Stefan Kaltenbrunner

On 06/11/2010 04:31 PM, Tom Lane wrote:

Fujii Masao masao.fu...@gmail.com writes:

In 9.0, walsender reads WAL always from the disk and sends it to the standby.
That is, we cannot send WAL until it has been written (and flushed) to the disk.


I believe the above statement to be incorrect: walsender does *not* wait
for an fsync to occur.

I agree with the idea of trying to read from WAL buffers instead of the
file system, but the main reason why is that the current behavior makes
FADVISE_DONTNEED for WAL pretty dubious.  It'd be a good idea to still
(artificially) limit replication to not read ahead of the written-out
data.


... Since we can write and send WAL simultaneously, in synchronous
replication, a transaction commit has only to wait for either of them. So the
performance would significantly increase.


That performance claim, frankly, is ludicrous.  There is no way that
round trip network delay plus write+fsync on the slave is faster than
local write+fsync.  Furthermore, I would say that you are thinking
exactly backwards about the requirements for synchronous replication:
what that would mean is that transaction commit waits for *both*,
not whichever one finishes first.


hmm not sure that is what fujii tried to say - I think his point was 
that in the original case we would have serialized all the operations 
(first write+sync on the master, network afterwards and write+sync on 
the slave) and now we could try parallelizing by sending the wal before 
we have synced locally.




Stefan



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Tom Lane
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 hmm not sure that is what fujii tried to say - I think his point was 
 that in the original case we would have serialized all the operations 
 (first write+sync on the master, network afterwards and write+sync on 
 the slave) and now we could try parallelizing by sending the wal before 
 we have synced locally.

Well, we're already not waiting for fsync, which is the slowest part.
If there's a performance problem, it may be because FADVISE_DONTNEED
disables kernel buffering so that we're forced to actually read the data
back from disk before sending it on down the wire.

regards, tom lane



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Stefan Kaltenbrunner

On 06/11/2010 04:47 PM, Tom Lane wrote:

Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:

hmm not sure that is what fujii tried to say - I think his point was
that in the original case we would have serialized all the operations
(first write+sync on the master, network afterwards and write+sync on
the slave) and now we could try parallelizing by sending the wal before
we have synced locally.


Well, we're already not waiting for fsync, which is the slowest part.
If there's a performance problem, it may be because FADVISE_DONTNEED
disables kernel buffering so that we're forced to actually read the data
back from disk before sending it on down the wire.


hmm ok - but assuming sync rep we would end up with something like the
following (hypothetically assuming each operation takes 1 time unit):


originally:

write 1
sync 1
network 1
write 1
sync 1

total: 5

whereas in the new case we would basically have the write+sync compete
with network+write+sync in parallel (total 3 units) and we would only
have to wait for the slower of those two sets of operations instead of
the total time of both. Or am I missing something?
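
The same arithmetic as a small sketch, with hypothetical names and the
one-time-unit-per-step assumption from the example above. It ignores the
acknowledgement trip back from the slave, just as the example does.

```c
#include <assert.h>

/* Cost of each step in arbitrary time units (all 1 in the example above). */
typedef struct CommitCosts
{
    unsigned master_write;
    unsigned master_sync;
    unsigned network;
    unsigned standby_write;
    unsigned standby_sync;
} CommitCosts;

/* Original case: every step waits for the previous one. */
static unsigned
serialized_latency(const CommitCosts *c)
{
    return c->master_write + c->master_sync + c->network +
           c->standby_write + c->standby_sync;
}

/* New case: local write+sync runs in parallel with network+write+sync. */
static unsigned
parallel_latency(const CommitCosts *c)
{
    unsigned local = c->master_write + c->master_sync;
    unsigned remote = c->network + c->standby_write + c->standby_sync;

    return local > remote ? local : remote;
}

/* Time units saved by overlapping the two sets of operations. */
static unsigned
latency_saving(const CommitCosts *c)
{
    unsigned serial = serialized_latency(c);
    unsigned parallel = parallel_latency(c);

    assert(parallel <= serial);     /* the maximum never exceeds the sum */
    return serial - parallel;
}
```

With every cost set to 1, the serialized path takes 5 units and the parallel
path takes 3, so the commit waits 2 units less.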



Stefan



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Josh Berkus

 Well, we're already not waiting for fsync, which is the slowest part.
 If there's a performance problem, it may be because FADVISE_DONTNEED
 disables kernel buffering so that we're forced to actually read the data
 back from disk before sending it on down the wire.

Well, that's fairly straightforward to solve, no?  Just disable
FADVISE_DONTNEED if walsenders > 0.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Florian Pflug
On Jun 11, 2010, at 16:31 , Tom Lane wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 In 9.0, walsender reads WAL always from the disk and sends it to the standby.
 That is, we cannot send WAL until it has been written (and flushed) to the 
 disk.
 
 I believe the above statement to be incorrect: walsender does *not* wait
 for an fsync to occur.

Hm, but then Robert's failure case is real, and streaming replication might 
break due to an OS-level crash of the master. Or am I missing something?

best regards,
Florian Pflug




Re: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers

2010-06-11 Thread Josh Berkus

 Hm, but then Robert's failure case is real, and streaming replication might 
 break due to an OS-level crash of the master. Or am I missing something?

Well, in the failover case this isn't a problem, it's a benefit: the
standby gets a transaction which you would have lost off the master.
However, I can see this as a problem in the event of a server-room
power outage with very bad timing where there isn't a failover to the standby:

1) Master goes out
2) floating transaction applied to standby.
3) Standby goes out
4) Power back on
5) master comes up
6) standby comes up

It seems like, in that sequence, the standby would have one transaction
which the master doesn't have, yet the standby thinks it can continue
getting WAL from the master.  Or did I miss something which makes this
impossible?
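
To make the concern concrete, here is a sketch of the comparison that would
expose this state when the standby reconnects. It is an illustration under
assumptions, not existing PostgreSQL behaviour: WAL positions are flat 64-bit
values and check_stream_start() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for XLogRecPtr: a flat 64-bit byte position. */
typedef uint64_t WalPos;

typedef enum StreamStartCheck
{
    STREAM_START_OK,
    STANDBY_AHEAD_OF_MASTER
} StreamStartCheck;

/*
 * Hypothetical check at reconnect time: the standby reports the end of the
 * WAL it holds, and the master compares it with the end of its own WAL.
 */
static StreamStartCheck
check_stream_start(WalPos standby_wal_end, WalPos master_wal_end)
{
    if (standby_wal_end > master_wal_end)
        return STANDBY_AHEAD_OF_MASTER;
    return STREAM_START_OK;
}

/* Usage: the power-outage sequence described above. */
static void
power_outage_example(void)
{
    WalPos master_wal_end = 1000;   /* master lost the unflushed transaction */
    WalPos standby_wal_end = 1040;  /* standby received and kept it */

    assert(check_stream_start(standby_wal_end, master_wal_end) ==
           STANDBY_AHEAD_OF_MASTER);
}
```

A position comparison like this only catches the problem while the master is
still behind. Once the master has written new WAL past the standby's position,
the two histories have diverged and the positions alone no longer show it.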

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com
