Re: [HACKERS] Causal reads take II

2017-10-27 Thread Thomas Munro
On Sun, Oct 1, 2017 at 10:03 AM, Thomas Munro wrote:
>> I tried a few more times, and I've got it two times out of four attempts on
>> a fresh installation (when all instances were on the same machine). But
>> anyway I'll try to investigate; maybe it has something to do with my
>> environment.
>>
>> ...
>>
>> 2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number  in log
>> segment 00010020, offset 10092544
>
> Hi Dmitry,
>
> Thanks for testing.  Yeah, it looks like the patch may be corrupting
> the WAL stream in some case that I didn't hit in my own testing
> procedure.  I will try to reproduce these failures.

Hi Dmitry,

I managed to reproduce something like this on one of my home lab
machines running a different OS.  I'm not sure why yet, and it doesn't
happen on my primary development box, which is why I hadn't noticed
it.  I will investigate and aim to get a fix posted in time for the
Commitfest.  I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-09-30 Thread Thomas Munro
On Sun, Oct 1, 2017 at 9:05 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>>> Directory not empty
>>> ...
>>
>> Hmm.  The first error ("could not remove directory") could perhaps be
>> explained by temporary files from concurrent backends.
>> ...
>> Perhaps in your testing you accidentally copied a pgdata directory over the
>> top of it while it was running?  In any case I'm struggling to see how
>> anything in this patch would affect anything at the REDO level.
>
> Hmm... no, I don't think so. Basically what I was doing is just running
> `installcheck` against a primary instance (I assume there is nothing wrong
> with this approach, am I right?). This particular error was caused by the
> `tablespace` test, which failed in this case:
>
> ```
> INSERT INTO testschema.foo VALUES(1);
> ERROR:  could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390":
> No such file or directory
> ```
>
> I tried a few more times, and I've got it two times out of four attempts on
> a fresh installation (when all instances were on the same machine). But
> anyway I'll try to investigate; maybe it has something to do with my
> environment.
>
> ...
>
> 2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number  in log
> segment 00010020, offset 10092544

Hi Dmitry,

Thanks for testing.  Yeah, it looks like the patch may be corrupting
the WAL stream in some case that I didn't hit in my own testing
procedure.  I will try to reproduce these failures.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-09-30 Thread Dmitry Dolgov
> On 31 July 2017 at 07:49, Thomas Munro wrote:
>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>
>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a
>> few tests. I didn't manage to break anything, except one mysterious error
>> that I've got only once on one of my replicas, but I couldn't reproduce it
>> yet. The interesting thing is that this error did not affect another
>> replica or the primary. Just in case, here is the log for this error
>> (maybe you can see something obvious that I've not noticed):
>>
>> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>> Directory not empty
>> ...
>
> Hmm.  The first error ("could not remove directory") could perhaps be
> explained by temporary files from concurrent backends.
> ...
> Perhaps in your testing you accidentally copied a pgdata directory over the
> top of it while it was running?  In any case I'm struggling to see how
> anything in this patch would affect anything at the REDO level.

Hmm... no, I don't think so. Basically what I was doing is just running
`installcheck` against a primary instance (I assume there is nothing wrong
with this approach, am I right?). This particular error was caused by the
`tablespace` test, which failed in this case:

```
INSERT INTO testschema.foo VALUES(1);
ERROR:  could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390":
No such file or directory
```

I tried a few more times, and I've got it two times out of four attempts on
a fresh installation (when all instances were on the same machine). But
anyway I'll try to investigate; maybe it has something to do with my
environment.

>> * Also I noticed that some time-related values are hardcoded (e.g. the
>>   50%/25% time shift when we're dealing with leases). Does it make sense
>>   to move them out and make them configurable?
>
> These numbers are interrelated, and I think they're best fixed in that
> ratio.  You could make it more adjustable, but I think it's better to
> keep it simple with just a single knob.

Ok, but what do you think about converting them to constants to make them
more self-explanatory? Like:

```
+/*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew.  We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ */
+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```

Also I have another question. I tried to test this patch a little bit more,
and I've got some strange behaviour after pgbench (here is the full output
[1]):

```
# primary

$ ./bin/pgbench -s 100 -i test

NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
10 of 1000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
20 of 1000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
30 of 1000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG:  revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG:  standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG:  revoking synchronous replay lease for standby "walreceiver"...
40 of 1000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG:  standby "walreceiver" is no longer available for synchronous replay
```

```
# replica

2017-09-30 15:47:51.802 CEST [6034] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
		This probably means the server terminated abnormally
		before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number  in log segment 00010020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG:  started streaming WAL from primary at 0/2000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
		This probably means the server terminated abnormally
		before or while processing the request.

Is this something well known, or is it unrelated to the patch itself?

[1]: https://gist.github.com/erthalion/cdc9357f7437171192348239eb4db764


Re: [HACKERS] Causal reads take II

2017-09-06 Thread Thomas Munro
On Thu, Aug 10, 2017 at 2:02 PM, Thomas Munro wrote:
> Rebased after conflicting commit 030273b7.  Now using format-patch
> with a commit message to keep track of review/discussion history.

TAP test 006_logical_decoding.pl failed with that version.  I had
missed some places that know how to decode the wire protocol messages
I modified.  Fixed in the attached version.

It might be a good idea to consolidate the message encoding/decoding
logic into reusable routines, independently of this work.
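
For illustration, here is a standalone sketch of what such consolidated
routines might look like -- shared helpers that both the encoder and the
decoder call, so the wire layout is defined in exactly one place.  This is
not the server's actual pqformat API, and the message layout is invented:

#include <stddef.h>
#include <stdint.h>

/* Append a 64-bit value (e.g. an LSN) to buf in network byte order. */
size_t
encode_uint64(uint8_t *buf, uint64_t value)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t) (value >> (56 - 8 * i));
	return 8;
}

/* Read back a 64-bit network-byte-order value; must mirror the encoder. */
uint64_t
decode_uint64(const uint8_t *buf)
{
	uint64_t value = 0;

	for (int i = 0; i < 8; i++)
		value = (value << 8) | buf[i];
	return value;
}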

-- 
Thomas Munro
http://www.enterprisedb.com


synchronous-replay-v4.patch
Description: Binary data



Re: [HACKERS] Causal reads take II

2017-08-09 Thread Thomas Munro
On Mon, Jul 31, 2017 at 5:49 PM, Thomas Munro wrote:
>  Here is a version to put it back.

Rebased after conflicting commit 030273b7.  Now using format-patch
with a commit message to keep track of review/discussion history.

-- 
Thomas Munro
http://www.enterprisedb.com


synchronous-replay-v3.patch
Description: Binary data



Re: [HACKERS] Causal reads take II

2017-07-30 Thread Thomas Munro
On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a
> few tests. I didn't manage to break anything, except one mysterious error
> that I've got only once on one of my replicas, but I couldn't reproduce it
> yet. The interesting thing is that this error did not affect another
> replica or the primary. Just in case, here is the log for this error
> (maybe you can see something obvious that I've not noticed):
>
> ```
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
> Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211":
> Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
> Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211":
> Directory not empty
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG:  directories for tablespace 47733 could not be removed
> HINT:  You can remove the directories manually if necessary.
> CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> FATAL:  could not create directory "pg_tblspc/47734/PG_10_201707211/47732":
> File exists
> CONTEXT:  WAL redo at 0/125F5768 for Storage/CREATE:
> pg_tblspc/47734/PG_10_201707211/47732/47736
> LOG:  startup process (PID 8034) exited with exit code 1
> LOG:  terminating any other active server processes
> LOG:  database system is shut down
> ```

Hmm.  The first error ("could not remove directory") could perhaps be
explained by temporary files from concurrent backends, leaked files
from earlier crashes or copying a pgdata directory over the top of an
existing one as a way to set it up, leaving behind some files from an
earlier test?  The second error ("could not create directory") is a
bit stranger though... I think this must come from
TablespaceCreateDbspace(): it must have stat()'d the file and got
ENOENT, decided to create the directory, acquired
TablespaceCreateLock, stat()'d the file again and found it still
absent, then run mkdir() on the parents and got EEXIST and finally on
the directory to be created, and surprisingly got EEXIST.  That means
that someone must have concurrently created the directory.  Perhaps in
your testing you accidentally copied a pgdata directory over the top
of it while it was running?  In any case I'm struggling to see how
anything in this patch would affect anything at the REDO level.
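
To make that race window concrete, here is a standalone sketch of the
stat()-then-mkdir() sequence just described (simplified, without the lock;
not the actual TablespaceCreateDbspace code):

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

int
create_dir_if_missing(const char *path)
{
	struct stat st;

	if (stat(path, &st) == 0)
		return 0;				/* already exists, nothing to do */
	if (errno != ENOENT)
		return -1;				/* unexpected stat() failure */

	/* ... the real code takes TablespaceCreateLock and stats again ... */

	if (mkdir(path, 0700) == 0)
		return 0;				/* we created it */
	if (errno == EEXIST)
	{
		/*
		 * Someone created it between our stat() and mkdir(): the
		 * "surprising" concurrent-creation case described above.
		 */
		fprintf(stderr, "concurrent creation of \"%s\" detected\n", path);
		return 0;
	}
	return -1;					/* some other mkdir() failure */
}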

> And speaking about the code, so far I have just a few notes (some of them
> merely questions):
>
> * In general the idea behind this patch sounds interesting to me, but it
>   relies heavily on time synchronization. As mentioned in the documentation:
>   "Current hardware clocks, NTP implementations and public time servers are
>   unlikely to allow the system clocks to differ more than tens or hundreds
>   of milliseconds, and systems synchronized with dedicated local time
>   servers may be considerably more accurate." But as far as I remember from
>   my own experience, sometimes it may not be that trivial on something like
>   AWS because of virtualization. Maybe it's an unreasonable fear, but is it
>   possible to address this problem somehow?

Oops, I had managed to lose an important hunk that deals with
detecting excessive clock drift (ie badly configured servers) while
rebasing a couple of versions back.  Here is a version to put it back.

With that change, if you disable NTP and manually set your standby's
clock to be more than 1.25s (assuming synchronous_replay_lease_time is
set to the default of 5s) behind the primary, synchronous replay
should become unavailable and you should see this error in the
standby's log:

ereport(LOG,
        (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
         errhint("Check your servers' NTP configuration or equivalent.")));

One way to test this without messing with your NTP setting or
involving two different computers is to modify this code temporarily
in WalSndKeepalive:

now = GetCurrentTimestamp() + 1250100;

This is a best-effort check intended to detect a system that is not
running ntpd at all or is talking to an insane time server.
Fundamentally this proposal is based on the assumption that you can
get your system clocks into sync within a tolerance whose upper bound
we feel confident estimating.
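
To spell out the arithmetic, here is a sketch of the lease-time/skew
relationship, using the MAX_CLOCK_SKEW_PORTION name from the snippet quoted
upthread (the helper and the type are invented for illustration):

#define MAX_CLOCK_SKEW_PORTION 0.25

typedef long long TimestampOffsetUs;	/* microseconds, illustrative only */

/*
 * The primary subtracts max_clock_skew from the timestamp it sends, so that
 * even if the standby's clock is up to max_clock_skew behind the primary's,
 * the lease still expires on the standby no later than the primary assumes.
 * With the default lease time of 5s this gives 1.25s -- the threshold
 * mentioned above.
 */
TimestampOffsetUs
lease_timestamp_adjustment(TimestampOffsetUs lease_time_us)
{
	TimestampOffsetUs max_clock_skew = lease_time_us * MAX_CLOCK_SKEW_PORTION;

	return -max_clock_skew;		/* subtract from the timestamp we send */
}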

It does appear that some Amazon OS images come with NTP disabled;
that's a problem if you want to use this feature, but if you're
running a virtual server without an ntpd you'll pretty soon drift
seconds to minutes off UTC time and get "unavailable for synchronous
replay" errors from this patch (and possibly the LOG message above,
depending on the direction of drift).


Re: [HACKERS] Causal reads take II

2017-07-29 Thread Dmitry Dolgov
> On 12 July 2017 at 23:45, Thomas Munro wrote:
> I renamed it to "synchronous replay", because "causal reads" seemed a bit
> too arcane.

Hi

I looked through the code of `synchronous-replay-v1.patch` a bit and ran a
few tests. I didn't manage to break anything, except one mysterious error
that I've got only once on one of my replicas, but I couldn't reproduce it
yet. The interesting thing is that this error did not affect another replica
or the primary. Just in case, here is the log for this error (maybe you can
see something obvious that I've not noticed):

```
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211":
Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211":
Directory not empty
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG:  directories for tablespace 47733 could not be removed
HINT:  You can remove the directories manually if necessary.
CONTEXT:  WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
FATAL:  could not create directory "pg_tblspc/47734/PG_10_201707211/47732":
File exists
CONTEXT:  WAL redo at 0/125F5768 for Storage/CREATE:
pg_tblspc/47734/PG_10_201707211/47732/47736
LOG:  startup process (PID 8034) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down
```

And speaking about the code, so far I have just a few notes (some of them
merely questions):

* In general the idea behind this patch sounds interesting to me, but it
  relies heavily on time synchronization. As mentioned in the documentation:
  "Current hardware clocks, NTP implementations and public time servers are
  unlikely to allow the system clocks to differ more than tens or hundreds
  of milliseconds, and systems synchronized with dedicated local time
  servers may be considerably more accurate." But as far as I remember from
  my own experience, sometimes it may not be that trivial on something like
  AWS because of virtualization. Maybe it's an unreasonable fear, but is it
  possible to address this problem somehow?

* Also I noticed that some time-related values are hardcoded (e.g. the
  50%/25% time shift when we're dealing with leases). Does it make sense to
  move them out and make them configurable?

* Judging from the `SyncReplayPotentialStandby` function, it's possible to
  have `synchronous_replay_standby_names = "*, server_name"`, which is
  basically equivalent to just `*`, but it looks confusing. Is it worth
  preventing this behaviour?

* In the same function `SyncReplayPotentialStandby` there is this code:

```
if (!SplitIdentifierString(rawstring, ',', &elemlist))
{
	/* syntax error in list */
	pfree(rawstring);
	list_free(elemlist);
	/* GUC machinery will have already complained - no need to do again */
	return false;
}
```

  Am I right that ideally this situation (where at this point in the code
  `synchronous_replay_standby_names` has an incorrect value) should not
  happen, because the GUC machinery will prevent it? If yes, then it seems
  to me that it still makes sense to put a log message here, just to give
  more information in a potentially weird situation.

* In the function `SyncRepReleaseWaiters` there is a comment:

```
/*
 * If the number of sync standbys is less than requested or we aren't
 * managing a sync standby or a standby in synchronous replay state that
 * blocks then just leave.
 */
if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
```

  Is this comment correct? If I understand everything right, `!got_recptr`
  means the number of sync standbys is less than requested (a), `!am_sync`
  means we aren't managing a sync standby (b), and `walsender_sr_blocker`
  means a standby in synchronous replay state that blocks (c). So the
  condition looks like `(a or b) and not c`.

* In the function `ProcessStandbyReplyMessage` there is code that implements
  this:

```
 * If the standby reports that it has fully replayed the WAL for at least
 * 10 seconds, then let's clear the lag times that were measured when it
 * last wrote/flushed/applied a WAL record.  This way we avoid displaying
 * stale lag data until more WAL traffic arrives.
```

  but I haven't found any mention of these 10 seconds in the documentation.
  Is it not that important? Also, question 2 is related to this one.

* In the function `WalSndGetSyncReplayStateString` all the states are in
  lower case except `UNKNOWN`; is there any particular reason for that?

There are also a few more less important notes (mostly about some typos and
a few confusing names), but I'm going to do another round of review and
testing anyway, so I'll just send them all 

Re: [HACKERS] Causal reads take II

2017-07-20 Thread Thomas Munro
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs  wrote:
> On 3 January 2017 at 01:43, Thomas Munro wrote:
>
>> Here is a new version of my "causal reads" patch (see the earlier
>> thread from the 9.6 development cycle[1]), which provides a way to
>> avoid stale reads when load balancing with streaming replication.
>
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.
>
> This patch presumes we will load balance writes to a master and reads
> to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate
> routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write routing
>
> The explicit premise of the patch is that neither of the above options
> are practical, so I'm unclear how this makes sense. Is there some use
> case that you have in mind that has not been fully described? If so,
> let's get it on the table.
>
> What I think we need is a joined up plan for load balancing, so that
> we can understand how it will work. i.e. explain the whole use case
> and how the solution works.

Simon,

Here's a simple proof-of-concept Java web service using Spring Boot
that demonstrates how load balancing could be done with this patch.
It shows two different techniques for routing: an "adaptive" one that
learns which transactional methods need to run on the primary server
by intercepting errors, and a "declarative" one that respects Spring's
@Transactional(readOnly=true) annotations (inspired by the way people
use MySQL Connector/J with Spring to do load balancing).  Whole
transactions are automatically retried at the service request level on
transient failures using existing techniques (Spring Retry, as used
for handling deadlocks and serialisation failures etc), and the
"TransactionRouter" avoids servers that have recently raised the
"synchronous_replay not available" error.  Aside from the optional
annotations, the application code in KeyValueController.java is
unaware of any of this.

https://github.com/macdice/syncreplay-spring-demo

I suspect you could find ways to do similar things with basically any
application development stack that supports some kind of container
managed transactions.
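
For non-JVM stacks, the same "avoid standbys that recently raised the error"
idea can be sketched directly in libpq.  This is only an illustration: the
SQLSTATE below is a placeholder for whatever error code the patch actually
assigns, and the fixed-size tables stand in for real connection management:

#include <libpq-fe.h>
#include <string.h>
#include <time.h>

#define NUM_STANDBYS 2
#define AVOID_SECONDS 10		/* how long to avoid a failed standby */

static PGconn *standbys[NUM_STANDBYS];
static time_t avoid_until[NUM_STANDBYS];

/*
 * Run a read-only query on any standby not currently being avoided.  If a
 * standby raises the synchronous replay error, mark it to be avoided for a
 * while and move on; returning NULL tells the caller to use the primary.
 */
PGresult *
run_readonly(const char *sql)
{
	for (int i = 0; i < NUM_STANDBYS; i++)
	{
		PGresult   *res;
		const char *sqlstate;

		if (time(NULL) < avoid_until[i])
			continue;			/* recently failed, skip for now */

		res = PQexec(standbys[i], sql);
		sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

		/* "XXXXX" stands in for the patch's actual SQLSTATE. */
		if (sqlstate != NULL && strcmp(sqlstate, "XXXXX") == 0)
		{
			avoid_until[i] = time(NULL) + AVOID_SECONDS;
			PQclear(res);
			continue;			/* try the next standby */
		}
		return res;				/* success, or an unrelated error */
	}
	return NULL;
}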

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-07-12 Thread Thomas Munro
On Thu, Jul 13, 2017 at 2:51 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Thank you. I started to play with it a little bit, since I think it's an
> interesting idea. And there are already few notes:

Thanks Dmitry.

> * I don't see a CF item for that, where is it?

https://commitfest.postgresql.org/14/951/

The latest version of the patch is here:

https://www.postgresql.org/message-id/CAEepm%3D0YigNQczAF-%3Dx_SxT6cJv77Yb0EO%2BcAFnqRyVu4%2BbKFw%40mail.gmail.com

I renamed it to "synchronous replay", because "causal reads" seemed a
bit too arcane.

> * Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:
>
> ```
> +#causal_reads_standy_names = '*' # standby servers that can potentially become
> +                                 # available for causal reads; '*' = all
> +
> ```
>
> it should be `causal_reads_standby_names`.

Fixed in latest version (while renaming).

> Also I hope in the near future I can provide a full review.

Great news, thanks!

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-07-12 Thread Dmitry Dolgov
> On 23 June 2017 at 13:48, Thomas Munro wrote:
>
> Apologies for the extended delay.  Here is the rebased patch, now with a
> couple of improvements (see below).

Thank you. I started to play with it a little bit, since I think it's an
interesting idea. And there are already a few notes:

* I don't see a CF item for that, where is it?

* Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:

```
+#causal_reads_standy_names = '*' # standby servers that can potentially become
+                                 # available for causal reads; '*' = all
+
```

it should be `causal_reads_standby_names`. Also I hope in the near future I
can provide a full review.


Re: [HACKERS] Causal reads take II

2017-06-29 Thread Thomas Munro
On Tue, Jun 27, 2017 at 12:20 PM, Thomas Munro wrote:
> On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs  wrote:
>> What I think we need is a joined up plan for load balancing, so that
>> we can understand how it will work. i.e. explain the whole use case
>> and how the solution works.

Here's a proof-of-concept hack of the sort of routing and retry logic
that I think should be feasible with various modern application stacks
(given the right extensions):

https://github.com/macdice/py-pgsync/blob/master/DemoSyncPool.py

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-06-29 Thread Thomas Munro
On Sat, Jun 24, 2017 at 4:05 PM, Thomas Munro wrote:
> On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro wrote:
>> Maybe it needs a better name.
>
> Ok, how about this: the feature could be called "synchronous replay".
> The new column in pg_stat_replication could be called sync_replay
> (like the other sync_XXX columns).  The GUCs could be called
> synchronous_replay, synchronous_replay_max_lag and
> synchronous_replay_lease_time.  The language in log messages could
> refer to standbys "joining the synchronous replay set".

Feature hereby renamed that way.  It seems a lot more
self-explanatory.  Please see attached.

-- 
Thomas Munro
http://www.enterprisedb.com


synchronous-replay-v1.patch
Description: Binary data


test-synchronous-replay.sh
Description: Bourne shell script
/*
 * A simple test program to test performance and visibility with the
 * synchronous replay patch.
 *
 * Each test loop updates a row on the primary, and then optionally checks if
 * it can see that change immediately on a standby.  If you do this with
 * standard async replication, you should occasionally see an assertion fail
 * if run with --check (depending on the vaguaries of timing -- I can
 * reproduce this very reliably on my system).  If you do it with traditional
 * sync rep, it becomes a little bit less likely (but it's still reliably
 * reproducible on my system).  If you do it with traditional sync rep set up,
 * and "--synchronous-commit remote_apply" then it should no longer be
 * possible to trigger that assertion, but that mode doesn't handle failures.
 * If you do it with --synchronous-replay then you should not be able to
 * reproduce it, no matter which standby you connect to.  If you're using
 * --check and the standby gets dropped (perhaps because you
 * break/disconnect/pause it etc) you should never see that assertion fail
 * (= SELECT running but seeing stale data), instead you should see an error
 * when running the SELECT.
 *
 * Arguments:
 *
 *  --primary  how to connect to the primary
 *  --standby  how to connect to the standby to check
 *  --check   check that the update is visible on standby
 *  --synchronous-replay  enable synchronous replay
 *  --synchronous-commit LEVELset synchronous_commit to LEVEL
 *  --two-phase-commitenable two-phase commit
 *  --loops COUNT how many loops to run through
 *  --verbose chatter
 */

#include <libpq-fe.h>

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static double
get_hires_time(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int
main(int argc, char *argv[])
{
	PGconn *primary;
	PGconn *standby;
	PGresult *result;
	int i;
	int loops = 1;
	char buffer[1024];
	const char *synchronous_commit = "on";
	bool synchronous_replay = false;
	const char *primary_connstr = "dbname=postgres port=5432";
	const char *standby_connstr = "dbname=postgres port=5441";
	bool check_applied = false;
	bool verbose = false;
	bool two_phase_commit = false;
	double start_time;

	for (i = 1; i != argc; ++i)
	{
		bool more = (i < argc - 1);

		if (strcmp(argv[i], "--verbose") == 0)
			verbose = true;
		else if (strcmp(argv[i], "--check") == 0)
			check_applied = true;
		else if (strcmp(argv[i], "--synchronous-commit") == 0 && more)
			synchronous_commit = argv[++i];
		else if (strcmp(argv[i], "--synchronous-replay") == 0)
			synchronous_replay = true;
		else if (strcmp(argv[i], "--primary") == 0 && more)
			primary_connstr = argv[++i];
		else if (strcmp(argv[i], "--standby") == 0 && more)
			standby_connstr = argv[++i];
		else if (strcmp(argv[i], "--loops") == 0 && more)
			loops = atoi(argv[++i]);
		else if (strcmp(argv[i], "--two-phase-commit") == 0)
			two_phase_commit = true;
		else
		{
			fprintf(stderr, "bad argument\n");
			exit(1);
		}
	}

	primary = PQconnectdb(primary_connstr);
	assert(PQstatus(primary) == CONNECTION_OK);

	standby = PQconnectdb(standby_connstr);
	assert(PQstatus(standby) == CONNECTION_OK);

	snprintf(buffer, sizeof(buffer), "SET synchronous_commit = %s", synchronous_commit);
	result = PQexec(primary, buffer);
	assert(PQresultStatus(result) == PGRES_COMMAND_OK);
	PQclear(result);
	if (synchronous_replay)
	{
		snprintf(buffer, sizeof(buffer), "SET synchronous_replay = on");
		result = PQexec(primary, buffer);
		assert(PQresultStatus(result) == PGRES_COMMAND_OK);
		PQclear(result);
	}

	snprintf(buffer, sizeof(buffer), "SET synchronous_commit = %s", synchronous_commit);
	result = PQexec(standby, buffer);
	assert(PQresultStatus(result) == PGRES_COMMAND_OK);
	PQclear(result);
	if (synchronous_replay)
	{
		snprintf(buffer, sizeof(buffer), "SET synchronous_replay = on");
		result = PQexec(standby, buffer);
		assert(PQresultStatus(result) == PGRES_COMMAND_OK);
		PQclear(result);
	}

	result = PQexec(primary, "CREATE TABLE counter AS SELECT 0 

Re: [HACKERS] Causal reads take II

2017-06-26 Thread Thomas Munro
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs  wrote:
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.

One technical problem was raised in the earlier thread by Ants Aasma
that I concede may be fatal to this design (see note below about
read-follows-read below), but I'm not sure.  All the other discussion
seemed to be about trade-offs between writer-waits and reader-waits
schemes, both of which I still view as reasonable options for an end
user to have in the toolbox.  Your initial reaction was:

> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait.

I agree with you 100%, up to the comma.  The difficulty is identifying
which transactions are part of a write-read sequence.  An
application-managed LSN tracking system allows for waits to occur
strictly in reads that are part of a write-read sequence because the
application links them explicitly, and nobody is arguing that we
shouldn't support that for hard working expert users.  But to support
applications that don't deal with LSNs (or some kind of "causality
tokens") explicitly I think we'll finish up having to make readers
wait for incidental later transactions too, not just the write that
your read is dependent on, as I'll show below.  When you can't
identify write-read sequences perfectly, it comes down to a choice
between taxing writers or taxing readers, and I'm not sure that one
approach is universally better.  Let me summarise my understanding of
that trade-off.

I'm going to use this terminology:

synchronous replay = my proposal: ask the primary server to wait until
standbys have applied tx1, a bit like 9.6 synchronous_commit =
remote_apply, but with a system of leases for graceful failure.

causality tokens = the approach Heikki mentioned: provide a way for
tx1 to report its commit LSN to the client, then provide a way for a
client to wait for the LSN to be replayed before taking a snapshot for
tx2.

tx1, tx2 = a pair of transactions with a causal dependency; we want
tx2 to see tx1 because tx1 caused tx2 in some sense so must be seen to
precede it.

A poor man's causality token system can be cobbled together today with
pg_current_wal_lsn() and a polling loop that checks
pg_last_wal_replay_lsn().  It's a fairly obvious thing to want to do.
Several people including Heikki Linnakangas, Craig Ringer, Ants Aasma,
Ivan Kartyshov and probably many others have discussed better ways to
do that[1], and a patch for the wait-for-LSN piece appeared in a
recent commitfest[2].  I reviewed Ivan's patch and voted -1 only
because it didn't work for higher isolation levels.  If he continues
to develop that I will be happy to review and help get it into
committable shape, and if he doesn't I may try to develop it myself.
In short, I think this is a good tool to have in the toolbox and
PostgreSQL 11 should have it!  But I don't think it necessarily
invalidates my synchronous replay proposal: they represent different
sets of trade-offs and might appeal to different users.  Here's why:

To actually use a causality token system you either need a carefully
designed application that keeps track of causal dependencies and
tokens, in which case the developer works harder but can benefit from
from an asynchronous pipelining effect (by the time tx2 runs we hope
that tx1 has been applied, so neither transaction had to wait).  Let's
call that "application-managed causality tokens".  That's great for
those users -- let's make that possible -- but most users don't want
to write code like that.  So I see approximately three choices for
transparent middleware (or support built into standbys), which I'll
name and describe as follows:

1.  "Panoptic" middleware: Sees all queries so that it can observe
commit LSNs and inject wait directives into all following read-only
transactions.  Since all queries now need to pass through a single
process, you have a new bottleneck, an extra network hop, and a single
point of failure so you'll probably want a failover system with
split-brain defences.

2.  "Hybrid" middleware: The middleware (or standby itself) listens to
the replication stream so that it knows about commit LSNs (rather than
spying on commits).  The primary server waits until all connected
middleware instances acknowledge commit LSNs before it releases
commits, and then the middleware inserts wait-for-LSN directives into
read-only transactions.  Now there is no single point problem, but
writers are impacted too.  I mention this design because I believe
this is conceptually similar to how Galera wsrep_sync_wait (AKA
wsrep_causal_reads) works.  (I call this "hybrid" because it splits
the waiting between tx1 and tx2.  Since it's synchronous, dealing with

Re: [HACKERS] Causal reads take II

2017-06-24 Thread Simon Riggs
On 3 January 2017 at 01:43, Thomas Munro  wrote:

> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

I'm very happy that you are addressing this topic.

I noticed you didn't put in links my earlier doubts about this
specific scheme, though I can see doubts from myself and Heikki at
least in the URLs. I maintain those doubts as to whether this is the
right way forwards.

This patch presumes we will load balance writes to a master and reads
to a pool of standbys. How will we achieve that?

1. We decorate the application with additional info to indicate
routing/write concerns.
2. We get middleware to do routing for us, e.g. pgpool style read/write routing

The explicit premise of the patch is that neither of the above options
are practical, so I'm unclear how this makes sense. Is there some use
case that you have in mind that has not been fully described? If so,
let's get it on the table.

What I think we need is a joined up plan for load balancing, so that
we can understand how it will work. i.e. explain the whole use case
and how the solution works.

I'm especially uncomfortable with any approaches that treat all
sessions as one pool. For me, a server should support multiple pools.
Causality seems to be a property of a particular set of pools. e.g.
PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Causal reads take II

2017-06-23 Thread Thomas Munro
On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro wrote:
> Apply the patch after first applying a small bug fix for replication
> lag tracking[4].  Then:

That bug fix was committed, so now causal-reads-v17.patch can be
applied directly on top of master.

> 1.  Set up some streaming replicas.
> 2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in
> the primary's postgresql.conf.
> 3.  Set causal_reads = on in some transactions on various nodes.
> 4.  Try to break it!

Someone asked me off-list how to set this up quickly and easily for
testing.  Here is a shell script that will start up a primary server
(port 5432) and 3 replicas (ports 5441 to 5443).  Set the two paths at
the top of the file before running it.  Log in with psql postgres [-p
<port>], then SET causal_reads = on to test its effect.
causal_reads_max_replay_lag is set to 2s, and depending on your
hardware you might find that stuff like CREATE TABLE big_table AS
SELECT generate_series(1, 1000) or a large COPY data load causes
replicas to be kicked out of the set after a while; you can also pause
replay on the replicas with SELECT pg_wal_replay_pause() and
pg_wal_replay_resume(), kill -STOP/-CONT or -9 the walreceiver
processes to simulate various failure modes, or run the replicas
remotely and unplug the network.  SELECT application_name, replay_lag,
causal_reads_state FROM pg_stat_replication to see the current
situation, and also monitor the primary's LOG messages about
transitions.  You should find that the
"read-your-writes-or-fail-explicitly" guarantee is upheld no matter
what you do, and furthermore that failing or lagging replicas don't
hold the primary up very long: in the worst case
causal_reads_lease_time for lost contact, and in the best case the
time to exchange a couple of messages with the standby to tell it its
lease is revoked and it should start raising an error.  You might find
test-causal-reads.c[1] useful for testing.

> Maybe it needs a better name.

Ok, how about this: the feature could be called "synchronous replay".
The new column in pg_stat_replication could be called sync_replay
(like the other sync_XXX columns).  The GUCs could be called
synchronous_replay, synchronous_replay_max_lag and
synchronous_replay_lease_time.  The language in log messages could
refer to standbys "joining the synchronous replay set".

Restating the purpose of the feature with that terminology:  If
synchronous_replay is set to on, then you see the effects of all
synchronous_replay = on transactions that committed before your
transaction began, or an error is raised if that is not possible on
the current node.  This allows applications to direct read-only
queries to read-only replicas for load balancing without seeing stale
data.  Is that clearer?

Restating the relationship with synchronous replication with that
terminology: while synchronous_commit and synchronous_standby_names
are concerned with distributed durability, synchronous_replay is
concerned with distributed visibility.  While the former prevents
commits from returning if the configured level of durability isn't met
(for example "must be flushed on master + any 2 standbys"), the latter
will simply drop any standbys from the synchronous replay set if they
fail or lag more than synchronous_replay_max_lag.  It is reasonable to
want to use both features at once:  my policy on distributed
durability might be that I want all transactions to be flushed to disk
on master + any of three servers before I report information to users,
and my policy on distributed visibility might be that I want to be
able to run read-only queries on any of my six read-only replicas, but
don't want to wait for any that lag by more than 1 second.
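
To make that concrete, a configuration expressing both policies at once
might look like this (the synchronous_replay_* names are from this patch;
the quorum syntax is stock PostgreSQL 10):

# distributed durability: wait for flush on any one of three named standbys
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (s1, s2, s3)'

# distributed visibility: any standby may serve synchronous replay, but is
# dropped from the set if it lags by more than 1 second
synchronous_replay_standby_names = '*'
synchronous_replay_max_lag = 1s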

Thoughts?

[1] 
https://www.postgresql.org/message-id/CAEepm%3D3NF%3D7eLkVR2fefVF9bg6RxpZXoQFmOP3RWE4r4iuO7vg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


test-causal-reads.sh
Description: Bourne shell script



Re: [HACKERS] Causal reads take II

2017-06-23 Thread Thomas Munro
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro wrote:
>> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> I'm wondering about status of this patch and how can I try it out?
>
> I ran into a problem while doing this, and it may take a couple more
> days to fix it since I am at pgcon this week.  More soon.

Apologies for the extended delay.  Here is the rebased patch, now with
a couple of improvements (see below).  To recap, this is the third
part of the original patch series[1], which had these components:

1.  synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2.  replication lag tracking, committed in PostgreSQL 10
3.  causal_reads, the remaining part, hereby proposed for PostgreSQL 11

The goal is to allow applications to move arbitrary read-only
transactions to physical replica databases and still know that they
can see all preceding write transactions or get an error.  It's
something like regular synchronous replication with synchronous_commit
= remote_apply, except that it limits the impact on the primary and
handles failure transitions with defined semantics.

The inspiration for this kind of distributed read-follows-write
consistency using read leases was a system called Comdb2[2][3], whose
designer encouraged me to try to extend Postgres's streaming
replication to do something similar.  Read leases can also be found in
some consensus systems like Google Megastore, albeit in more ambitious
form IIUC.  The name is inspired by a MySQL Galera feature
(approximately the same feature but the approach is completely
different; Galera adds read latency, whereas this patch does not).
Maybe it needs a better name.

Is this is a feature that people want to see in PostgreSQL?

IMPROVEMENTS IN V17

The GUC to enable the feature is now called
"causal_reads_max_replay_lag".  Standbys listed in
causal_reads_standby_names whose pg_stat_replication.replay_lag
doesn't exceed that time are "available" for causal reads and will be
waited for by the primary when committing.  When they exceed that
threshold they are briefly in "revoking" state and then "unavailable",
and when they return to an acceptable level they are briefly in
"joining" state before reaching "available".  CR states appear in
pg_stat_replication and transitions are logged at LOG level.

A new GUC called "causal_reads_lease_time" controls the lifetime of
read leases sent from the primary to the standby.  This affects the
frequency of lease replacement messages, and more importantly affects
the worst case of commit stall that can be introduced if connectivity
to a standby is lost and we have to wait for the last sent lease to
expire.  In the previous version, one single GUC controlled both
maximum tolerated replay lag and lease lifetime, which was good from
the point of view that fewer GUCs are better, but bad because it had
to be set fairly high when doing both jobs to be conservative about
clock skew.  The lease lifetime must be at least 4 x maximum tolerable
clock skew.  After the recent botching of a leap-second transition on
a popular public NTP network (TL;DR OpenNTP is not a good choice of
implementation to add to a public time server pool) I came to the
conclusion that I wouldn't want to recommend a default max clock skew
under 1.25s, to allow for some servers to be confused about leap
seconds for a while or to be running different smearing algorithms.  A
reasonable causal_reads_lease_time recommendation for people who don't
know much about the quality of their time source might therefore be
5s.  I think it's reasonable to want to set the maximum tolerable
replay lag to lower time than that, or in fact as low as you like,
depending on your workload and hardware.  Therefore I decided to split
the old "causal_reads_timeout" GUC into "causal_reads_max_replay_lag"
and "causal_reads_lease_time".

This new version introduces fast lease revocation.  Whenever the
primary decides that a standby is not keeping up, it kicks it out of
the set of CR-available standbys and revokes its lease, so that anyone
trying to run causal reads transactions there will start receiving a
new error.  In the previous version, it always did that by blocking
commits while waiting for the most recently sent lease to expire,
which I now call "slow revocation" because it could take several
seconds.  Now it blocks commits only until the standby acknowledges
that it is no longer available for causal reads OR the lease expires:
ideally that takes the time of a network a round trip.  Slow
revocation is still needed in various failure cases such as lost
connectivity.
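
In outline, the wait on the primary during revocation now looks something
like this (a standalone sketch; the names and types are invented for
illustration, not the patch's actual walsender code):

#include <stdbool.h>
#include <time.h>

typedef struct StandbyLease
{
	time_t	last_sent_lease_expiry;	/* when the last lease we sent lapses */
	bool	acked_revocation;		/* standby confirmed it stopped serving */
} StandbyLease;

/*
 * Block until the standby acknowledges revocation (fast: about one network
 * round trip) or the last-sent lease expires by itself (slow: up to the
 * lease lifetime, needed when connectivity is lost).
 */
void
wait_for_revocation(StandbyLease *lease, bool (*poll_for_ack)(StandbyLease *))
{
	while (!lease->acked_revocation &&
		   time(NULL) < lease->last_sent_lease_expiry)
		lease->acked_revocation = poll_for_ack(lease);
}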

TESTING

Apply the patch after first applying a small bug fix for replication
lag tracking[4].  Then:

1.  Set up some streaming replicas.
2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in
the primary's postgresql.conf.
3.  Set causal_reads = on in some transactions on various nodes.
4.  Try to break it!

As long as your 

Re: [HACKERS] Causal reads take II

2017-05-23 Thread Thomas Munro
On Mon, May 22, 2017 at 6:32 AM, Thomas Munro wrote:
> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>> I'm wondering about the status of this patch and how I can try it out?
>
> Hi Dmitry, thanks for your interest.
>
>>> On 3 January 2017 at 02:43, Thomas Munro wrote:
>>> The replay lag tracking patch this depends on is in the current commitfest
>>
>> I assume you're talking about this patch [1] (at least it's the only thread
>> where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
>> returned with feedback, so what's the status of this one (`causal reads`)?
>
> Right, replay lag tracking was committed.  I'll post a rebased causal
> reads patch later today.

I ran into a problem while doing this, and it may take a couple more
days to fix it since I am at pgcon this week.  More soon.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-05-21 Thread Thomas Munro
On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> I'm wondering about the status of this patch and how I can try it out?

Hi Dmitry, thanks for your interest.

>> On 3 January 2017 at 02:43, Thomas Munro wrote:
>> The replay lag tracking patch this depends on is in the current commitfest
>
> I assume you're talking about this patch [1] (at least it's the only thread
> where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
> returned with feedback, so what's the status of this one (`causal reads`)?

Right, replay lag tracking was committed.  I'll post a rebased causal
reads patch later today.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-05-21 Thread Dmitry Dolgov
Hi

I'm wondering about the status of this patch and how I can try it out?

> On 3 January 2017 at 02:43, Thomas Munro wrote:
> The replay lag tracking patch this depends on is in the current commitfest

I assume you're talking about this patch [1] (at least it's the only thread
where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
returned with feedback, so what's the status of this one (`causal reads`)?

> First apply replay-lag-v16.patch, then refactor-syncrep-exit-v16.patch,
> then causal-reads-v16.patch.

It would be nice to have all three of them attached (for some reason I see
only the last two of them in this thread). But anyway there are a lot of
failed hunks when I'm trying to apply `replay-lag-v16.patch`,
`refactor-syncrep-exit-v16.patch` and `causal-reads-v16.patch` (or the
last two of them separately).

[1] https://commitfest.postgresql.org/12/920/


Re: [HACKERS] Causal reads take II

2017-01-19 Thread Thomas Munro
On Fri, Jan 20, 2017 at 3:01 AM, Ants Aasma  wrote:
> Yes there is a race even with all transactions having the same
> synchronization level. But nobody will mind if we some day fix that
> race. :)

We really should fix that!

> With different synchronization levels it is much trickier to
> fix as either async commits must wait behind sync commits before
> becoming visible, return without becoming visible or visibility order
> must differ from commit record LSN order. The first makes the async
> commit feature useless, second seems a no-go for semantic reasons,
> third requires extra information sent to standbys so they know the
> actual commit order.

Thought experiment:

1.  Log commit and make visible atomically (so DO and REDO agree on
visibility order).
2.  Introduce flag 'not yet visible' to commit record for sync rep commits.
3.  Introduce a new log record 'make all invisible commits up to LSN X
visible', which is inserted when enough sync rep standbys reply.  Log
this + make visible on primary atomically (again, so DO and REDO agree
on visibility order).
4.  Teach GetSnapshotData to deal with this using that flag.

Now standby and primary agree on visibility order of async and sync
transactions, and no standby will allow you to see transactions that
the primary doesn't yet consider to be durable (ie flushed on a quorum
of standbys etc).  But... sync rep has to flush xlog twice on primary,
and standby has to wait to make things visible, and remote_apply would
either need to be changed or supplemented with a new level
remote_apply_and_visible, and it's not obvious how to actually do
atomic visibility + logging (I heard ProcArrayLock is kinda hot...).
Hmm.  Doesn't sound too popular...

>>> In CSN based snapshot
>>> discussions we came to the conclusion that to make standby visibility
>>> order match master while still allowing for async transactions to
>>> become visible before they are durable we need to make the commit
>>> sequence a vector clock and transmit extra visibility ordering
>> information to standbys. Having one more level of delay between wal
>>> logging of commit and making it visible would make the problem even
>>> worse.
>>
>> I'd like to read that... could you please point me at the right bit of
>> that discussion?
>
> Some of that discussion was face to face at pgconf.eu, some of it is
> here: 
> https://www.postgresql.org/message-id/CA+CSw_vbt=cwluogr7gxdpnho_y4cz7x97+o_bh-rfo7ken...@mail.gmail.com
>
> Let me know if you have any questions.

Thanks!  That may take me some time...

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-01-19 Thread Ants Aasma
On Thu, Jan 19, 2017 at 2:22 PM, Thomas Munro wrote:
> On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma  wrote:
>> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro wrote:
>>> Long term, I think it would be pretty cool if we could develop a set
>>> of features that give you distributed sequential consistency on top of
>>> streaming replication.  Something like (this | causality-tokens) +
>>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>>> distributed-dirty-read-prevention[4].
>>
>> Is it necessary that causal writes wait for replication before making
>> the transaction visible on the master? I'm asking because the per tx
>> variable wait time between logging commit record and making
>> transaction visible makes it really hard to provide matching
>> visibility order on master and standby.
>
> Yeah, that does seem problematic.  Even with async replication or no
> replication, isn't there already a race in CommitTransaction() where
> two backends could reach RecordTransactionCommit() in one order but
> ProcArrayEndTransaction() in the other order?  AFAICS using
> synchronous replication in one of the transactions just makes it more
> likely you'll experience such a visibility difference between the DO
> and REDO histories (!), by making RecordTransactionCommit() wait.
> Nothing prevents you getting a snapshot that can see t2 but not t1 in
> the DO history, while someone doing PITR or querying an asynchronous
> standby gets a snapshot that can see t1 but not t2 because those
> replay the REDO history.

Yes there is a race even with all transactions having the same
synchronization level. But nobody will mind if we some day fix that
race. :) With different synchronization levels it is much trickier to
fix as either async commits must wait behind sync commits before
becoming visible, return without becoming visible or visibility order
must differ from commit record LSN order. The first makes the async
commit feature useless, second seems a no-go for semantic reasons,
> third requires extra information sent to standbys so they know the
actual commit order.

>> In CSN based snapshot
>> discussions we came to the conclusion that to make standby visibility
>> order match master while still allowing for async transactions to
>> become visible before they are durable we need to make the commit
>> sequence a vector clock and transmit extra visibility ordering
>> information to standbys. Having one more level of delay between wal
>> logging of commit and making it visible would make the problem even
>> worse.
>
> I'd like to read that... could you please point me at the right bit of
> that discussion?

Some of that discussion was face to face at pgconf.eu, some of it is
here: 
https://www.postgresql.org/message-id/CA+CSw_vbt=cwluogr7gxdpnho_y4cz7x97+o_bh-rfo7ken...@mail.gmail.com

Let me know if you have any questions.

>> It seems that fixing that would
>> require either keeping some per client state or a global agreement on
>> what snapshots are safe to provide, both of which you tried to avoid
>> for this feature.
>
> Agreed.  You briefly mentioned this problem in the context of pairs of
> read-only transactions a while ago[1].  As you said then, it does seem
> plausible to do that with a token system that gives clients the last
> commit LSN from the snapshot used by a read only query, so that you
> can ask another standby to make sure that LSN has been replayed before
> running another read-only transaction.  This could be handled
> explicitly by a motivated client that is talking to multiple nodes.  A
> more general problem is client A telling client B to go and run
> queries and expecting B to see all transactions that A has seen; it
> now has to pass the LSN along with that communication, or rely on some
> kind of magic proxy that sees all transactions, or a radically
> different system with a GTM.

If/when we do CSN based snapshots, adding a GTM could be relatively
straightforward. It's basically not all that far from what Spanner is
doing by using a timestamp as the snapshot. But this is all relatively
independent of this patch.

Regards,
Ants Aasma




Re: [HACKERS] Causal reads take II

2017-01-19 Thread Thomas Munro
On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma  wrote:
> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro wrote:
>> Long term, I think it would be pretty cool if we could develop a set
>> of features that give you distributed sequential consistency on top of
>> streaming replication.  Something like (this | causality-tokens) +
>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>> distributed-dirty-read-prevention[4].
>
> Is it necessary that causal writes wait for replication before making
> the transaction visible on the master? I'm asking because the per tx
> variable wait time between logging commit record and making
> transaction visible makes it really hard to provide matching
> visibility order on master and standby.

Yeah, that does seem problematic.  Even with async replication or no
replication, isn't there already a race in CommitTransaction() where
two backends could reach RecordTransactionCommit() in one order but
ProcArrayEndTransaction() in the other order?  AFAICS using
synchronous replication in one of the transactions just makes it more
likely you'll experience such a visibility difference between the DO
and REDO histories (!), by making RecordTransactionCommit() wait.
Nothing prevents you getting a snapshot that can see t2 but not t1 in
the DO history, while someone doing PITR or querying an asynchronous
standby gets a snapshot that can see t1 but not t2 because those
replay the REDO history.

> In CSN based snapshot
> discussions we came to the conclusion that to make standby visibility
> order match master while still allowing for async transactions to
> become visible before they are durable we need to make the commit
> sequence a vector clock and transmit extra visibility ordering
> information to standbys. Having one more level of delay between wal
> logging of commit and making it visible would make the problem even
> worse.

I'd like to read that... could you please point me at the right bit of
that discussion?

> One other thing that might be an issue for some users is that this
> patch only ensures that clients observe forwards progress of database
> state after a writing transaction. With two consecutive read only
> transactions that go to different servers a client could still observe
> database state going backwards.

True.  This patch is about "read your writes", not, erm, "read your
reads".  That may indeed be problematic for some users.  It's not a
very satisfying answer but I guess you could run a dummy write query
on the primary every time you switch between standbys, or before
telling any other client to run read-only queries after you have done
so, in order to convert your "r r" sequence into a "r w r" sequence...

> It seems that fixing that would
> require either keeping some per client state or a global agreement on
> what snapshots are safe to provide, both of which you tried to avoid
> for this feature.

Agreed.  You briefly mentioned this problem in the context of pairs of
read-only transactions a while ago[1].  As you said then, it does seem
plausible to do that with a token system that gives clients the last
commit LSN from the snapshot used by a read only query, so that you
can ask another standby to make sure that LSN has been replayed before
running another read-only transaction.  This could be handled
explicitly by a motivated client that is talking to multiple nodes.  A
more general problem is client A telling client B to go and run
queries and expecting B to see all transactions that A has seen; it
now has to pass the LSN along with that communication, or rely on some
kind of magic proxy that sees all transactions, or a radically
different system with a GTM.
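
For what it's worth, the explicit client-managed version of that token
handoff is straightforward to sketch with libpq against stock PostgreSQL 10
functions (the polling interval and iteration cap here are arbitrary
choices for illustration):

#include <libpq-fe.h>
#include <string.h>
#include <unistd.h>

/*
 * Wait until a standby has replayed past the given LSN token, which the
 * client captured earlier on the primary with SELECT pg_current_wal_lsn()
 * after its write transaction committed.  Returns 1 once it is safe to run
 * the dependent read-only query on this standby.
 */
int
wait_for_lsn(PGconn *standby, const char *token_lsn, int max_polls)
{
	const char *params[1] = { token_lsn };

	while (max_polls-- > 0)
	{
		PGresult   *res;
		int			caught_up;

		res = PQexecParams(standby,
						   "SELECT pg_last_wal_replay_lsn() >= $1::pg_lsn",
						   1, NULL, params, NULL, NULL, 0);
		caught_up = PQresultStatus(res) == PGRES_TUPLES_OK &&
			strcmp(PQgetvalue(res, 0, 0), "t") == 0;
		PQclear(res);
		if (caught_up)
			return 1;
		usleep(10000);			/* poll every 10ms */
	}
	return 0;					/* gave up; pick another node or error out */
}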

[1] 
https://www.postgresql.org/message-id/ca%2bcsw_u4vy5fsbjvc7qms6puzl7qv90%2bonbetk9pfqosnj0...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Causal reads take II

2017-01-18 Thread Ants Aasma
On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro wrote:
> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

Thanks for working on this. It will let us do something a lot of
people have been asking for.

> Long term, I think it would be pretty cool if we could develop a set
> of features that give you distributed sequential consistency on top of
> streaming replication.  Something like (this | causality-tokens) +
> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
> distributed-dirty-read-prevention[4].

Is it necessary that causal writes wait for replication before making
the transaction visible on the master? I'm asking because the per tx
variable wait time between logging commit record and making
transaction visible makes it really hard to provide matching
visibility order on master and standby. In CSN based snapshot
discussions we came to the conclusion that to make standby visibility
order match master while still allowing for async transactions to
become visible before they are durable we need to make the commit
sequence a vector clock and transmit extra visibility ordering
information to standbys. Having one more level of delay between wal
logging of commit and making it visible would make the problem even
worse.

One other thing that might be an issue for some users is that this
patch only ensures that clients observe forwards progress of database
state after a writing transaction. With two consecutive read only
transactions that go to different servers a client could still observe
database state going backwards. It seems that fixing that would
require either keeping some per client state or a global agreement on
what snapshots are safe to provide, both of which you tried to avoid
for this feature.

Regards,
Ants Aasma

