Hi,

Here is a new version of the patch with a few small improvements:

1.  Adopted the term '[read] lease', replacing various hand-wavy language
in the comments and code.  That seems to be the established term for this
approach[1].

2.  Reduced the stalling time on failure.  When things go wrong with a
standby (such as losing contact with it), instead of stalling for a
conservative amount of time longer than any lease that might have been
granted, the primary now stalls only until the expiry of the last lease
that actually was granted to a given dropped standby, which should be
sooner.

3.  Fixed a couple of bugs that showed up in testing and review (some bad
flow control in the signal handling, and a bug in a circular buffer), and
changed the recovery->walreceiver wakeup signal handling to block the
signal except while waiting in walrcv_receive (it didn't seem a good idea
to interrupt arbitrary syscalls in walreceiver so I thought that would be an
improvement; but of course that area's going to be reworked by Simon's
patch anyway, as discussed elsewhere).
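
To illustrate point 2, here is a minimal sketch of the two stall policies
in C.  It is not code from the patch: the names ts_usec,
stall_until_conservative and stall_until_last_lease are made up for this
example, timestamps are plain microsecond counts, and clock skew between
nodes is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timestamp type: microseconds since some epoch. */
typedef int64_t ts_usec;

/*
 * Old policy as described above: stall for a conservative period that
 * outlasts any lease that might have been granted.  max_lease_usec is an
 * assumed upper bound on lease duration.
 */
ts_usec
stall_until_conservative(ts_usec now, ts_usec max_lease_usec)
{
	assert(max_lease_usec >= 0);
	return now + max_lease_usec;
}

/*
 * New policy as described above: stall only until the last lease that was
 * actually granted to the dropped standby expires.  If that lease has
 * already expired, there is nothing left to wait for.
 */
ts_usec
stall_until_last_lease(ts_usec now, ts_usec last_lease_expiry)
{
	return last_lease_expiry > now ? last_lease_expiry : now;
}
```

For example, with now = 1000 and a maximum lease time of 500, the
conservative policy stalls until 1500, whereas a last lease expiring at
1200 lets the primary stop stalling at 1200.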

Restating the central idea using the new terminology:  So long as they are
replaying fast enough, the primary grants a series of causal reads leases
to standbys, allowing them to handle causal reads queries locally without
any inter-node communication for a limited time.  A lease is a promise
that, before returning from commit, the primary will wait until the standby
has either applied the commit record or been dropped from the set of
available causal reads standbys and knows that it has been dropped.  That
is what upholds the causal reads guarantee.  In the worst case the primary
can do that by waiting for the most recently granted lease to expire.
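
The rule can be sketched in C roughly as follows.  This is only an
illustration of the idea, not the patch's code: StandbyLease, lsn_t,
ts_usec and both function names are hypothetical, only the worst case
(waiting out the lease) is modelled for a dropped standby, and clock skew
between nodes is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t ts_usec;	/* hypothetical: microseconds since some epoch */
typedef uint64_t lsn_t;		/* hypothetical: a WAL position */

/* What the primary is assumed to track for each causal reads standby. */
typedef struct StandbyLease
{
	lsn_t		applied_lsn;	/* last position the standby reported applied */
	ts_usec		lease_expiry;	/* expiry of the most recently granted lease */
	bool		dropped;		/* removed from the available set? */
} StandbyLease;

/*
 * Primary side: as far as this one standby is concerned, may a commit at
 * commit_lsn return to the client yet?
 */
bool
commit_may_return(const StandbyLease *s, lsn_t commit_lsn, ts_usec now)
{
	assert(s != NULL);

	/* The standby has applied the commit record. */
	if (s->applied_lsn >= commit_lsn)
		return true;

	/*
	 * The standby was dropped and its last lease has expired, so it must
	 * know that it can no longer serve causal reads queries.
	 */
	if (s->dropped && now >= s->lease_expiry)
		return true;

	return false;
}

/*
 * Standby side: a causal reads query may be handled locally only while an
 * unexpired lease is held; otherwise the query should fail with an error.
 */
bool
standby_may_serve_causal_read(ts_usec lease_expiry, ts_usec now)
{
	return now < lease_expiry;
}
```

With a lease expiring at 5000, a standby that has applied only up to
position 100 holds back a commit at position 200 until it either reports
200 as applied or is dropped and the clock reaches 5000.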

I've also attached a couple of things which might be useful when trying the
patch out: test-causal-reads.c which can be used to test performance and
causality under various conditions, and test-causal-reads.sh which can be
used to bring up a primary and a bunch of local hot standbys to talk to.
 (In the hope of encouraging people to take the patch for a spin...)

[1] Originally from a well known 1989 paper on caching, but in the context
of databases and synchronous replication see for example the recent papers
on "Niobe" and "Paxos Quorum Leases" (especially the reference to Google
Megastore).  Of course a *lot* more is going on in those very different
algorithms, but at some level some of those consensus database systems use
"read leases" to allow local-node-only reads for a limited time while
upholding some kind of global consistency guarantee.
I spent a bit of time talking about consistency levels to database guru and
former colleague Alex Scotti who works on a Paxos-based system, and he gave
me the initial idea to try out a lease-based consistency system for
Postgres streaming rep.  It seems like a very useful point in the space of
trade-offs to me.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment: test-causal-reads.sh
Description: Bourne shell script

/*
 * A simple test program to test performance and visibility with the causal
 * reads patch.
 *
 * Each test loop updates a row on the primary, and then optionally checks if
 * it can see that change immediately on a standby.  If you do this with
 * standard async replication, you should occasionally see an assertion fail
 * if run with --check (depending on the vagaries of timing -- I can
 * reproduce this very reliably on my system).  If you do it with traditional
 * sync rep, it becomes a little bit less likely (but it's still reliably
 * reproducible on my system).  If you do it with traditional sync rep set up,
 * and "--synchronous-commit apply" then it should no longer be possible to
 * trigger that assertion, but that's just a straw-man mode.  If you do it
 * with --causal-reads then you should not be able to reproduce it, no matter
 * which standby you connect to.  If you're using --check and the standby gets
 * dropped (perhaps because you break/disconnect/pause it etc) you should
 * never see that assertion fail (= SELECT running but seeing stale data),
 * instead you should see an error when running the SELECT.
 *
 * Arguments:
 *
 *  --primary <connection-string>     how to connect to the primary
 *  --standby <connection-string>     how to connect to the standby to check
 *  --check                           check that the update is visible on standby
 *  --causal-reads                    enable causal reads
 *  --synchronous-commit LEVEL        set synchronous_commit to LEVEL
 *  --loops COUNT                     how many loops to run through
 *  --verbose                         chatter
 */

#include <libpq-fe.h>

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char *argv[])
{
	PGconn *primary;
	PGconn *standby;
	PGresult *result;
	int i;
	int loops = 10000;
	char buffer[1024];
	const char *synchronous_commit = "on";
	bool causal_reads = false;
	const char *primary_connstr = "dbname=postgres port=5432";
	const char *standby_connstr = "dbname=postgres port=5442";
	bool check_applied = false;
	bool verbose = false;

	for (i = 1; i != argc; ++i)
	{
		bool more = (i < argc - 1);

		if (strcmp(argv[i], "--verbose") == 0)
			verbose = true;
		else if (strcmp(argv[i], "--check") == 0)
			check_applied = true;
		else if (strcmp(argv[i], "--synchronous-commit") == 0 && more)
			synchronous_commit = argv[++i];
		else if (strcmp(argv[i], "--causal-reads") == 0)
			causal_reads = true;
		else if (strcmp(argv[i], "--primary") == 0 && more)
			primary_connstr = argv[++i];
		else if (strcmp(argv[i], "--standby") == 0 && more)
			standby_connstr = argv[++i];
		else if (strcmp(argv[i], "--loops") == 0 && more)
			loops = atoi(argv[++i]);
		else
		{
			fprintf(stderr, "bad argument: %s\n", argv[i]);
			exit(1);
		}
	}

	primary = PQconnectdb(primary_connstr);
	assert(PQstatus(primary) == CONNECTION_OK);

	standby = PQconnectdb(standby_connstr);
	assert(PQstatus(standby) == CONNECTION_OK);

	snprintf(buffer, sizeof(buffer), "SET synchronous_commit = %s", synchronous_commit);
	result = PQexec(primary, buffer);
	assert(PQresultStatus(result) == PGRES_COMMAND_OK);
	PQclear(result);
	snprintf(buffer, sizeof(buffer), "SET causal_reads = %s", causal_reads ? "on" : "off");
	result = PQexec(primary, buffer);
	assert(PQresultStatus(result) == PGRES_COMMAND_OK);
	PQclear(result);

	snprintf(buffer, sizeof(buffer), "SET synchronous_commit = %s", synchronous_commit);
	result = PQexec(standby, buffer);
	assert(PQresultStatus(result) == PGRES_COMMAND_OK);
	PQclear(result);
	snprintf(buffer, sizeof(buffer), "SET causal_reads = %s", causal_reads ? "on" : "off");
	result = PQexec(standby, buffer);
	assert(PQresultStatus(result) == PGRES_COMMAND_OK);
	PQclear(result);

	result = PQexec(primary, "CREATE TABLE counter AS SELECT 0 AS n");
	assert(PQresultStatus(result) == PGRES_COMMAND_OK ||
		 strcmp(PQresultErrorField(result, PG_DIAG_SQLSTATE), "42P07") == 0);
	PQclear(result);

	for (i = 0; i < loops; ++i)
	{
		if (verbose)
			printf("Updating primary...\n");
		snprintf(buffer, sizeof(buffer), "UPDATE counter SET n = %d", i);
		result = PQexec(primary, buffer);
		assert(PQresultStatus(result) == PGRES_COMMAND_OK);
		PQclear(result);

		if (check_applied)
		{
			if (verbose)
				printf("Checking standby...\n");
			snprintf(buffer, sizeof(buffer), "SELECT n FROM counter");
			result = PQexec(standby, buffer);
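			if (PQresultStatus(result) != PGRES_TUPLES_OK)
			{
				/* e.g. the standby was dropped and refuses causal reads */
				fprintf(stderr, "SELECT on standby failed: %s",
						PQerrorMessage(standby));
				exit(1);
			}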
			assert(PQntuples(result) == 1);
			assert(atoi(PQgetvalue(result, 0, 0)) == i);
			PQclear(result);
		}
	}
	/* Close both connections cleanly before exiting. */
	PQfinish(standby);
	PQfinish(primary);
	exit(0);
}

Attachment: causal-reads-v3.patch
Description: Binary data
