Heikki Linnakangas wrote:
The possibilities are endless... Your proposal above covers a pretty good set of scenarios, but it's by no means complete. If we try to solve everything the configuration will need to be written in a Turing-complete Replication Description Language. We'll have to pick a useful, easy-to-understand subset that covers the common scenarios. To handle the more exotic scenarios, you can write a proxy that sits in front of the master, and implements whatever rules you wish, with the rules written in C.

I was thinking about this a bit recently. As I see it, there are three fundamental parts of this:

1) We have a transaction that is being committed. The rest of the computations here are all relative to it.

2) There is an (internal?) table that lists the state of each replication target relative to that transaction. It would include the node name, perhaps some metadata ('location' seems the one that's most likely to help with the remote data center issue), and a state code. The codes from http://wiki.postgresql.org/wiki/Streaming_Replication work fine for the last part (which is the only dynamic one--everything else is static data being joined against):

async = hasn't been received yet
recv  = received, but only in RAM so far
fsync = received and synced to disk
apply = applied to the database

These would need to be enums so they can be ordered from lesser to greater consistency.
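
For illustration, here's a minimal sketch of how that ordering could be declared in PostgreSQL; the type and table names (replication_state, replication_progress) are just my invention for this example:

    -- Enum whose declaration order matches increasing consistency,
    -- so comparisons like state >= 'fsync' do the right thing.
    CREATE TYPE replication_state AS ENUM ('async', 'recv', 'fsync', 'apply');

    -- Rough shape of the internal per-transaction progress table.
    CREATE TABLE replication_progress (
        node     text PRIMARY KEY,   -- standby node name
        location text,               -- static metadata, e.g. 'local' or 'remote'
        state    replication_state   -- the dynamic part, updated as acks arrive
    );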

So in a 3 node case, the internal state table might look like this after a bit of data had been committed:

 node | location | state
------+----------+--------
 a    | local    | fsync
 b    | remote   | recv
 c    | remote   | async

This means that the local node has a fully persistent copy, but the best either remote node has done is receive the data; nothing is on disk yet at the remote data center. The commit is still working its way through.
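
To reproduce that state in the sketch schema above (again, purely illustrative):

    INSERT INTO replication_progress VALUES
        ('a', 'local',  'fsync'),
        ('b', 'remote', 'recv'),
        ('c', 'remote', 'async');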

3) The decision about whether the data has been committed to enough places to be considered safe by the master is computed by a function that is passed this internal table, as something like a SRF, and returns a boolean. Once that returns true, saying the policy is satisfied, the transaction commits on the master and continues to percolate out from there. If it returns false, we wait for another state change to come in and return to (2).
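
As a sketch of the shape such a function might take, here is one of the policies listed below (fsync'd on at least one remote node) written against my hypothetical replication_progress table; this isn't meant as the actual mechanism, just what the decision could look like expressed in SQL:

    -- Commit is considered safe once at least one remote node
    -- has the data fsync'd to disk.
    CREATE FUNCTION commit_is_safe() RETURNS boolean AS $$
        SELECT EXISTS (
            SELECT 1 FROM replication_progress
            WHERE location = 'remote' AND state >= 'fsync'
        );
    $$ LANGUAGE sql;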

I would propose that most of the behaviors people have expressed as their desired implementation are possible using this scheme:

- Semi-sync commit: proceed as soon as somebody else has a copy and hope the copies all become consistent:
    EXISTS WHERE state >= 'recv'

- Don't proceed until there's an fsync'd commit on at least one of the remote nodes:
    EXISTS WHERE location = 'remote' AND state >= 'fsync'

- Look for a quorum of n commits of fsync quality:
    CASE WHEN (SELECT count(*) WHERE state >= 'fsync') > n THEN true ELSE false END

The syntax is obviously rough, but I think you can get the drift of what I'm suggesting.
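
As one worked example of what the quorum rule might look like once fleshed out, still against the hypothetical table from above:

    -- True once at least 2 nodes have an fsync-quality copy (n = 2 here).
    SELECT count(*) >= 2 AS safe
    FROM replication_progress
    WHERE state >= 'fsync';

With the example 3-node state above, only node a qualifies, so this returns false and the master would keep waiting.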

While exposing the local state and running this computation isn't free, in situations where there truly are remote nodes being communicated with, the network overhead is going to dwarf that cost. If there were a fast path for the simplest cases and this more complicated one for the rest, I think you could get the fully programmable behavior some people want using simple SQL, rather than having to write a new "Replication Description Language" or something so ambitious. This data about what's been replicated where looks an awful lot like a set of rows you can operate on using features already in the database to me.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us

