On 02/06/10 10:22, Greg Smith wrote:
Heikki Linnakangas wrote:
The possibilities are endless... Your proposal above covers a pretty
good set of scenarios, but it's by no means complete. If we try to
solve everything the configuration will need to be written in a
Turing-complete Replication Description Language. We'll have to pick a
useful, easy-to-understand subset that covers the common scenarios. To
handle the more exotic scenarios, you can write a proxy that sits in
front of the master, and implements whatever rules you wish, with the
rules written in C.

I was thinking about this a bit recently. As I see it, there are three
fundamental parts of this:

1) We have a transaction that is being committed. The rest of the
computations here are all relative to it.

Agreed.

So in a 3 node case, the internal state table might look like this after
a bit of data had been committed:

node | location | state
----------------------------------
a    | local    | fsync
b    | remote   | recv
c    | remote   | async

This means that the local node has a fully persistent copy, but the best
either remote node has done is receive the data; it's not on disk at all
yet at the remote data center. It's still working its way through.

3) The decision about whether the data has been committed to enough
places to be considered safe by the master is computed by a function
that is passed this internal table as something like a SRF, and it
returns a boolean. Once that returns true, saying it's satisfied, the
transaction closes on the master and continues to percolate out from
there. If it's false, we wait for another state change to come in and
return to (2).
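
The decision step in (3) can be sketched as a plain function over the state
table. This is only an illustration, written in Python rather than the SQL or
C a real implementation would use. The names NodeState, DURABILITY and
is_commit_safe, the ordering of the states, and the example policy (local
fsync plus at least one remote node at recv or better) are all assumptions
made for the sketch, not anything proposed in this thread.

```python
from dataclasses import dataclass

# Assumed ordering of durability levels, weakest first (hypothetical).
DURABILITY = {"async": 0, "recv": 1, "fsync": 2}


@dataclass(frozen=True)
class NodeState:
    node: str
    location: str  # "local" or "remote"
    state: str     # one of the keys of DURABILITY


def is_commit_safe(table):
    """Example policy: local fsync, and some remote node at recv or better."""
    local_ok = any(
        r.location == "local" and r.state == "fsync" for r in table
    )
    remote_ok = any(
        r.location == "remote" and DURABILITY[r.state] >= DURABILITY["recv"]
        for r in table
    )
    return local_ok and remote_ok


# The three-node table from the example above.
table = [
    NodeState("a", "local", "fsync"),
    NodeState("b", "remote", "recv"),
    NodeState("c", "remote", "async"),
]
print(is_commit_safe(table))
```

Under this made-up policy the master would call the function again after each
state change and release the commit the first time it returns true.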

You can't implement "wait for X to ack the commit, but if that doesn't
happen in Y seconds, time out and return true anyway" with that.
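
To make the limitation concrete: a function that sees only the state table
has no notion of elapsed time, so a timeout has to be passed in as an extra
input. A minimal sketch, assuming hypothetical waiting_since and timeout_s
parameters that the proposal above does not include, and using node "b" as
the node whose ack is awaited:

```python
import time


def decide(table, waiting_since, timeout_s, now=None):
    """Return True once node 'b' has acked, or once timeout_s seconds passed.

    table is a list of (node, location, state) tuples. waiting_since is a
    time.monotonic() timestamp taken when the commit started waiting.
    now can be supplied explicitly, which makes the function easy to test.
    """
    if now is None:
        now = time.monotonic()
    acked = any(
        node == "b" and state in ("recv", "fsync")
        for node, _location, state in table
    )
    timed_out = (now - waiting_since) >= timeout_s
    return acked or timed_out
```

Even with the extra inputs, the function would have to be re-evaluated when
the timer expires, not only when a state change comes in, so the calling loop
needs to change as well.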

While exposing the local state and running this computation isn't free,
in situations where there truly are remote nodes being communicated
with, the network overhead is going to dwarf that. If there were a fast
path for the simplest cases and this complicated one for the rest, I
think you could get the fully programmable behavior some people want
using simple SQL, rather than having to write a new "Replication
Description Language" or something so ambitious. This data about what's
been replicated to where looks to me an awful lot like a set of rows you
can operate on using features already in the database.
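
As a rough illustration of treating replication state as rows, here is a
commit policy written as an ordinary SQL query. It runs against an in-memory
SQLite database from Python purely so the sketch is self-contained. The table
name replication_state and the policy itself are made up for the example, and
nothing here reflects an actual PostgreSQL interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE replication_state (node TEXT, location TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO replication_state VALUES (?, ?, ?)",
    [
        ("a", "local", "fsync"),
        ("b", "remote", "recv"),
        ("c", "remote", "async"),
    ],
)

# Hypothetical policy: satisfied once at least one remote node has at
# least received the data.
POLICY = """
SELECT COUNT(*) >= 1
FROM replication_state
WHERE location = 'remote' AND state IN ('recv', 'fsync')
"""

satisfied = bool(conn.execute(POLICY).fetchone()[0])
print(satisfied)
```

Changing the policy then means changing a query, which is the appeal of
reusing SQL here instead of inventing a separate description language.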

Yeah, if we want to provide full control over when a commit is
acknowledged to the client, there's certainly no reason we can't expose
that using a hook or something.

It's pretty scary to call a user-defined function at that point in the
transaction. Even if we document that you must refrain from doing nasty
stuff like modifying tables in that function, it's still scary.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
