> On 31 Jul 2017, at 20:03, Robert Haas <robertmh...@gmail.com> wrote:
> 
> Regardless of whether we share XIDs or DXIDs, we need a more complex
> concept of transaction state than we have now.

It seems the discussion has shifted from 2PC itself to general issues with
distributed transactions, so it is probably appropriate to share a summary of
the work we have done in the area of distributed visibility. Over the last two
years we tried three quite different approaches and finally settled on Clock-SI.

To test different approaches, we first wrote a small patch that wraps calls to
visibility-related functions (SetTransactionStatus, GetSnapshot, etc.; described
in detail on the wiki [1]) so that they can be overridden from an extension.
Such an approach makes it possible to implement almost anything related to
distributed visibility, since it gives full control over how local visibility
is done. That API isn't a hard prerequisite, and anyone who wants to create a
concrete implementation can do it in place. However, I think it is good to have
such an API in some form.
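To illustrate the shape of such an API (this is a hypothetical sketch in Python for brevity, not the actual XTM C interface, which is a table of function pointers inside the server; all names here are made up):

```python
# Hypothetical sketch of an overridable visibility API: the server calls
# through one indirection point, and an "extension" can swap the whole
# manager to change visibility behaviour everywhere at once.

class LocalTM:
    """Default single-node behaviour (stand-in for stock PostgreSQL)."""
    def get_snapshot(self):
        return "local-snapshot"

    def set_transaction_status(self, xid, status):
        return ("local", xid, status)

class DistributedTM(LocalTM):
    """An extension overrides only the calls it needs."""
    def get_snapshot(self):
        # e.g. fetched from a central coordinator instead of locally
        return "distributed-snapshot"

current_tm = LocalTM()
assert current_tm.get_snapshot() == "local-snapshot"

# Loading the extension amounts to replacing the manager.
current_tm = DistributedTM()
assert current_tm.get_snapshot() == "distributed-snapshot"
assert current_tm.set_transaction_status(42, "committed") == ("local", 42, "committed")
```

The point is only the indirection: everything visibility-related funnels through one replaceable object, so an extension gets full control without patching the callers.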

So three approaches that we tried:

1) Postgres-XL-like:

That is the most straightforward way. Basically, we need a separate network
service (GTM/DTM) that is responsible for xid generation and for maintaining
the running list of transactions, so acquiring an xid or a snapshot is done via
network calls. Because of the shared xid space, xids can be compared in the
ordinary way and yield the right order. The gap between non-simultaneous 2PC
commits is covered by the fact that we get our snapshots from the GTM, and it
removes an xid from the running list only once the transaction has committed on
all participating nodes.

Such an approach is okay for OLAP-style transactions where the tps isn't high.
But for OLTP with a high transaction rate, the GTM immediately becomes a
bottleneck, since even write transactions need to get their snapshot from the
GTM — even if they access only one node.
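A toy single-process model of that scheme (names are illustrative, not Postgres-XL's actual API; in the real system every begin/snapshot call is a network round trip, which is exactly the bottleneck):

```python
# Toy stand-in for a GTM: hands out xids from a shared sequence and keeps
# the running-transaction list; a snapshot is a copy of that list.

class ToyGTM:
    def __init__(self):
        self.next_xid = 1
        self.running = set()
        self.pending_nodes = {}   # xid -> nodes that haven't committed yet

    def begin(self, nodes):
        xid = self.next_xid
        self.next_xid += 1
        self.running.add(xid)
        self.pending_nodes[xid] = set(nodes)
        return xid

    def snapshot(self):
        return set(self.running)

    def node_committed(self, xid, node):
        self.pending_nodes[xid].discard(node)
        # The xid leaves the running list only once ALL nodes have
        # committed, hiding the gap between non-simultaneous 2PC commits.
        if not self.pending_nodes[xid]:
            self.running.discard(xid)

gtm = ToyGTM()
t1 = gtm.begin(nodes={"A", "B"})
gtm.node_committed(t1, "A")
assert t1 in gtm.snapshot()       # node B hasn't committed: still running
gtm.node_committed(t1, "B")
assert t1 not in gtm.snapshot()   # committed everywhere: gone from snapshots
```

Concurrent observers taking snapshots between the two per-node commits still see t1 as running, so they never observe a half-committed transaction.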


2) Incremental SI [2]

An approach with a central coordinator that allows local reads without network
communication by slightly altering the visibility rules.

Apart from the fact that it is patented, we also failed to achieve proper
visibility when implementing the algorithms from that paper: they always showed
some inconsistencies, perhaps because of bugs in our implementation, perhaps
because of typos or mistakes in the algorithm description itself. The reasoning
in the paper wasn't very clear to us, and neither was the patent situation, so
we set it aside.


3) Clock-SI [3]

It is an MS Research paper that describes an algorithm similar to the ones used
in Spanner and CockroachDB, without a central GTM and with reads that do not
require a network round trip.

There are two ideas behind it:

* Assuming snapshot isolation where visibility on a node is based on CSNs, use
the local time as the CSN. Then, when doing 2PC, collect the prepare time from
all participating nodes and commit the transaction everywhere with the maximum
of those times. If during a read a node sees tuples committed by a transaction
with a CSN greater than the snapshot's CSN (which can happen due to clock
desynchronization between nodes), it simply waits until that time has come. So
clock desynchronization can affect performance, but it can't affect correctness.

* During a distributed commit, a transaction is neither running (if it commits,
the tuple should already be visible) nor committed/aborted (it can still be
aborted, so it is illegal to read). So an IN-DOUBT transaction state appears,
in which readers have to wait for writers.
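The two rules above can be sketched as follows (a simplified model with integer "clocks" and waits reported as a value rather than performed; this is not the paper's full algorithm, just its core decisions):

```python
# Toy Clock-SI sketch: rule 1 picks the distributed commit CSN, rule 2
# decides what a reader does when it encounters a tuple.

def commit_csn(prepare_times):
    # Rule 1: the commit timestamp is the maximum of the prepare times
    # collected from all participating nodes during 2PC.
    return max(prepare_times)

def read_decision(tuple_csn, tuple_state, snapshot_csn):
    # Rule 2: an IN-DOUBT writer forces the reader to wait, because the
    # tuple is neither safely visible nor safely invisible yet.
    if tuple_state == "IN-DOUBT":
        return "WAIT"
    # A committed tuple from the "future" (clock skew) also makes the
    # reader wait until its local clock passes tuple_csn.
    if tuple_state == "COMMITTED" and tuple_csn > snapshot_csn:
        return "WAIT"
    return tuple_state == "COMMITTED" and tuple_csn <= snapshot_csn

csn = commit_csn([105, 98, 110])
assert csn == 110                                     # max of prepare times
assert read_decision(csn, "IN-DOUBT", 120) == "WAIT"  # wait for the writer
assert read_decision(csn, "COMMITTED", 105) == "WAIT" # skewed reader waits
assert read_decision(csn, "COMMITTED", 115) is True   # safely visible
```

Since waiting only ever delays a read, clock skew degrades latency but never lets a reader see a tuple its snapshot shouldn't include — which is the correctness argument sketched above.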

We managed to implement that using the aforementioned XTM API; the XID<->CSN
mapping is maintained by the extension itself. Speed and scalability are also
good.

I want to resubmit the implementation of that algorithm for FDWs later in
August, along with some isolation tests based on the set of queries in [4].


[1] https://wiki.postgresql.org/wiki/DTM#eXtensible_Transaction_Manager_API
[2] http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf
[3] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
[4] https://github.com/ept/hermitage


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



