Re: Global snapshots

2021-11-19 Thread Andrey V. Lepikhov

The patch in the previous letter is full of faults. Please use the new version.
Also, here we fixed the problem with losing the CSN value in a parallel
worker (TAP test 003_parallel_safe.pl). Thanks to a.pyhalov for detecting
the problem and providing a bugfix.


--
regards,
Andrey Lepikhov
Postgres Professional
From 7aa57724fc42b8ca7054f9b6edfa33c0cffb24bf Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 17 Nov 2021 11:13:37 +0500
Subject: [PATCH] Add Commit Sequence Number (CSN) machinery into MVCC
 implementation for a timestamp-based resolving of visibility conflicts.

It allows us to achieve proper snapshot isolation semantics in the case
of distributed transactions involving more than one Postgres instance.

Authors: K.Knizhnik, S.Kelvich, A.Sher, A.Lepikhov, M.Usama.

Discussion:
(2020/05/21 -)
https://www.postgresql.org/message-id/flat/CA%2Bfd4k6HE8xLGEvqWzABEg8kkju5MxU%2Bif7bf-md0_2pjzXp9Q%40mail.gmail.com#ed1359340871688bed2e643921f73365
(2018/05/01 - 2019/04/21)
https://www.postgresql.org/message-id/flat/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
---
 doc/src/sgml/config.sgml  |  50 +-
 src/backend/access/rmgrdesc/Makefile  |   1 +
 src/backend/access/rmgrdesc/csnlogdesc.c  |  95 +++
 src/backend/access/rmgrdesc/xlogdesc.c|   6 +-
 src/backend/access/transam/Makefile   |   2 +
 src/backend/access/transam/csn_log.c  | 748 ++
 src/backend/access/transam/csn_snapshot.c | 687 
 src/backend/access/transam/rmgr.c |   1 +
 src/backend/access/transam/twophase.c | 154 
 src/backend/access/transam/varsup.c   |   2 +
 src/backend/access/transam/xact.c |  32 +
 src/backend/access/transam/xlog.c |  23 +-
 src/backend/access/transam/xloginsert.c   |   2 +
 src/backend/commands/vacuum.c |   3 +-
 src/backend/replication/logical/snapbuild.c   |   4 +
 src/backend/storage/ipc/ipci.c|   6 +
 src/backend/storage/ipc/procarray.c   |  85 ++
 src/backend/storage/lmgr/lwlock.c |   2 +
 src/backend/storage/lmgr/lwlocknames.txt  |   2 +
 src/backend/storage/lmgr/proc.c   |   6 +
 src/backend/storage/sync/sync.c   |   5 +
 src/backend/utils/misc/guc.c  |  37 +
 src/backend/utils/probes.d|   2 +
 src/backend/utils/time/snapmgr.c  | 183 -
 src/bin/initdb/initdb.c   |   3 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +
 src/bin/pg_upgrade/pg_upgrade.c   |   5 +
 src/bin/pg_upgrade/pg_upgrade.h   |   2 +
 src/bin/pg_waldump/rmgrdesc.c |   1 +
 src/include/access/csn_log.h  |  98 +++
 src/include/access/csn_snapshot.h |  54 ++
 src/include/access/rmgrlist.h |   1 +
 src/include/access/xlog_internal.h|   2 +
 src/include/catalog/pg_control.h  |   1 +
 src/include/catalog/pg_proc.dat   |  17 +
 src/include/datatype/timestamp.h  |   3 +
 src/include/fmgr.h|   1 +
 src/include/portability/instr_time.h  |  10 +
 src/include/storage/lwlock.h  |   1 +
 src/include/storage/proc.h|  14 +
 src/include/storage/procarray.h   |   7 +
 src/include/storage/sync.h|   1 +
 src/include/utils/snapmgr.h   |   7 +-
 src/include/utils/snapshot.h  |  11 +
 src/test/modules/Makefile |   1 +
 src/test/modules/csnsnapshot/Makefile |  22 +
 .../csnsnapshot/expected/csnsnapshot.out  |   1 +
 src/test/modules/csnsnapshot/t/001_base.pl| 100 +++
 src/test/modules/csnsnapshot/t/002_standby.pl |  68 ++
 .../csnsnapshot/t/003_parallel_safe.pl|  67 ++
 src/test/modules/snapshot_too_old/sto.conf|   1 +
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  28 +
 src/test/regress/expected/sysviews.out|   4 +-
 53 files changed, 2660 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/csnlogdesc.c
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/backend/access/transam/csn_snapshot.c
 create mode 100644 src/include/access/csn_log.h
 create mode 100644 src/include/access/csn_snapshot.h
 create mode 100644 src/test/modules/csnsnapshot/Makefile
 create mode 100644 src/test/modules/csnsnapshot/expected/csnsnapshot.out
 create mode 100644 src/test/modules/csnsnapshot/t/001_base.pl
 create mode 100644 src/test/modules/csnsnapshot/t/002_standby.pl
 create mode 100644 src/test/modules/csnsnapshot/t/003_parallel_safe.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5..f4f6c83fd0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9682,8 +9682,56 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
  
 
  

Re: Global snapshots

2021-11-17 Thread Andrey V. Lepikhov
Here is the next version of the CSN implementation in snapshots, to achieve
proper snapshot isolation in the case of a cross-instance distributed transaction.


--
regards,
Andrey Lepikhov
Postgres Professional
From bbb7dd1d7621c091f11e697d3d894fe7a36918a6 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Wed, 17 Nov 2021 11:13:37 +0500
Subject: [PATCH] Add Commit Sequence Number (CSN) machinery into MVCC
 implementation for a timestamp-based resolving of visibility conflicts.

It allows us to achieve proper snapshot isolation semantics in the case
of distributed transactions involving more than one Postgres instance.

Authors: K.Knizhnik, S.Kelvich, A.Sher, A.Lepikhov, M.Usama.

Discussion:
(2020/05/21 -)
https://www.postgresql.org/message-id/flat/CA%2Bfd4k6HE8xLGEvqWzABEg8kkju5MxU%2Bif7bf-md0_2pjzXp9Q%40mail.gmail.com#ed1359340871688bed2e643921f73365
(2018/05/01 - 2019/04/21)
https://www.postgresql.org/message-id/flat/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
---
 doc/src/sgml/config.sgml  |  50 +-
 src/backend/access/rmgrdesc/Makefile  |   1 +
 src/backend/access/rmgrdesc/csnlogdesc.c  |  95 +++
 src/backend/access/rmgrdesc/xlogdesc.c|   6 +-
 src/backend/access/transam/Makefile   |   2 +
 src/backend/access/transam/csn_log.c  | 748 ++
 src/backend/access/transam/csn_snapshot.c | 687 
 src/backend/access/transam/rmgr.c |   1 +
 src/backend/access/transam/twophase.c | 154 
 src/backend/access/transam/varsup.c   |   2 +
 src/backend/access/transam/xact.c |  32 +
 src/backend/access/transam/xlog.c |  23 +-
 src/backend/access/transam/xloginsert.c   |   2 +
 src/backend/commands/vacuum.c |   3 +-
 src/backend/storage/ipc/ipci.c|   6 +
 src/backend/storage/ipc/procarray.c   |  85 ++
 src/backend/storage/lmgr/lwlock.c |   2 +
 src/backend/storage/lmgr/lwlocknames.txt  |   2 +
 src/backend/storage/lmgr/proc.c   |   6 +
 src/backend/storage/sync/sync.c   |   5 +
 src/backend/utils/misc/guc.c  |  37 +
 src/backend/utils/probes.d|   2 +
 src/backend/utils/time/snapmgr.c  | 149 +++-
 src/bin/initdb/initdb.c   |   3 +-
 src/bin/pg_controldata/pg_controldata.c   |   2 +
 src/bin/pg_upgrade/pg_upgrade.c   |   5 +
 src/bin/pg_upgrade/pg_upgrade.h   |   2 +
 src/bin/pg_waldump/rmgrdesc.c |   1 +
 src/include/access/csn_log.h  |  98 +++
 src/include/access/csn_snapshot.h |  54 ++
 src/include/access/rmgrlist.h |   1 +
 src/include/access/xlog_internal.h|   2 +
 src/include/catalog/pg_control.h  |   1 +
 src/include/catalog/pg_proc.dat   |  17 +
 src/include/datatype/timestamp.h  |   3 +
 src/include/fmgr.h|   1 +
 src/include/portability/instr_time.h  |  10 +
 src/include/storage/lwlock.h  |   1 +
 src/include/storage/proc.h|  14 +
 src/include/storage/procarray.h   |   7 +
 src/include/storage/sync.h|   1 +
 src/include/utils/snapmgr.h   |   7 +-
 src/include/utils/snapshot.h  |  11 +
 src/test/modules/Makefile |   1 +
 src/test/modules/csnsnapshot/Makefile |  25 +
 .../modules/csnsnapshot/csn_snapshot.conf |   1 +
 .../csnsnapshot/expected/csnsnapshot.out  |   1 +
 src/test/modules/csnsnapshot/t/001_base.pl| 103 +++
 src/test/modules/csnsnapshot/t/002_standby.pl |  66 ++
 .../modules/csnsnapshot/t/003_time_skew.pl| 214 +
 .../csnsnapshot/t/004_read_committed.pl   |  97 +++
 .../csnsnapshot/t/005_basic_visibility.pl | 181 +
 src/test/modules/snapshot_too_old/sto.conf|   1 +
 src/test/regress/expected/sysviews.out|   4 +-
 54 files changed, 3024 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/csnlogdesc.c
 create mode 100644 src/backend/access/transam/csn_log.c
 create mode 100644 src/backend/access/transam/csn_snapshot.c
 create mode 100644 src/include/access/csn_log.h
 create mode 100644 src/include/access/csn_snapshot.h
 create mode 100644 src/test/modules/csnsnapshot/Makefile
 create mode 100644 src/test/modules/csnsnapshot/csn_snapshot.conf
 create mode 100644 src/test/modules/csnsnapshot/expected/csnsnapshot.out
 create mode 100644 src/test/modules/csnsnapshot/t/001_base.pl
 create mode 100644 src/test/modules/csnsnapshot/t/002_standby.pl
 create mode 100644 src/test/modules/csnsnapshot/t/003_time_skew.pl
 create mode 100644 src/test/modules/csnsnapshot/t/004_read_committed.pl
 create mode 100644 src/test/modules/csnsnapshot/t/005_basic_visibility.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3f806740d5..f4f6c83fd0 100644
-

RE: Global snapshots

2021-03-24 Thread tsunakawa.ta...@fujitsu.com
From: Andrey V. Lepikhov 
> Current state of the patch set rebased on master, 5aed6a1fc2.
> 
> It is a development version. Some problems with visibility are still
> detected in two tests:
> 1. CSN Snapshot module - TAP test on time skew.
> 2. Clock-SI implementation - TAP test emulating a bank transaction.

I'm sorry for the late response.  Thank you for the update.

As discussed at the HighGo meeting, what do you think we should do about this
patch set, now that we agree that Clock-SI is covered by Microsoft's patent?
I'd appreciate it if you could share some ideas for changing part of the
algorithm to circumvent the patent.

Otherwise, why don't we discuss alternatives, such as Commitment Ordering?

I have a hunch that YugabyteDB's method is promising; I wrote about it in the
following wiki.  Of course, we should make an effort to see whether it's
patented before diving deeper into the design or implementation.

Scaleout Design - PostgreSQL wiki
https://wiki.postgresql.org/wiki/Scaleout_Design


Regards
Takayuki Tsunakawa




RE: Global snapshots

2021-02-25 Thread tsunakawa.ta...@fujitsu.com
From: Andrey V. Lepikhov 
> After researching in depth, I think this is a real problem.
> My idea was that we are not using real clocks; we only use clock ticks to
> measure time intervals. That can also be interpreted as a kind of clock, though.

Yes, patent claims tend to be written to allow broad interpretation.  That's
sad.


> What we can do:
> 1. Use global clocks at the start of a transaction.
> 2. Use CSN-based snapshots as the machinery and create an extension to allow
> user-defined commit protocols.

Is this your suggestion to circumvent the patent?  Sorry, I'm afraid I can't 
understand it yet (I have to study more.)  I hope others will comment on this.


Regards
Takayuki Tsunakawa




Re: Global snapshots

2021-02-25 Thread Andrey V. Lepikhov

On 1/1/21 8:14 AM, tsunakawa.ta...@fujitsu.com wrote:

--
11. A method comprising:
receiving information relating to a distributed database transaction operating 
on data in data stores associated with respective participating nodes 
associated with the distributed database transaction;
requesting commit time votes from the respective participating nodes, the 
commit time votes reflecting local clock values of the respective participating 
nodes;
receiving the commit time votes from the respective participating nodes in 
response to the requesting;
computing a global commit timestamp for the distributed database transaction 
based at least in part on the commit time votes, the global commit timestamp 
reflecting a maximum value of the commit time votes received from the 
respective participating nodes; and
synchronizing commitment of the distributed database transaction at the 
respective participating nodes to the global commit timestamp,
wherein at least the computing is performed by a computing device.


Thank you for this analysis of the patent.
After researching in depth, I think this is a real problem.
My idea was that we are not using real clocks; we only use clock ticks
to measure time intervals. That can also be interpreted as a kind of clock, though.


What we can do:
1. Use global clocks at the start of a transaction.
2. Use CSN-based snapshots as the machinery and create an extension to
allow user-defined commit protocols.


--
regards,
Andrey Lepikhov
Postgres Professional




RE: Global snapshots

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
Hello, Andrey-san, all,


Based on the request at HighGo's sharding meeting, I'm re-sending the
information on Commitment Ordering that could be used for global visibility.
The related patents have already expired.



--
Has anyone examined the following Multiversion Commitment Ordering (MVCO)?
Although I haven't understood it yet, it insists that no concurrency control
information, including timestamps, needs to be exchanged among the cluster
nodes.  I'd appreciate it if someone could give an opinion.



Commitment Ordering Based Distributed Concurrency Control for Bridging Single 
and Multi Version Resources.
 Proceedings of the Third IEEE International Workshop on Research Issues on 
Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, 
Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924



The author of the above paper, Yoav Raz, seems to have had a strong passion,
at least until 2011, for convincing people of the power of Commitment Ordering
(CO) for global serializability.  However, he complains (sadly) that almost
all researchers ignore his theory, as written on his site and the Wikipedia
page for Commitment Ordering below.  Does anyone know why CO is ignored?


--
* Or maybe we can use the following Commitment Ordering, which doesn't require
timestamps or any other information to be transferred among the cluster
nodes.  However, it seems to have to track the order of read and write
operations among concurrent transactions to ensure the correct commit order,
so I'm not sure about the performance.  The MVCO paper seems to present the
information we need, but I haven't understood it well yet (it's difficult).
Could anybody kindly interpret this?



Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co



--
Could you please try interpreting MVCO and see if we have any hope in this?  
This doesn't fit in my small brain.  I'll catch up with understanding this when 
I have time.



MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf



MVCO is a multiversion member of the family of Commitment Ordering algorithms described below:



Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co



Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering



Related patents are as follows.  The last one is MVCO.



US5504900A - Commitment ordering for guaranteeing serializability across 
distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900



US5504899A - Guaranteeing global serializability by applying commitment 
ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899



US5701480A - Distributed multi-version commitment ordering protocols for 
guaranteeing serializability during transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480


Regards
Takayuki Tsunakawa



Re: Global snapshots

2021-01-08 Thread Fujii Masao




On 2021/01/01 12:14, tsunakawa.ta...@fujitsu.com wrote:

Hello,


Fujii-san and I discussed how to move the scale-out development forward.  We
are both worried that Clock-SI is (highly?) likely to infringe the said
Microsoft patent.  So we agreed that we are going to investigate Clock-SI and
the patent, and if we have to conclude that we cannot embrace Clock-SI, we
will explore other possibilities.


Yes.




IMO, it seems that Clock-SI overlaps with the patent and we can't use it.  First,
looking back at how to interpret a patent document, the patent "claims" are what
we should pay the greatest attention to.  According to the following citation
from the IP guide by the Software Freedom Law Center (SFLC) [1], software
infringes a patent if it implements everything in any one claim, not all claims.


--
4.2 Patent Infringement
To prove that you infringe a patent, the patent holder must show that you
make, use, offer to sell, or sell the invention as it is defined in at least
one claim of the patent.

For software to infringe a patent, the software essentially must implement
everything recited in one of the patent's claims. It is crucial to recognize
that infringement is based directly on the claims of the patent, and not on
what is stated or described in other parts of the patent document.
--


And Clock-SI implements at least claims 11 and 20, cited below.  It doesn't
matter whether Clock-SI uses a physical clock or a logical one.


Thanks for sharing the result of your investigation!

Regarding at least claim 11, I reached the same conclusion. As far as
I understand, Clock-SI actually performs the method described
in claim 11 when determining the commit time and doing the commit
on each node.

I don't intend to disparage Clock-SI or any activities based on it. OTOH,
I'm now wondering whether it's worth considering another approach for global
transaction support, while I'm still interested in Clock-SI technically.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Global snapshots

2020-12-31 Thread tsunakawa.ta...@fujitsu.com
Hello,


Fujii-san and I discussed how to move the scale-out development forward.  We
are both worried that Clock-SI is (highly?) likely to infringe the said
Microsoft patent.  So we agreed that we are going to investigate Clock-SI and
the patent, and if we have to conclude that we cannot embrace Clock-SI, we
will explore other possibilities.

IMO, it seems that Clock-SI overlaps with the patent and we can't use it.  First,
looking back at how to interpret a patent document, the patent "claims" are what
we should pay the greatest attention to.  According to the following citation
from the IP guide by the Software Freedom Law Center (SFLC) [1], software
infringes a patent if it implements everything in any one claim, not all claims.


--
4.2 Patent Infringement
To prove that you infringe a patent, the patent holder must show that you
make, use, offer to sell, or sell the invention as it is defined in at least
one claim of the patent.

For software to infringe a patent, the software essentially must implement
everything recited in one of the patent's claims. It is crucial to recognize
that infringement is based directly on the claims of the patent, and not on
what is stated or described in other parts of the patent document.
--


And Clock-SI implements at least claims 11 and 20, cited below.  It doesn't
matter whether Clock-SI uses a physical clock or a logical one.


--
11. A method comprising:
receiving information relating to a distributed database transaction operating 
on data in data stores associated with respective participating nodes 
associated with the distributed database transaction;
requesting commit time votes from the respective participating nodes, the 
commit time votes reflecting local clock values of the respective participating 
nodes;
receiving the commit time votes from the respective participating nodes in 
response to the requesting;
computing a global commit timestamp for the distributed database transaction 
based at least in part on the commit time votes, the global commit timestamp 
reflecting a maximum value of the commit time votes received from the 
respective participating nodes; and
synchronizing commitment of the distributed database transaction at the 
respective participating nodes to the global commit timestamp,
wherein at least the computing is performed by a computing device.

20. A method for managing a distributed database transaction, the method 
comprising:
receiving information relating to the distributed database transaction from a 
transaction coordinator associated with the distributed database transaction;
determining a commit time vote for the distributed database transaction based 
at least in part on a local clock;
communicating the commit time vote for the distributed database transaction to 
the transaction coordinator;
receiving a global commit timestamp from the transaction coordinator;
synchronizing commitment of the distributed database transaction to the global 
commit timestamp;
receiving a remote request from a requesting database node corresponding to the 
distributed database transaction;
creating a local transaction corresponding to the distributed database 
transaction;
compiling a list of database nodes involved in generating a result of the local 
transaction and access types utilized by respective database nodes in the list 
of database nodes; and
returning the list of database nodes and the access types to the requesting 
database node in response to the remote request,
wherein at least the compiling is performed by a computing device.
--
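
To make the mechanism concrete, here is a minimal C sketch -- my own
illustration only, with plain integers standing in for clock values; nothing
below comes from the patent text or from any patch -- of the computation that
claim 11 describes: gather commit time votes, take their maximum, and use it
as the global commit timestamp.

#include <stdio.h>

#define NNODES 3

/* Stand-ins for the participants' local clock values at prepare time. */
static long
request_commit_time_vote(int node)
{
    static const long local_clocks[NNODES] = {105, 98, 112};

    return local_clocks[node];
}

int
main(void)
{
    long global_commit_ts = 0;

    /* Collect the commit time votes and compute their maximum. */
    for (int node = 0; node < NNODES; node++)
    {
        long vote = request_commit_time_vote(node);

        if (vote > global_commit_ts)
            global_commit_ts = vote;
    }

    /* Commitment at every participant is then synchronized to this value. */
    printf("global commit timestamp: %ld\n", global_commit_ts);
    return 0;
}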


My concern is that the above claims appear to cover a somewhat broad range.  I
wonder whether other patents or unpatented technologies overlap with this kind
of description.

Thoughts?


[1]
A Legal Issues Primer for Open Source and Free Software Projects
https://www.softwarefreedom.org/resources/2008/foss-primer.pdf

[2]
US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


Regards
Takayuki Tsunakawa



RE: Global snapshots

2020-10-28 Thread tsunakawa.ta...@fujitsu.com
Fujii-san, Sawada-san, all,

From: Fujii Masao 
> Yeah, so if we need to guarantee transaction linearizability even
> in a distributed environment (probably yes, right?), using only Clock-SI
> is not enough. We would need to implement something more
> in addition to Clock-SI, or adopt a different approach. Thoughts?

Could you please try interpreting MVCO and see if we have any hope in this?  
This doesn't fit in my small brain.  I'll catch up with understanding this when 
I have time.

MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf


MVCO is a multiversion member of the family of Commitment Ordering algorithms described below:

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering


Related patents are as follows.  The last one is MVCO.

US5504900A - Commitment ordering for guaranteeing serializability across 
distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900

US5504899A - Guaranteeing global serializability by applying commitment 
ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899

US5701480A - Distributed multi-version commitment ordering protocols for 
guaranteeing serializability during transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480


Regards
Takayuki Tsunakawa




Re: Global snapshots

2020-10-28 Thread Fujii Masao




On 2020/10/23 11:58, Masahiko Sawada wrote:

On Thu, 15 Oct 2020 at 01:41, Fujii Masao  wrote:




On 2020/09/17 15:56, Amit Kapila wrote:

On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao  wrote:



One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before as the title "eXtensible Transaction Manager API".


Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
people said it's not good...


But it may be worth revisiting this idea if we cannot avoid the patent issue.



It is not very clear what exactly we can do about the point raised by
Tsunakawa-san related to the patent in this technology, as I haven't seen
that discussed during other development, but maybe we can try to study
it a bit. One more thing I would like to bring up here is that it seems
there have been some concerns about this idea when it was originally
discussed [1]. It is not very clear to me if all the concerns have been
addressed or not. If someone can summarize the concerns discussed and how
the latest patch is able to address them, that will be great.


I have one concern about Clock-SI (sorry if this concern was already
discussed in the past). As far as I read the paper about Clock-SI, ISTM that
Tx2 that starts after Tx1's commit can fail to see the results by Tx1,
due to the clock skew. Please see the following example;

1. Tx1 starts at the server A.

2. Tx1 writes some records at the server A.

3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
   the commit at the server A.
   This means that Tx1 is the local transaction, not distributed one.

4. Tx2 starts at the server B, i.e., the server B works as
   the coordinator node for Tx2.

5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
   due to clock skew) and uses 10 as SnapshotTime at the server B.

6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.

7. Tx2 doesn't need to wait due to clock skew because the imported
   SnapshotTime 10 is smaller than the local clock at the server A.

8. Tx2 fails to see the records written by Tx1 at the server A because
   Tx1's CommitTime 20 is larger than SnapshotTime 10.

So Tx1 was successfully committed before Tx2 started. But in the above example,
the subsequent transaction Tx2 fails to see the committed results.

A single PostgreSQL instance seems to guarantee linearizability of the
transactions, but Clock-SI doesn't in the distributed environment. Is my
understanding right? Or am I missing something?

If my understanding is right, shouldn't we address that issue when using
Clock-SI? Or has the patch already addressed the issue?


As far as I read the paper, the above scenario can happen. I could
reproduce the above scenario with the patch. Moreover, a stale read
could happen even if Tx1 was initiated at server B (i.e., both
transactions started at the same server in sequence). In this case,
Tx1's commit timestamp would be 20, taken from server A's local clock,
whereas Tx2's snapshot timestamp would be 10, the same as in the above case.
Therefore, even though both transactions were initiated at the same
server, linearizability is not provided.


Yeah, so if we need to guarantee transaction linearizability even
in a distributed environment (probably yes, right?), using only Clock-SI
is not enough. We would need to implement something more
in addition to Clock-SI, or adopt a different approach. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Global snapshots

2020-10-22 Thread Masahiko Sawada
On Thu, 15 Oct 2020 at 01:41, Fujii Masao  wrote:
>
>
>
> On 2020/09/17 15:56, Amit Kapila wrote:
> > On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao  
> > wrote:
> >>
>  One alternative is to add only hooks into PostgreSQL core so that we can
>  implement the global transaction management outside. This idea was
>  discussed before as the title "eXtensible Transaction Manager API".
> >>>
> >>> Yeah, I read that discussion.  And I remember Robert Haas and Postgres 
> >>> Pro people said it's not good...
> >>
> >> But it may be worth revisiting this idea if we cannot avoid the patent 
> >> issue.
> >>
> >
> > It is not very clear what exactly we can do about the point raised by
> > Tsunakawa-san related to the patent in this technology, as I haven't seen
> > that discussed during other development, but maybe we can try to study
> > it a bit. One more thing I would like to bring up here is that it seems
> > there have been some concerns about this idea when it was originally
> > discussed [1]. It is not very clear to me if all the concerns have been
> > addressed or not. If someone can summarize the concerns discussed and how
> > the latest patch is able to address them, that will be great.
>
> I have one concern about Clock-SI (sorry if this concern was already
> discussed in the past). As far as I read the paper about Clock-SI, ISTM that
> Tx2 that starts after Tx1's commit can fail to see the results by Tx1,
> due to the clock skew. Please see the following example;
>
> 1. Tx1 starts at the server A.
>
> 2. Tx1 writes some records at the server A.
>
> 3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
>   the commit at the server A.
>   This means that Tx1 is the local transaction, not distributed one.
>
> 4. Tx2 starts at the server B, i.e., the server B works as
>   the coordinator node for Tx2.
>
> 5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
>   due to clock skew) and uses 10 as SnapshotTime at the server B.
>
> 6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.
>
> 7. Tx2 doesn't need to wait due to clock skew because the imported
>   SnapshotTime 10 is smaller than the local clock at the server A.
>
> 8. Tx2 fails to see the records written by Tx1 at the server A because
>   Tx1's CommitTime 20 is larger than SnapshotTime 10.
>
> So Tx1 was successfully committed before Tx2 started. But in the above
> example, the subsequent transaction Tx2 fails to see the committed results.
>
> A single PostgreSQL instance seems to guarantee linearizability of the
> transactions, but Clock-SI doesn't in the distributed environment. Is my
> understanding right? Or am I missing something?
>
> If my understanding is right, shouldn't we address that issue when using
> Clock-SI? Or has the patch already addressed the issue?

As far as I read the paper, the above scenario can happen. I could
reproduce the above scenario with the patch. Moreover, a stale read
could happen even if Tx1 was initiated at server B (i.e., both
transactions started at the same server in sequence). In this case,
Tx1's commit timestamp would be 20, taken from server A's local clock,
whereas Tx2's snapshot timestamp would be 10, the same as in the above case.
Therefore, even though both transactions were initiated at the same
server, linearizability is not provided.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global snapshots

2020-10-14 Thread Fujii Masao




On 2020/09/17 15:56, Amit Kapila wrote:

On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao  wrote:



One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before as the title "eXtensible Transaction Manager API".


Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
people said it's not good...


But it may be worth revisiting this idea if we cannot avoid the patent issue.



It is not very clear what exactly we can do about the point raised by
Tsunakawa-san related to the patent in this technology, as I haven't seen
that discussed during other development, but maybe we can try to study
it a bit. One more thing I would like to bring up here is that it seems
there have been some concerns about this idea when it was originally
discussed [1]. It is not very clear to me if all the concerns have been
addressed or not. If someone can summarize the concerns discussed and how
the latest patch is able to address them, that will be great.


I have one concern about Clock-SI (sorry if this concern was already
discussed in the past). As far as I read the paper about Clock-SI, ISTM that
Tx2 that starts after Tx1's commit can fail to see the results by Tx1,
due to the clock skew. Please see the following example;

1. Tx1 starts at the server A.

2. Tx1 writes some records at the server A.

3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
 the commit at the server A.
 This means that Tx1 is the local transaction, not distributed one.

4. Tx2 starts at the server B, i.e., the server B works as
 the coordinator node for Tx2.

5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
 due to clock skew) and uses 10 as SnapshotTime at the server B.

6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.

7. Tx2 doesn't need to wait due to clock skew because the imported
 SnapshotTime 10 is smaller than the local clock at the server A.

8. Tx2 fails to see the records written by Tx1 at the server A because
 Tx1's CommitTime 20 is larger than SnapshotTime 10.

So Tx1 was successfully committed before Tx2 started. But in the above example,
the subsequent transaction Tx2 fails to see the committed results.

A single PostgreSQL instance seems to guarantee linearizability of the
transactions, but Clock-SI doesn't in the distributed environment. Is my
understanding right? Or am I missing something?

If my understanding is right, shouldn't we address that issue when using
Clock-SI? Or has the patch already addressed the issue?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Global snapshots

2020-09-27 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey-san, all,

From: tsunakawa.ta...@fujitsu.com 
> And please wait.  As below, the patent holder just says that Clock-SI is not
> based on the patent and was an independent development.  He doesn't say
> Clock-SI does not overlap with the patent or that implementing Clock-SI does
> not infringe on it.  Rather, he suggests that Clock-SI has many
> similarities and thus may match the claims of the patent
> (unintentionally?).  I felt this is a sign of infringement risk.
> 
> "The answer to your question is: No, Clock-SI is not based on the patent - it
> was an entirely independent development. The two approaches are similar in
> the sense that there is no global clock, the commit time of a distributed
> transaction is the same in every partition where it modified data, and a
> transaction gets its snapshot timestamp from a local clock. The difference is
> whether a distributed transaction gets its commit timestamp before or after 
> the
> prepare phase in 2PC."
> 
> The timeline of events also worries me.  It seems unnatural to consider that
> Clock-SI and the patent are independent.
> 
> 2010/6 - 2010/9  One Clock-SI author worked for Microsoft Research as
> a research intern
> 2010/10  Microsoft filed the patent
> 2011/9 - 2011/12  The same Clock-SI author worked for Microsoft
> Research as a research intern
> 2013  The same author moved to EPFL and published the Clock-SI paper
> with another author who has worked for Microsoft Research since then.
> 
> So, could you give your opinion on whether we can use Clock-SI without
> overlapping with the patent claims?  I will also try to check, so that I
> can understand your technical analysis.
> 
> And I've just noticed that I got in touch with another author of Clock-SI via 
> SNS,
> and sent an inquiry to him.  I'll report again when I have a reply.

I got a reply from the main author of the Clock-SI paper:

[Reply from the Clock-SI author Jiaqing Du]
--
Thanks for reaching out.

I actually did not know that Microsoft wrote a patent which is  similar to the 
ideas in my paper. I worked there as an intern. My Clock-SI paper was done at 
my school (EPFL) after my internships at Microsoft. The paper was very loosely 
related to my internship project at Microsoft. In a sense, the internship 
project at Microsoft inspired me to work on Clock-SI after I finished the 
internship. As you see in the paper, my coauthor, who is my internship host, is 
also from Microsoft, but interestingly he is not on the patent :)

Cheers,
Jiaqing
--


Unfortunately, he also did not assert that Clock-SI does not infringe on the
patent.  Rather, his reply mixes in worrying words: "similar to the ideas in
my paper", "loosely related", "inspired".

Also, his internship host is the co-author of the Clock-SI paper.  That person
should be Sameh Elnikety, who has been working for Microsoft Research.  I also
asked him the same question, but he has been silent for about 10 days.

When I had a quick look, the patent appeared to be broader than Clock-SI, with
Clock-SI being a concrete application of the patent.  This is just my guess,
but Sameh Elnikety may have known of the patent and set the internship theme
at Microsoft, or the research subject at EPFL, based on it, whether
consciously or not.

As of now, it seems that Clock-SI needs to be evaluated against the patent
claims by two or more persons -- one being someone who knows Clock-SI well and
implemented it for Postgres (Andrey-san?), and someone else who shares little
benefit with the former person and can see it objectively.


Regards
Takayuki Tsunakawa



RE: Global snapshots

2020-09-22 Thread tsunakawa.ta...@fujitsu.com
From: Andrey Lepikhov 
> Thank you for this work!
> As far as I can see, the main development difficulties lie in other areas:
> CSN, the resolver, global deadlocks, 2PC commit... I'm not a lawyer either.
> But if we get remarks from the patent holders, we can rewrite our Clock-SI
> implementation.

Yeah, I understand your feeling.  I personally don't like patents, and don't
want to be disturbed by them.  But the world is not friendly...  We are not
lawyers, but we have to do our best to make sure PostgreSQL remains
patent-free by checking the technologies as engineers.

Among the above items, CSN is the only concerning one.  The other items are
written in textbooks, well-known, and used in other DBMSs, so they should be
free from patents.  However, CSN is not (at least to me).  Have you checked
whether CSN is related to some patent?  Or is CSN or a similar technology
already widely used in famous software, so that we can regard it as patent-free?

And please wait.  As below, the patent holder just says that Clock-SI is not
based on the patent and was an independent development.  He doesn't say
Clock-SI does not overlap with the patent or that implementing Clock-SI does
not infringe on it.  Rather, he suggests that Clock-SI has many similarities
and thus may match the claims of the patent (unintentionally?).  I felt this
is a sign of infringement risk.

"The answer to your question is: No, Clock-SI is not based on the patent - it 
was an entirely independent development. The two approaches are similar in the 
sense that there is no global clock, the commit time of a distributed 
transaction is the same in every partition where it modified data, and a 
transaction gets its snapshot timestamp from a local clock. The difference is
whether a distributed transaction gets its commit timestamp before or after the 
prepare phase in 2PC."

The timeline of events also worries me.  It seems unnatural to consider that 
Clock-SI and the patent are independent.

2010/6 - 2010/9  One Clock-SI author worked for Microsoft Research as a
research intern
2010/10  Microsoft filed the patent
2011/9 - 2011/12  The same Clock-SI author worked for Microsoft Research as
a research intern
2013  The same author moved to EPFL and published the Clock-SI paper with 
another author who has worked for Microsoft Research since then.

So, could you give your opinion on whether we can use Clock-SI without
overlapping with the patent claims?  I will also try to check, so that I can
understand your technical analysis.

And I've just noticed that I got in touch with another author of Clock-SI via 
SNS, and sent an inquiry to him.  I'll report again when I have a reply.


Regards
Takayuki Tsunakawa




Re: Global snapshots

2020-09-22 Thread Andrey Lepikhov




On 22.09.2020 03:47, tsunakawa.ta...@fujitsu.com wrote:

Does this make sense from your viewpoint, and can we think that we can use
Clock-SI without infringing on the patent?  According to the patent holder,
the differences between Clock-SI and the patent seem to be fewer than the
similarities.

Thank you for this work!
As far as I can see, the main development difficulties lie in other areas:
CSN, the resolver, global deadlocks, 2PC commit... I'm not a lawyer either.
But if we get remarks from the patent holders, we can rewrite our Clock-SI
implementation.


--
regards,
Andrey Lepikhov
Postgres Professional




RE: Global snapshots

2020-09-21 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey-san, all,

From: Andrey V. Lepikhov 
> On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:
> > Could you take a look at this patent?  I'm afraid this is the Clock-SI for 
> > MVCC.
> Microsoft holds this until 2031.  I couldn't find this with the keyword
> "Clock-SI".
> >
> >
> > US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> >
> > If it is, can we circumvent this patent?

> I haven't seen this patent before. This should be carefully studied.


I contacted six people individually: three holders of the patent and three
authors of the Clock-SI paper.  I got replies from two people.  (Regrettably,
I couldn't get a reply from the main author of the Clock-SI paper.)

[Reply from the patent holder Per-Ake Larson]
--
Thanks for your interest in my patent. 

The answer to your question is: No, Clock-SI is not based on the patent - it 
was an entirely independent development. The two approaches are similar in the 
sense that there is no global clock, the commit time of a distributed 
transaction is the same in every partition where it modified data, and a 
transaction gets its snapshot timestamp from a local clock. The difference is
whether a distributed transaction gets its commit timestamp before or after the 
prepare phase in 2PC.

Hope this is helpful.

Best regards,
Per-Ake
--


[Reply from the Clock-SI author Willy Zwaenepoel]
--
Thank you for your kind words about our work.

I was unaware of this patent at the time I wrote the paper. The two came out 
more or less at the same time.

I am not a lawyer, so I cannot tell you if something based on Clock-SI would 
infringe on the Microsoft patent. The main distinction to me seems to be that 
Clock-SI is based on physical clocks, while the Microsoft patent talks about 
logical clocks, but again I am not a lawyer.

Best regards,

Willy.
--


Does this make sense from your viewpoint, and can we think that we can use
Clock-SI without infringing on the patent?  According to the patent holder,
the differences between Clock-SI and the patent seem to be fewer than the
similarities.
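
For the comparison's sake, here is a schematic C sketch of that single
difference as I read his words -- hypothetical stubs only, not either real
algorithm:

#include <stdio.h>

static long clock_value = 100;

static long read_local_clock(void) { return ++clock_value; }
static long max_of_votes(void) { return 112; }  /* placeholder vote maximum */
static void request_votes(void) { }
static void prepare_participants(long ts) { (void) ts; }
static void broadcast_commit(long ts) { printf("commit @ %ld\n", ts); }

/* Patent (claim 11): votes are collected during the prepare phase, and the
 * global commit timestamp is computed AFTER them, as the votes' maximum. */
static void
commit_patent_style(void)
{
    request_votes();
    broadcast_commit(max_of_votes());
}

/* Clock-SI, as I read the reply: the commit timestamp is taken from a local
 * clock BEFORE the prepare phase; participants wait out any clock skew past
 * that timestamp instead of voting on it. */
static void
commit_clock_si_style(void)
{
    long ts = read_local_clock();

    prepare_participants(ts);
    broadcast_commit(ts);
}

int
main(void)
{
    commit_patent_style();
    commit_clock_si_style();
    return 0;
}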


Regards
Takayuki Tsunakawa



Re: Global snapshots

2020-09-21 Thread Alexey Kondratov

On 2020-09-18 00:54, Bruce Momjian wrote:

On Tue, Sep  8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:

Thank you for the link!

After a quick look at Sawada-san's patch set I think that there are two
major differences:

1. There is a built-in foreign xacts resolver in [1], which should be
much more convenient from the end-user perspective. It involves huge
in-core changes and additional complexity that is of course worth it.

However, it's still not clear to me that it is possible to resolve all
foreign prepared xacts on Postgres' own side with a 100% guarantee.
Imagine a situation where the coordinator node is actually an HA cluster
group (primary + sync + async replica) and it failed just after the
PREPARE stage or after the local COMMIT. In that case all foreign xacts
will be left in the prepared state. After the failover process completes,
the synchronous replica will become the new primary. Would it have all
the required info to properly resolve orphaned prepared xacts?

Probably, this situation is handled properly in [1], but I've not yet
finished a thorough reading of the patch set, though it has a great doc!

On the other hand, the previous 0003 and my proposed patch rely on either
manual resolution of hung prepared xacts or usage of an external
monitor/resolver. This approach is much simpler from the in-core
perspective, but doesn't look as complete as [1] though.


Have we considered how someone would clean up foreign transactions if the
coordinating server dies?  Could it be done manually?  Would an external
resolver, rather than an internal one, make this easier?


Both Sawada-san's patch [1] and the patches in this thread (e.g. mine [2])
use 2PC with a special gid format including an xid + server identification
info. Thus, one can select from pg_prepared_xacts, get the xid and coordinator
info, then use txid_status() on the coordinator (or ex-coordinator) to get the
transaction status, and finally either commit or abort these stale prepared
xacts. Of course this could be wrapped into some user-level support routines,
as is done in [1].
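
For illustration, here is a minimal libpq sketch of that manual procedure.
This is my own sketch: the gid format '1234:node-A', the host names, and the
lack of error handling are made up for brevity, and both patches use their
own gid encodings.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Placeholder connection strings. */
    PGconn *coord = PQconnectdb("host=coordinator dbname=postgres");
    PGconn *shard = PQconnectdb("host=shard1 dbname=postgres");

    if (PQstatus(coord) != CONNECTION_OK || PQstatus(shard) != CONNECTION_OK)
        exit(1);

    /* Suppose pg_prepared_xacts on the shard showed gid '1234:node-A',
     * i.e. xid 1234 prepared on behalf of coordinator node-A. */
    const char *params[1] = {"1234"};

    /* Ask the (ex-)coordinator what happened to that transaction.
     * Note that txid_status() returns NULL if the xid is too old. */
    PGresult *res = PQexecParams(coord, "SELECT txid_status($1::bigint)",
                                 1, NULL, params, NULL, NULL, 0);
    int committed = strcmp(PQgetvalue(res, 0, 0), "committed") == 0;

    /* Finish the orphaned prepared transaction on the shard accordingly. */
    PGresult *fin = PQexec(shard, committed
                           ? "COMMIT PREPARED '1234:node-A'"
                           : "ROLLBACK PREPARED '1234:node-A'");

    printf("resolved '1234:node-A' via %s PREPARED\n",
           committed ? "COMMIT" : "ROLLBACK");

    PQclear(res);
    PQclear(fin);
    PQfinish(coord);
    PQfinish(shard);
    return 0;
}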


As for the benefits of using an external resolver, I think that there
are some from the whole-system perspective:

1) If one follows the logic above, then this resolver could be
stateless; it takes all the required info from the Postgres nodes
themselves.

2) Then you can easily put it into a container, which makes it easier to
deploy to all this 'cloud' stuff like Kubernetes.

3) Also, you can scale resolvers independently from the Postgres nodes.

I do not think that any of these points is a game changer, but we use
a very simple external resolver together with [2] in our sharding
prototype and it works just fine so far.



[1] 
https://www.postgresql.org/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com


[2] 
https://www.postgresql.org/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-17 Thread Bruce Momjian
On Tue, Sep  8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:
> Thank you for the link!
> 
> After a quick look at Sawada-san's patch set I think that there are two
> major differences:
> 
> 1. There is a built-in foreign xacts resolver in [1], which should be
> much more convenient from the end-user perspective. It involves huge in-core
> changes and additional complexity that is of course worth it.
> 
> However, it's still not clear to me that it is possible to resolve all
> foreign prepared xacts on Postgres' own side with a 100% guarantee.
> Imagine a situation where the coordinator node is actually an HA cluster group
> (primary + sync + async replica) and it failed just after the PREPARE stage or
> after the local COMMIT. In that case all foreign xacts will be left in the
> prepared state. After the failover process completes, the synchronous replica
> will become the new primary. Would it have all the required info to properly
> resolve orphaned prepared xacts?
> 
> Probably, this situation is handled properly in [1], but I've not yet
> finished a thorough reading of the patch set, though it has a great doc!
> 
> On the other hand, the previous 0003 and my proposed patch rely on either
> manual resolution of hung prepared xacts or usage of an external
> monitor/resolver. This approach is much simpler from the in-core perspective,
> but doesn't look as complete as [1] though.

Have we considered how someone would clean up foreign transactions if the
coordinating server dies?  Could it be done manually?  Would an external
resolver, rather than an internal one, make this easier?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Global snapshots

2020-09-16 Thread Amit Kapila
On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao  wrote:
>
> >> One alternative is to add only hooks into PostgreSQL core so that we can
> >> implement the global transaction management outside. This idea was
> >> discussed before as the title "eXtensible Transaction Manager API".
> >
> > Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
> > people said it's not good...
>
> But it may be worth revisiting this idea if we cannot avoid the patent issue.
>

It is not very clear what exactly we can do about the point raised by
Tsunakawa-san related to the patent in this technology, as I haven't seen
that discussed during other development, but maybe we can try to study
it a bit. One more thing I would like to bring up here is that it seems
there have been some concerns about this idea when it was originally
discussed [1]. It is not very clear to me if all the concerns have been
addressed or not. If someone can summarize the concerns discussed and how
the latest patch is able to address them, that will be great.

Also, I am not sure, but maybe global deadlock detection also needs to
be considered, as that also seems to be related because it depends on
how we manage global transactions. We need to prevent deadlocks among
transaction operations spanning multiple nodes. Say a
transaction T-1 has updated row r-1 of tbl-1 on node-1 and tries to
update row r-1 of tbl-2 on node-2. Similarly, a transaction T-2
tries to perform those two operations in reverse order. Now, this will
lead to a deadlock that spans multiple nodes, and our current
deadlock detector doesn't have that capability. Having some form of
global/distributed transaction id might help to resolve it, but I am
not sure how it can be solved with this Clock-SI based algorithm.
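
To see why a node-local detector misses it, here is a small C sketch (my
illustration only) that merges the two nodes' wait-for edges from the example
above and finds the cycle:

#include <stdio.h>
#include <stdbool.h>

#define NTX 2   /* transactions T-1 (index 0) and T-2 (index 1) */

/* waits_for[i][j]: transaction i waits for a lock held by transaction j. */
static bool waits_for[NTX][NTX];

/* Depth-first search over the merged wait-for graph. */
static bool
reaches(int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int j = 0; j < NTX; j++)
        if (waits_for[from][j] && !seen[j] && reaches(j, to, seen))
            return true;
    return false;
}

static bool
has_global_deadlock(int tx)
{
    bool seen[NTX] = {false};

    for (int j = 0; j < NTX; j++)
        if (waits_for[tx][j] && reaches(j, tx, seen))
            return true;
    return false;
}

int
main(void)
{
    /* Edge visible only on node-1: T-2 waits for T-1 (row r-1 of tbl-1). */
    waits_for[1][0] = true;
    /* Edge visible only on node-2: T-1 waits for T-2 (row r-1 of tbl-2). */
    waits_for[0][1] = true;

    /* Each node alone sees a single edge, hence no cycle; only the union
     * of the two nodes' edges reveals the deadlock. */
    printf("global deadlock involving T-1: %s\n",
           has_global_deadlock(0) ? "yes" : "no");  /* prints "yes" */
    return 0;
}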

As all these problems are related, that is why I am insisting, on this
thread and the other thread "Transactions involving multiple postgres
foreign servers" [2], that we have a high-level idea of how distributed
transaction management will work before we decide on a particular
approach and commit one part of that patch.

[1] - 
https://www.postgresql.org/message-id/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
[2] - 
https://www.postgresql.org/message-id/CAA4eK1J86S%3DmeivVsH%2Boy%3DTwUC%2Byr9jj2VtmmqMfYRmgs2JzUA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Global snapshots

2020-09-10 Thread Alexey Kondratov

On 2020-09-09 20:29, Fujii Masao wrote:

On 2020/09/09 2:00, Alexey Kondratov wrote:


According to Sawada-san's v25 0002, the logic is pretty much the
same there:


+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
FdwXactParticipantEndTransaction happens after
ProcArrayEndTransaction() in CommitTransaction(). Thus, I don't
see much difference between this approach and the CallXactCallbacks()
usage regarding this point.


IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
2-1. WAL-log the commit record
2-2. Update CLOG
2-3. Wait for sync rep
2-4. FdwXactWaitForResolution()
Wait until COMMIT PREPARED commands are issued to the
remote servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?



No, you are right, sorry. At first glance I thought that
AtEOXact_FdwXact was responsible for COMMIT PREPARED as well, but it
only calls FdwXactParticipantEndTransaction in the abort case.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-10 Thread Fujii Masao




On 2020/09/10 18:01, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

But I'm concerned that it's really hard to say there is no patent risk
around that. I'm not sure who in the community can judge that there is no
patent risk. Maybe no one? Anyway, I was thinking that Google Spanner,
YugabyteDB, etc. use a global transaction approach based on a clock
similar to Clock-SI's. Since I've never heard that they have patent issues,
I was just assuming Clock-SI doesn't either. No? This type of *guess* is not
safe, though...


Hm, it may be difficult to be sure that the algorithm does not violate a
patent.  But it may not be difficult to know whether the algorithm apparently
violates a patent or is highly likely to (for those who know Clock-SI well).
At least, Andrey-san seems to have felt that it needs careful study, so I
guess he had some hunch.

I understand this community is sensitive to patents.  After the discussions at
and after PGCon 2018, the community concluded that it won't accept patented
technology.  In the distant past, the community released Postgres 8.0, which
contained the ARC algorithm covered by a pending IBM patent, and removed it in
8.0.2.  I wonder how this could be detected, and how hard it is to cope with
the patent issue.  Bruce warned that we should be careful not to violate
Greenplum's patents.

E.25. Release 8.0.2
https://www.postgresql.org/docs/8.0/release-8-0-2.html
--
New cache management algorithm 2Q replaces ARC (Tom)
This was done to avoid a pending US patent on ARC. The 2Q code might be a few 
percentage points slower than ARC for some work loads. A better cache 
management algorithm will appear in 8.1.
--


I think I'll try to contact the people listed in the Clock-SI paper and the
Microsoft patent to ask about this.


Thanks!



I'm going to have a late summer vacation next week, so this is my summer 
homework?



One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before as the title "eXtensible Transaction Manager API".


Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
people said it's not good...


But it may be worth revisiting this idea if we cannot avoid the patent issue.
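
As a strawman of what "only hooks" could look like, here is a tiny C sketch.
The names are entirely hypothetical -- this is not the actual eXtensible
Transaction Manager API that was proposed -- but it shows the general shape:
core calls function pointers that an extension may override.

#include <stdint.h>

typedef uint64_t CSN;

/* A table of callbacks the core would invoke at snapshot and commit time;
 * an extension could swap in a distributed implementation. */
typedef struct TransactionManager
{
    CSN  (*get_snapshot_csn) (void);   /* snapshot acquisition */
    CSN  (*assign_commit_csn) (void);  /* commit-time CSN assignment */
    void (*wait_for_csn) (CSN csn);    /* cross-node synchronization */
} TransactionManager;

/* Built-in local behavior, used when no extension takes over. */
static CSN local_clock = 0;

static CSN  local_snapshot_csn(void) { return local_clock; }
static CSN  local_commit_csn(void) { return ++local_clock; }
static void local_wait_csn(CSN csn) { (void) csn; /* no-op locally */ }

static TransactionManager current_tm = {
    local_snapshot_csn,
    local_commit_csn,
    local_wait_csn,
};

/* Core code always goes through the table... */
CSN
GetSnapshotCSN(void)
{
    return current_tm.get_snapshot_csn();
}

/* ...and an extension implementing global transaction management would
 * install its own callbacks, e.g. from its _PG_init(). */
void
SetTransactionManager(const TransactionManager *tm)
{
    current_tm = *tm;
}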

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Global snapshots

2020-09-10 Thread tsunakawa.ta...@fujitsu.com
From: Fujii Masao 
> But I'm concerned that it's really hard to say there is no patent risk
> around that. I'm not sure who in the community can judge that there is no
> patent risk. Maybe no one? Anyway, I was thinking that Google Spanner,
> YugabyteDB, etc. use a global transaction approach based on a clock
> similar to Clock-SI's. Since I've never heard that they have patent issues,
> I was just assuming Clock-SI doesn't either. No? This type of *guess* is not
> safe, though...

Hm, it may be difficult to be sure that the algorithm does not violate a
patent.  But it may not be difficult to know whether the algorithm apparently
violates a patent or is highly likely to (for those who know Clock-SI well).
At least, Andrey-san seems to have felt that it needs careful study, so I
guess he had some hunch.

I understand this community is sensitive to patents.  After the discussions at
and after PGCon 2018, the community concluded that it won't accept patented
technology.  In the distant past, the community released Postgres 8.0, which
contained the ARC algorithm covered by a pending IBM patent, and removed it in
8.0.2.  I wonder how this could be detected, and how hard it is to cope with
the patent issue.  Bruce warned that we should be careful not to violate
Greenplum's patents.

E.25. Release 8.0.2
https://www.postgresql.org/docs/8.0/release-8-0-2.html
--
New cache management algorithm 2Q replaces ARC (Tom)
This was done to avoid a pending US patent on ARC. The 2Q code might be a few 
percentage points slower than ARC for some work loads. A better cache 
management algorithm will appear in 8.1.
--


I think I'll try to contact the people listed in the Clock-SI paper and the
Microsoft patent to ask about this.  I'm going to have a late summer vacation
next week, so this is my summer homework?


> One alternative is to add only hooks into PostgreSQL core so that we can
> implement the global transaction management outside. This idea was
> discussed before as the title "eXtensible Transaction Manager API".

Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro 
people said it's not good...


Regards
Takayuki Tsunakawa




Re: Global snapshots

2020-09-10 Thread Fujii Masao




On 2020/09/10 10:38, tsunakawa.ta...@fujitsu.com wrote:

Hi Andrey-san,

From: Andrey V. Lepikhov 
> From: tsunakawa.ta...@fujitsu.com 

While Clock-SI seems to be considered the most promising for global

Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.

Microsoft holds this until 2031.  I couldn't find this with the keyword
"Clock-SI".



US8356007B2 - Distributed transaction management for database systems

with multiversioning - Google Patents

https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?



Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.


I wanted to ask about this after I've published the revised scale-out design
wiki, but I'm taking too long, so could you share your study results?  I think
we need to clear up the patent question before discussing the code.


Yes.

But I'm concerned that it's really hard to say there is no patent risk
around that. I'm not sure who can judge that there is no patent risk
in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
YugabyteDB, etc. use a clock-based global transaction approach similar
to Clock-SI. Since I've never heard that they have patent issues,
I was just assuming Clock-SI doesn't either. No? This type of *guess* is not
safe, though...



After we hear your opinion, we also have to check whether Clock-SI is
patented, or avoid the patent by modifying part of the algorithm.  Just in case
we cannot use it, we have to proceed with thinking about alternatives.


One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before under the title "eXtensible Transaction Manager API".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Global snapshots

2020-09-09 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey san,

From: Andrey V. Lepikhov 
> > From: tsunakawa.ta...@fujitsu.com 
> > While Clock-SI seems to be considered the most promising for global
> > serializability here,
> > Could you take a look at this patent?  I'm afraid this is the Clock-SI
> > for MVCC.  Microsoft holds this until 2031.  I couldn't find this with
> > the keyword "Clock-SI."
> >
> > US8356007B2 - Distributed transaction management for database systems
> > with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> > If it is, can we circumvent this patent?
>
> Thank you for the research (and previous links too).
> I haven't seen this patent before. This should be carefully studied.

I wanted to ask about this after I've published the revised scale-out design
wiki, but I'm taking too long, so could you share your study results?  I think
we need to clear up the patent question before discussing the code.  After we
hear your opinion, we also have to check whether Clock-SI is patented, or
avoid the patent by modifying part of the algorithm.  Just in case we cannot
use it, we have to proceed with thinking about alternatives.


Regards
Takayuki Tsunakawa




Re: Global snapshots

2020-09-09 Thread Fujii Masao




On 2020/09/09 2:00, Alexey Kondratov wrote:

On 2020-09-08 14:48, Fujii Masao wrote:

On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds 
a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues 
above and tries to add proper comments everywhere. I think, that 0003 should be 
rebased on the top of it, or it could be a first patch in the set, since it may 
be used independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look at Sawada-san's patch set I think that there are two
major differences:


Thanks for sharing your thoughts! As far as I read your patch quickly,
I basically agree with this view.




1. There is a built-in foreign xacts resolver in the [1], which should be much
more convenient from the end-user perspective. It involves huge in-core changes
and additional complexity that is, of course, worth it.

However, it's still not clear to me that it is possible to resolve all foreign
prepared xacts on Postgres' own side with a 100% guarantee. Imagine a
situation when the coordinator node is actually an HA cluster group (primary +
sync + async replica) and it failed just after the PREPARE stage or after local
COMMIT. In that case all foreign xacts will be left in the prepared state.
After the failover completes, the synchronous replica will become the new
primary. Would it have all required info to properly resolve orphan prepared
xacts?


IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.



I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear to me.



Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()).



According to the Sawada-san's v25 0002 the logic is pretty much the same there:

+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() in
CommitTransaction(). Thus, I don't see much difference between this approach
and the CallXactCallbacks() usage regarding this point.


IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
2-1. WAL-log the commit record
2-2. Update CLOG
2-3. Wait for sync rep
2-4. FdwXactWaitForResolution()
Wait until COMMIT PREPARED commands are issued to the remote 
servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?
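
To make that ordering difference concrete, here is a minimal, runnable C mock
(this is not PostgreSQL source; every function below is a stand-in that merely
prints its place in the sequence):

#include <stdio.h>

static void wal_log_commit(void)       { puts("WAL-log commit record, update CLOG"); }
static void wait_commit_prepared(void) { puts("issue COMMIT PREPARED to remotes and wait"); }
static void proc_array_end_xact(void)  { puts("ProcArrayEndTransaction(): xact now visible as committed"); }
static void commit_callbacks(void)     { puts("CallXactCallbacks(XACT_EVENT_COMMIT): issue COMMIT PREPARED"); }

int main(void)
{
    puts("-- postgres_fdw-only patch (callback approach) --");
    wal_log_commit();
    proc_array_end_xact();   /* local transaction committed first ...      */
    commit_callbacks();      /* ... remotes committed only afterwards      */

    puts("");
    puts("-- Sawada-san's patch, as read above --");
    wal_log_commit();
    wait_commit_prepared();  /* FdwXactWaitForResolution() happens here    */
    proc_array_end_xact();   /* remotes already committed at this point    */
    return 0;
}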

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Global snapshots

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 08:35, Masahiko Sawada wrote:

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
 wrote:


On 2020-09-08 14:48, Fujii Masao wrote:
>
> IIUC, yes, the information required for automatic resolution is
> WAL-logged and the standby tries to resolve those orphan transactions
> from WAL after the failover. But Sawada-san's patch provides
> the special function for manual resolution, so there may be some cases
> where manual resolution is necessary.
>

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear to me.


Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that was not resolved for some reason, e.g., a bug.



OK, thank you for the explanation!



Once the transaction is committed locally any ERROR (or higher level
message) will be escalated to PANIC.


I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?



It's not actually related to critical-section error escalation. Any
error in the backend after the local commit and
ProcArrayEndTransaction() will try to abort the current transaction and
do RecordTransactionAbort(), but it's too late to do so and a PANIC will
be raised:


/*
 * Check that we haven't aborted halfway through RecordTransactionCommit.
 */
if (TransactionIdDidCommit(xid))
    elog(PANIC, "cannot abort transaction %u, it was already committed",
         xid);

At least that's how I understand it.



And I do see possible ERROR-level
messages in postgresCommitForeignTransaction(), for example:

+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+       ereport(ERROR, (errmsg("could not commit transaction on server %s",
+                              frstate->server->servername)));


I don't think that it's very convenient to get a PANIC every time we
fail to commit one of the prepared foreign xacts, since that could be
not so rare in a distributed system. That's why I tried to get rid of
possible ERRORs as far as possible in my proposed patch.



In my patch, the second phase of 2PC is executed only by the resolver
process. Therefore, even if an error happens while committing a
foreign prepared transaction, we just need to relaunch the resolver
process and try again. During that, the backend process will just be
waiting. If a backend process raises an error after the local
commit, the client will see a transaction failure despite the local
transaction having been committed. An error could happen even in
palloc. So the patch uses a background worker, not the backend itself,
to commit prepared foreign transactions.
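
For intuition, a sketch of such a resolver retry loop (every name here is
hypothetical, not the patch's actual code; only pg_usleep() is a real
primitive):

typedef struct FdwXact FdwXact;     /* opaque; hypothetical */

/*
 * Sketch: the second phase of 2PC is retried until it succeeds, so a
 * commit failure on a remote never escalates past the worker itself.
 * next_unresolved_fdwxact() and try_commit_prepared() are hypothetical.
 */
static void
resolver_main_loop(void)
{
    for (;;)
    {
        FdwXact *fx = next_unresolved_fdwxact();

        if (fx == NULL)
        {
            pg_usleep(100000);      /* nothing to do; 100 ms nap */
            continue;
        }
        if (!try_commit_prepared(fx))
            pg_usleep(1000000);     /* remote unreachable; retry in 1 s */
    }
}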



Yes, if it's a background process, then it seems to be safe.

BTW, it seems that I've chosen the wrong thread for posting my patch and
starting a discussion :) Activity from this thread moved to [1] and your
solution with a built-in resolver is discussed in [2]. I'll try to take a
closer look at v25 and write to [2] instead.



[1] 
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca


[2] 
https://www.postgresql.org/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-08 Thread Masahiko Sawada
On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
 wrote:
>
> On 2020-09-08 14:48, Fujii Masao wrote:
> > On 2020/09/08 19:36, Alexey Kondratov wrote:
> >> On 2020-09-08 05:49, Fujii Masao wrote:
> >>> On 2020/09/05 3:31, Alexey Kondratov wrote:
> 
>  Attached is a patch, which implements a plain 2PC in the
>  postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it
>  solves these errors handling issues above and tries to add proper
>  comments everywhere. I think, that 0003 should be rebased on the top
>  of it, or it could be a first patch in the set, since it may be used
>  independently. What do you think?
> >>>
> >>> Thanks for the patch!
> >>>
> >>> Sawada-san was proposing another 2PC patch at [1]. Do you have any
> >>> thoughts
> >>> about pros and cons between your patch and Sawada-san's?
> >>>
> >>> [1]
> >>> https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com
> >>
> >> Thank you for the link!
> >>
> >> After a quick look at Sawada-san's patch set I think that there
> >> are two major differences:
> >
> > Thanks for sharing your thought! As far as I read your patch quickly,
> > I basically agree with this view.
> >
> >
> >>
> >> 1. There is a built-in foreign xacts resolver in the [1], which should
> >> be much more convenient from the end-user perspective. It involves
> >> huge in-core changes and additional complexity that is, of course,
> >> worth it.
> >>
> >> However, it's still not clear to me that it is possible to resolve
> >> all foreign prepared xacts on Postgres' own side with a 100%
> >> guarantee. Imagine a situation when the coordinator node is actually an
> >> HA cluster group (primary + sync + async replica) and it failed just
> >> after the PREPARE stage or after local COMMIT. In that case all foreign
> >> xacts will be left in the prepared state. After the failover
> >> completes, the synchronous replica will become the new primary. Would it have
> >
> > IIUC, yes, the information required for automatic resolution is
> > WAL-logged and the standby tries to resolve those orphan transactions
> > from WAL after the failover. But Sawada-san's patch provides
> > the special function for manual resolution, so there may be some cases
> > where manual resolution is necessary.
> >
>
> I've found a note about manual resolution in the v25 0002:
>
> +After that we prepare all foreign transactions by calling
> +PrepareForeignTransaction() API. If we failed on any of them we change to
> +rollback, therefore at this time some participants might be prepared whereas
> +some are not prepared. The former foreign transactions need to be resolved
> +using pg_resolve_foreign_xact() manually and the latter ends transaction
> +in one-phase by calling RollbackForeignTransaction() API.
>
> but it's not yet clear to me.

Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that was not resolved for some reason, e.g., a bug.

>
> >
> > Implementing 2PC feature only inside postgres_fdw seems to cause
> > another issue; COMMIT PREPARED is issued to the remote servers
> > after marking the local transaction as committed
> > (i.e., ProcArrayEndTransaction()).
> >
>
> According to the Sawada-san's v25 0002 the logic is pretty much the same
> there:
>
> +2. Pre-Commit phase (1st phase of two-phase commit)
>
> +3. Commit locally
> +Once we've prepared all of them, commit the transaction locally.
>
> +4. Post-Commit Phase (2nd phase of two-phase commit)
>
> A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
> FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
> in CommitTransaction(). Thus, I don't see much difference between
> this approach and the CallXactCallbacks() usage regarding this point.
>
> > Is this safe? This issue happens
> > because COMMIT PREPARED is issued via
> > CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
> > is called after ProcArrayEndTransaction().
> >
>
> Once the transaction is committed locally any ERROR (or higher level
> message) will be escalated to PANIC.

I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?

> And I do see possible ERROR level
> messages in the postgresCommitForeignTransaction() for example:
>
> +   if (PQresultStatus(res) != PGRES_COMMAND_OK)
> +   ereport(ERROR, (errmsg("could not commit transaction on 
> server %

Re: Global snapshots

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 14:48, Fujii Masao wrote:

On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the 
postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it 
solves these errors handling issues above and tries to add proper 
comments everywhere. I think, that 0003 should be rebased on the top 
of it, or it could be a first patch in the set, since it may be used 
independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any 
thoughts

about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look at Sawada-san's patch set I think that there
are two major differences:


Thanks for sharing your thoughts! As far as I read your patch quickly,
I basically agree with this view.




1. There is a built-in foreign xacts resolver in the [1], which should
be much more convenient from the end-user perspective. It involves
huge in-core changes and additional complexity that is, of course,
worth it.


However, it's still not clear to me that it is possible to resolve
all foreign prepared xacts on Postgres' own side with a 100%
guarantee. Imagine a situation when the coordinator node is actually an
HA cluster group (primary + sync + async replica) and it failed just
after the PREPARE stage or after local COMMIT. In that case all foreign
xacts will be left in the prepared state. After the failover
completes, the synchronous replica will become the new primary. Would it have
all required info to properly resolve orphan prepared xacts?


IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.



I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear to me.



Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()).



According to the Sawada-san's v25 0002 the logic is pretty much the same 
there:


+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
in CommitTransaction(). Thus, I don't see much difference between
this approach and the CallXactCallbacks() usage regarding this point.



Is this safe? This issue happens
because COMMIT PREPARED is issued via
CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
is called after ProcArrayEndTransaction().



Once the transaction is committed locally, any ERROR (or higher-level
message) will be escalated to PANIC. And I do see possible ERROR-level
messages in postgresCommitForeignTransaction(), for example:


+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+       ereport(ERROR, (errmsg("could not commit transaction on server %s",
+                              frstate->server->servername)));

I don't think that it's very convenient to get a PANIC every time we
fail to commit one of the prepared foreign xacts, since that could be
not so rare in a distributed system. That's why I tried to get rid of
possible ERRORs as far as possible in my proposed patch.
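
A rough sketch of that "no ERROR after local commit" rule (PQexec(),
PQresultStatus(), PQclear() and ereport() are the real APIs; the function
itself and its arguments are hypothetical):

#include "postgres.h"
#include "libpq-fe.h"

/*
 * Sketch: commit one remote prepared transaction after the local commit.
 * Since the local transaction can no longer be aborted, never raise ERROR
 * here; demote failures to WARNING and leave the remote xact prepared for
 * later (manual or external) resolution.
 */
static void
commit_prepared_no_error(PGconn *conn, const char *gid, const char *servername)
{
    char        sql[256];
    PGresult   *res;

    snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
    res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        ereport(WARNING,
                (errmsg("could not commit transaction on server %s, leaving it prepared",
                        servername)));
    PQclear(res);
}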



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-08 Thread Fujii Masao




On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds 
a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues 
above and tries to add proper comments everywhere. I think, that 0003 should be 
rebased on the top of it, or it could be a first patch in the set, since it may 
be used independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look at Sawada-san's patch set I think that there are two
major differences:


Thanks for sharing your thoughts! As far as I read your patch quickly,
I basically agree with this view.




1. There is a built-in foreign xacts resolver in the [1], which should be much
more convenient from the end-user perspective. It involves huge in-core changes
and additional complexity that is, of course, worth it.

However, it's still not clear to me that it is possible to resolve all foreign
prepared xacts on Postgres' own side with a 100% guarantee. Imagine a
situation when the coordinator node is actually an HA cluster group (primary +
sync + async replica) and it failed just after the PREPARE stage or after local
COMMIT. In that case all foreign xacts will be left in the prepared state.
After the failover completes, the synchronous replica will become the new
primary. Would it have all required info to properly resolve orphan prepared
xacts?


IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.




Probably, this situation is handled properly in the [1], but I've not yet 
finished a thorough reading of the patch set, though it has a great doc!

On the other hand, previous 0003 and my proposed patch rely on either manual 
resolution of hung prepared xacts or usage of external monitor/resolver. This 
approach is much simpler from the in-core perspective, but doesn't look as 
complete as [1] though.

2. In the patch from this thread all 2PC logic sits in postgres_fdw, while
[1] tries to put it into the generic FDW core, which also feels like a more
general and architecturally correct way. However, how many of the dozens of
currently available FDWs are capable of performing 2PC? And how many of
them are maintained well enough to adopt this new API? This is not an argument
against [1] actually, since postgres_fdw is known to be the most advanced FDW
and an early adopter of new features; it's just a little doubt about the
usefulness of this preliminary generalisation.


If we implement the 2PC feature only for PostgreSQL sharding using
postgres_fdw, IMO it's ok to support only postgres_fdw.
But if we implement 2PC as an improvement of the FDW machinery independently
from PostgreSQL sharding and global visibility, I think that it's
necessary to support other FDWs. I'm not sure how many FDWs
will actually support this new 2PC interface. But if the interface is
not so complicated, I *guess* some FDWs will support it in the near future.

Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()). Is this safe? This issue happens
because COMMIT PREPARED is issued via
CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
is called after ProcArrayEndTransaction().




Anyway, I think that [1] is great work and I really hope to find more time to
investigate it more deeply later this year.


I'm sure your work is also great! I hope we can discuss the design
of the 2PC feature together!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Global snapshots

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the postgres_fdw 
and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these 
errors handling issues above and tries to add proper comments 
everywhere. I think, that 0003 should be rebased on the top of it, or 
it could be a first patch in the set, since it may be used 
independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any 
thoughts

about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look at Sawada-san's patch set I think that there are
two major differences:


1. There is a built-in foreign xacts resolver in the [1], which should
be much more convenient from the end-user perspective. It involves huge
in-core changes and additional complexity that is, of course, worth it.


However, it's still not clear to me that it is possible to resolve all
foreign prepared xacts on Postgres' own side with a 100% guarantee.
Imagine a situation when the coordinator node is actually an HA cluster
group (primary + sync + async replica) and it failed just after the PREPARE
stage or after local COMMIT. In that case all foreign xacts will be left
in the prepared state. After the failover completes, the synchronous
replica will become the new primary. Would it have all required info to
properly resolve orphan prepared xacts?


Probably, this situation is handled properly in the [1], but I've not 
yet finished a thorough reading of the patch set, though it has a great 
doc!


On the other hand, previous 0003 and my proposed patch rely on either 
manual resolution of hung prepared xacts or usage of external 
monitor/resolver. This approach is much simpler from the in-core 
perspective, but doesn't look as complete as [1] though.


2. In the patch from this thread all 2PC logic sits in postgres_fdw,
while [1] tries to put it into the generic FDW core, which also feels
like a more general and architecturally correct way. However, how many
of the dozens of currently available FDWs are capable of performing
2PC? And how many of them are maintained well enough to adopt this new
API? This is not an argument against [1] actually, since postgres_fdw
is known to be the most advanced FDW and an early adopter of new
features; it's just a little doubt about the usefulness of this
preliminary generalisation.


Anyway, I think that [1] is great work and I really hope to find more
time to investigate it more deeply later this year.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-07 Thread Fujii Masao




On 2020/09/05 3:31, Alexey Kondratov wrote:

Hi,

On 2020-07-27 09:44, Andrey V. Lepikhov wrote:

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?



Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.


I had a look at the patch set, although it is quite outdated, especially
0003.

Two thoughts about 0003:

First, IIUC atomicity of the distributed transaction in the postgres_fdw is 
achieved by the usage of 2PC. I think that this postgres_fdw 2PC support should 
be separated from global snapshots.


Agreed.



It could be useful to have such atomic distributed transactions even without
proper visibility, which is guaranteed by the global snapshot. Especially
taking into account the doubts about Clock-SI and general questions about 
algorithm choosing criteria above in the thread.

Thus, I propose to split 0003 into two parts and add a separate GUC 
'postgres_fdw.use_twophase', which could be turned on independently from 
'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, then 
2PC should be forcibly turned on as well.

Second, there are some problems with errors handling in the 0003 (thanks to 
Arseny Sher for review).

+error:
+    if (!res)
+    {
+        sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
+        BroadcastCmd(sql);
+        elog(ERROR, "Failed to PREPARE transaction on remote node");
+    }

It seems that we should never reach this point, just because BroadcastStmt will 
throw an ERROR if it fails to prepare transaction on the foreign server:

+    if (PQresultStatus(result) != expectedStatus ||
+        (handler && !handler(result, arg)))
+    {
+        elog(WARNING, "Failed command %s: status=%d, expected status=%d",
+             sql, PQresultStatus(result), expectedStatus);
+        pgfdw_report_error(ERROR, result, entry->conn, true, sql);
+        allOk = false;
+    }

Moreover, it doesn't make much sense to try to abort prepared xacts: if we
failed to prepare a transaction somewhere, some foreign servers may already be
unavailable, so this doesn't give us a 100% guarantee of cleanup.

+    /* COMMIT open transaction if we were doing 2PC */
+    if (fdwTransState->two_phase_commit &&
+        (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
+    {
+        BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
+    }

At this point, the host (local) transaction is already committed and there is
no way to abort it gracefully. However, BroadcastCmd may raise an ERROR that
will cause a PANIC, since this is a non-recoverable state:

PANIC:  cannot abort transaction 487, it was already committed

Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds 
a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues 
above and tries to add proper comments everywhere. I think, that 0003 should be 
rebased on the top of it, or it could be a first patch in the set, since it may 
be used independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

Regards,

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Global snapshots

2020-09-04 Thread Alexey Kondratov

Hi,

On 2020-07-27 09:44, Andrey V. Lepikhov wrote:

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:


US8356007B2 - Distributed transaction management for database systems 
with multiversioning - Google Patents

https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?



Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.


I had a look at the patch set, although it is quite outdated, especially
0003.


Two thoughts about 0003:

First, IIUC atomicity of the distributed transaction in the postgres_fdw 
is achieved by the usage of 2PC. I think that this postgres_fdw 2PC 
support should be separated from global snapshots. It could be useful to 
have such atomic distributed transactions even without proper
visibility, which is guaranteed by the global snapshot. Especially 
taking into account the doubts about Clock-SI and general questions 
about algorithm choosing criteria above in the thread.


Thus, I propose to split 0003 into two parts and add a separate GUC 
'postgres_fdw.use_twophase', which could be turned on independently from 
'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, 
then 2PC should be forcibly turned on as well.
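
A minimal sketch of how this GUC pairing could be registered in
postgres_fdw's _PG_init() (DefineCustomBoolVariable() is the real GUC API;
the variable names follow the proposal, and the "global snapshots imply 2PC"
rule is expressed as a simple OR in the commit path):

#include "postgres.h"
#include "utils/guc.h"

static bool use_twophase = false;
static bool use_global_snapshots = false;

void
_PG_init(void)
{
    /* PG_MODULE_MAGIC elided; postgres_fdw already declares it. */
    DefineCustomBoolVariable("postgres_fdw.use_twophase",
                             "Use two-phase commit for transactions involving several servers.",
                             NULL,
                             &use_twophase,
                             false,
                             PGC_USERSET,
                             0,
                             NULL, NULL, NULL);

    DefineCustomBoolVariable("postgres_fdw.use_global_snapshots",
                             "Use global snapshots for distributed transactions.",
                             NULL,
                             &use_global_snapshots,
                             false,
                             PGC_USERSET,
                             0,
                             NULL, NULL, NULL);
}

/* In the commit path: enabling global snapshots forces 2PC as well. */
#define UseTwoPhase()  (use_twophase || use_global_snapshots)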


Second, there are some problems with errors handling in the 0003 (thanks 
to Arseny Sher for review).


+error:
+   if (!res)
+   {
+       sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
+       BroadcastCmd(sql);
+       elog(ERROR, "Failed to PREPARE transaction on remote node");
+   }

It seems that we should never reach this point, just because 
BroadcastStmt will throw an ERROR if it fails to prepare transaction on 
the foreign server:


+   if (PQresultStatus(result) != expectedStatus ||
+       (handler && !handler(result, arg)))
+   {
+       elog(WARNING, "Failed command %s: status=%d, expected status=%d",
+            sql, PQresultStatus(result), expectedStatus);
+       pgfdw_report_error(ERROR, result, entry->conn, true, sql);
+       allOk = false;
+   }

Moreover, it doesn't make much sense to try to abort prepared xacts: if
we failed to prepare a transaction somewhere, some foreign servers may
already be unavailable, so this doesn't give us a 100% guarantee of
cleanup.


+   /* COMMIT open transaction if we were doing 2PC */
+   if (fdwTransState->two_phase_commit &&
+       (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
+   {
+       BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
+   }

At this point, the host (local) transaction is already committed and
there is no way to abort it gracefully. However, BroadcastCmd may raise
an ERROR that will cause a PANIC, since this is a non-recoverable state:


PANIC:  cannot abort transaction 487, it was already committed

Attached is a patch, which implements a plain 2PC in the postgres_fdw 
and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these errors 
handling issues above and tries to add proper comments everywhere. I 
think, that 0003 should be rebased on the top of it, or it could be a 
first patch in the set, since it may be used independently. What do you 
think?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From debdffade7abcdbf29031bda6c8359a89776ad36 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Fri, 7 Aug 2020 16:50:57 +0300
Subject: [PATCH] Add postgres_fdw.use_twophase GUC to use 2PC for transactions
 involving several servers.

---
 contrib/postgres_fdw/connection.c   | 234 +---
 contrib/postgres_fdw/postgres_fdw.c |  17 ++
 contrib/postgres_fdw/postgres_fdw.h |   2 +
 3 files changed, 228 insertions(+), 25 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf0..d18fdd1f94e 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -66,6 +66,20 @@ typedef struct ConnCacheEntry
  */
 static HTAB *ConnectionHash = NULL;
 
+/*
+ * FdwTransactionState
+ *
+ * Holds number of open remote transactions and shared state
+ * needed for all connection entries.
+ */
+typedef struct FdwTransactionState
+{
+	char	   *gid;
+	int			nparticipants;
+	bool		two_phase_commit;
+} FdwTransactionState;
+static FdwTransactionState *fdwTransState;
+
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
@@ -73,6 +87,9 @@ static unsigned int prep_stmt_number = 0;
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
+/*

Re: Global snapshots

2020-07-26 Thread Andrey V. Lepikhov

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:

Hi Andrey san, Movead san,


From: tsunakawa.ta...@fujitsu.com 

While Clock-SI seems to be considered the most promising for global
serializability here,

* Why does Clock-SI get so much attention?  How did Clock-SI become the
only choice?

* Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
organization use Clock-SI?


Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.  Microsoft
holds this until 2031.  I couldn't find this with the keyword "Clock-SI."


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?


Regards
Takayuki Tsunakawa




Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.

--
regards,
Andrey Lepikhov
Postgres Professional




RE: Global snapshots

2020-07-26 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey san, Movead san,


From: tsunakawa.ta...@fujitsu.com 
> While Clock-SI seems to be considered the most promising for global
> serializability here,
> 
> * Why does Clock-SI get so much attention?  How did Clock-SI become the
> only choice?
> 
> * Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
> organization use Clock-SI?

Could you take a look at this patent?  I'm afraid this is the Clock-SI for
MVCC.  Microsoft holds this until 2031.  I couldn't find this with the keyword
"Clock-SI."


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?


Regards
Takayuki Tsunakawa




Re: Global snapshots

2020-07-22 Thread Masahiko Sawada
On Mon, 13 Jul 2020 at 20:18, Amit Kapila  wrote:
>
> On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 8 Jul 2020 at 21:35, Amit Kapila  wrote:
> > >
> > >
> > > Cool. While studying, if you can try to think about whether this approach is
> > > different from the global coordinator based approach, that would be
> > > great.  Here is my initial thought: apart from other reasons, the global
> > > coordinator based design can help us to do the global transaction
> > > management and snapshots.  It can allocate xids for each transaction
> > > and then collect the list of running xacts (or CSN) from each node and
> > > then prepare a global snapshot that can be used to perform any
> > > transaction. OTOH, in the design proposed in this patch, we don't need any
> > > coordinator to manage transactions and snapshots because each node's
> > > current CSN will be sufficient for snapshot and visibility as
> > > explained above.
> >
> > Yeah, my thought is the same as you. Since both approaches have strong
> > points and weak points I cannot mention which is a better approach,
> > but that 2PC patch would go well together with the design proposed in
> > this patch.
> >
>
> I also think with some modifications we might be able to integrate
> your 2PC patch with the patches proposed here.  However, if we decide
> not to pursue this approach then it is uncertain whether your proposed
> patch can be further enhanced for global visibility.

Yes. I think even if we decide not to pursue this approach, that's no
reason not to pursue the 2PC patch. If so, we would need to reconsider
the design of the 2PC patch so that it generically resolves the
atomic commit problem.

> Does it make
> sense to dig into the design of this approach a bit further so that we can
> be somewhat more sure that pursuing your 2PC patch would be a good
> idea and we can, in fact, enhance it later for global visibility?

Agreed.

> AFAICS, Andrey has mentioned a couple of problems with this approach
> [1], the details of which I am also not sure of at this stage, but if we
> can dig into those it would be really great.
>
> > > Now, sure this assumes that there is no clock skew
> > > on different nodes or somehow we take care of the same (Note that in
> > > the proposed patch the CSN is a timestamp.).
> >
> As far as I read the Clock-SI paper, we take care of the clock skew by
> > putting some waits on the transaction start and reading tuples on the
> > remote node.
> >
>
> Oh, but I am not sure if this patch is able to solve that, and if so, how?

I'm not sure about the details but, as far as I read the patch, I guess the
transaction will sleep at GlobalSnapshotSync() when the received
global csn is greater than the local global csn.
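
If that reading is right, the Clock-SI-style wait could look roughly like the
sketch below (GlobalSnapshotSync() is the patch's function; this sketch is
only my guess at its gist — SnapshotCSN and GetCurrentCSN() are hypothetical,
pg_usleep() is the real primitive):

typedef uint64 SnapshotCSN;     /* assumption: the CSN is a 64-bit timestamp */

/*
 * Sketch: block until the local CSN clock catches up with a snapshot CSN
 * imported from a remote node, so a local snapshot can never "see into the
 * future" relative to the importing transaction (Clock-SI's remote wait).
 */
static void
WaitForCSN(SnapshotCSN remote_csn)
{
    while (GetCurrentCSN() < remote_csn)   /* hypothetical clock accessor */
        pg_usleep(1000);                   /* 1 ms; a real backoff would be smarter */
}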

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: Global snapshots

2020-07-22 Thread tsunakawa.ta...@fujitsu.com
Hello,

While thinking about the following issues that Andrey raised with the current
approach, I got puzzled and can't help asking a few things.  Please
forgive me if I'm missing some discussions in the past.

> 1. Dependency on clock synchronization
> 2. Need to guarantee a monotonically increasing CSN in the case
> of an instance restart/crash, etc.
> 3. We need to delay increasing OldestXmin because it can be needed
> for a transaction snapshot at another node.

While Clock-SI seems to be considered the most promising for global
serializability here,

* Why does Clock-SI get so much attention?  How did Clock-SI become the only
choice?

* Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
organization use Clock-SI?


Has anyone examined the following Multiversion Commitment Ordering (MVCO)?
Although I haven't fully understood it yet, it claims that no concurrency
control information, including timestamps, needs to be exchanged among the
cluster nodes.  I'd appreciate it if someone could give an opinion.

Commitment Ordering Based Distributed Concurrency Control for Bridging Single 
and Multi Version Resources.
 Proceedings of the Third IEEE International Workshop on Research Issues on 
Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, 
Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924


The author of the above paper, Yoav Raz, seems to have had a strong passion,
at least until 2011, for convincing people of the power of Commitment
Ordering (CO) for global serializability.  However, he complains (sadly) that
almost all researchers ignore his theory, as written on his site below and the
Wikipedia page for Commitment Ordering.  Does anyone know why CO is ignored?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co


FWIW, some researchers including Michael Stonebraker evaluated the performance
of various distributed concurrency control methods in 2017.  Has anyone looked
at this?  (I don't mean there was some promising method that we might want to
adopt.)

An Evaluation of Distributed Concurrency Control
Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017.
Proc. VLDB Endow. 10, 5 (January 2017), 553-564. 
https://doi.org/10.14778/3055540.3055548


Regards
Takayuki Tsunakawa



Re: Global snapshots

2020-07-13 Thread Amit Kapila
On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
 wrote:
>
> On Wed, 8 Jul 2020 at 21:35, Amit Kapila  wrote:
> >
> >
> > Cool. While studying, if you can try to think about whether this approach is
> > different from the global coordinator based approach, that would be
> > great.  Here is my initial thought: apart from other reasons, the global
> > coordinator based design can help us to do the global transaction
> > management and snapshots.  It can allocate xids for each transaction
> > and then collect the list of running xacts (or CSN) from each node and
> > then prepare a global snapshot that can be used to perform any
> > transaction. OTOH, in the design proposed in this patch, we don't need any
> > coordinator to manage transactions and snapshots because each node's
> > current CSN will be sufficient for snapshot and visibility as
> > explained above.
>
> Yeah, my thought is the same as you. Since both approaches have strong
> points and weak points I cannot mention which is a better approach,
> but that 2PC patch would go well together with the design proposed in
> this patch.
>

I also think with some modifications we might be able to integrate
your 2PC patch with the patches proposed here.  However, if we decide
not to pursue this approach then it is uncertain whether your proposed
patch can be further enhanced for global visibility.  Does it make
sense to dig into the design of this approach a bit further so that we can
be somewhat more sure that pursuing your 2PC patch would be a good
idea and we can, in fact, enhance it later for global visibility?
AFAICS, Andrey has mentioned a couple of problems with this approach
[1], the details of which I am also not sure of at this stage, but if we
can dig into those it would be really great.

> > Now, sure this assumes that there is no clock skew
> > on different nodes or somehow we take care of the same (Note that in
> > the proposed patch the CSN is a timestamp.).
>
> As far as I read the Clock-SI paper, we take care of the clock skew by
> putting some waits on the transaction start and reading tuples on the
> remote node.
>

Oh, but I am not sure if this patch is able to solve that, and if so, how?

> >
> > I think InDoubt status helps in checking visibility in the proposed
> > patch wherein if we find the status of the transaction as InDoubt, we
> > wait till we get some valid CSN for it as explained in my previous
> > email.  So whether we use it for Rollback/Rollback Prepared, it is
> > required for this design.
>
> Yes, InDoubt status is required for checking visibility. My comment
> was it's not necessary from the perspective of atomic commit.
>

True and probably we can enhance your patch for InDoubt status if required.

Thanks for moving this work forward.  I know the progress is a bit
slow due to various reasons but I think it is important to keep making
some progress.

[1] - 
https://www.postgresql.org/message-id/f23083b9-38d0-6126-eb6e-091816a78585%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-07-09 Thread Masahiko Sawada
On Wed, 8 Jul 2020 at 21:35, Amit Kapila  wrote:
>
> On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
>  wrote:
> >
> > On Tue, 7 Jul 2020 at 15:40, Amit Kapila  wrote:
> > >
> > >
> > > Okay, but isn't there some advantage with this approach (manage 2PC at
> > > postgres_fdw level) as well which is that any node will be capable of
> > > handling global transactions rather than doing them via central
> > > coordinator?  I mean any node can do writes or reads rather than
> > > probably routing them (at least writes) via coordinator node.
> >
> > The postgres server where the client started the transaction works as
> > the coordinator node. I think this is true for both this patch and
> > that 2PC patch. From the perspective of atomic commit, any node will
> > be capable of handling global transactions in both approaches.
> >
>
> Okay, but then probably we need to ensure that the GID is unique
> even if it gets generated on different nodes?  I don't know if that
> is ensured.

Yes, if you mean the GID is the global transaction id specified to PREPARE
TRANSACTION, it has to be unique. In that 2PC patch, the GID is generated
in the form 'fx___'. I believe it
can ensure uniqueness in most cases. In addition, there is an FDW API to
generate an arbitrary identifier.
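
Just to make the uniqueness argument concrete, here is a hypothetical,
runnable GID builder (the patch's exact format is elided above; the prefix
and every component below are assumptions, not the patch's definition):

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical GID scheme: prefix, random value, server identifier and
 * local xid.  A collision between two coordinators then requires both the
 * random value and the identifiers to coincide.
 */
static void
build_gid(char *buf, size_t len, uint32_t rand_val, uint32_t server_id, uint32_t xid)
{
    snprintf(buf, len, "fx_%u_%u_%u", rand_val, server_id, xid);
}

int
main(void)
{
    char gid[64];

    build_gid(gid, sizeof(gid), 123456789u, 16384u, 543u);
    printf("%s\n", gid);    /* prints: fx_123456789_16384_543 */
    return 0;
}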

>
> > >  Now, I
> > > agree that even if this advantage is there in the current approach, we
> > > can't lose the crash-safety aspect of other approach.  Will you be
> > > able to summarize what was the problem w.r.t crash-safety and how your
> > > patch has dealt with it?
> >
> > Since this patch performs 2PC without any logging, foreign
> > transactions prepared on foreign servers are left over without any
> > clues if the coordinator crashes during commit. Therefore, after
> > restart, the user will need to find and resolve in-doubt foreign
> > transactions manually.
> >
>
> Okay, but is it because we can't directly WAL log in postgres_fdw or
> there is some other reason for not doing so?

Yes, I think it is because we cannot WAL log in postgres_fdw. Maybe I
missed the point in your question. Please correct me if I missed
something.

>
> >
> > >
> > > > Looking at the commit procedure with this patch:
> > > >
> > > > When starting a new transaction on a foreign server, postgres_fdw
> > > > executes pg_global_snapshot_import() to import the global snapshot.
> > > > After some work, in pre-commit phase we do:
> > > >
> > > > 1. generate global transaction id, say 'gid'
> > > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > > 3. prepare global snapshot locally, if the local node also involves
> > > > the transaction
> > > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > > >
> > > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > > returned from each pg_global_snapshot_prepare() executions.
> > > >
> > > > 5. assign global snapshot locally, if the local node also involves the
> > > > transaction
> > > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all 
> > > > participants.
> > > >
> > > > Then, we commit locally (i.g. mark the current transaction as
> > > > committed in clog).
> > > >
> > > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > > participants.
> > > >
> > >
> > > As per my current understanding, the overall idea is as follows.  For
> > > global transactions, pg_global_snapshot_prepare('gid') will set the
> > > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > > at the node where that function is executed, it also returns the
> > > NodeCSN to the coordinator.  Then the coordinator (the current
> > > postgres_fdw node on which write transaction is being executed)
> > > computes MaxCSN based on the return value (NodeCSN) of prepare
> > > (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> > > to each node.  Finally, when Commit Prepared is issued for each node
> > > that MaxCSN will be written to each node including the current node.
> > > So, with this idea, each node will have the same view of CSN value
> > > corresponding to any particular transaction.
> > >
> > > For Snapshot management, the node which receives the query generates a
> > > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > > xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> > > possible that when we are examining a tuple, the CSN corresponding to
> > > xid that has written the tuple has a value as INDOUBT which will
> > > indicate that the transaction is yet not committed on all nodes.  And
> > > we wait till we get the valid CSN value corresponding to xid and then
> > > use it to check if the tuple is visible.
> > >
> > > Now, one thing to note here is that for global transactions we
> > > primarily rely on CSN value corresponding to a transaction for its
> > > visibility even though we still maintain CLOG for local transaction
> > > status.
> > >
> > > Leaving aside the incomplete parts and/or flaws of the

Re: Global snapshots

2020-07-08 Thread Amit Kapila
On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
 wrote:
>
> On Tue, 7 Jul 2020 at 15:40, Amit Kapila  wrote:
> >
> >
> > Okay, but isn't there some advantage with this approach (manage 2PC at
> > postgres_fdw level) as well which is that any node will be capable of
> > handling global transactions rather than doing them via central
> > coordinator?  I mean any node can do writes or reads rather than
> > probably routing them (at least writes) via coordinator node.
>
> The postgres server where the client started the transaction works as
> the coordinator node. I think this is true for both this patch and
> that 2PC patch. From the perspective of atomic commit, any node will
> be capable of handling global transactions in both approaches.
>

Okay, but then probably we need to ensure that the GID is unique
even if it gets generated on different nodes?  I don't know if that
is ensured.

> >  Now, I
> > agree that even if this advantage is there in the current approach, we
> > can't lose the crash-safety aspect of other approach.  Will you be
> > able to summarize what was the problem w.r.t crash-safety and how your
> > patch has dealt with it?
>
> Since this patch performs 2PC without any logging, foreign
> transactions prepared on foreign servers are left over without any
> clues if the coordinator crashes during commit. Therefore, after
> restart, the user will need to find and resolve in-doubt foreign
> transactions manually.
>

Okay, but is it because we can't directly WAL log in postgres_fdw or
there is some other reason for not doing so?

>
> >
> > > Looking at the commit procedure with this patch:
> > >
> > > When starting a new transaction on a foreign server, postgres_fdw
> > > executes pg_global_snapshot_import() to import the global snapshot.
> > > After some work, in pre-commit phase we do:
> > >
> > > 1. generate global transaction id, say 'gid'
> > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > 3. prepare global snapshot locally, if the local node also involves
> > > the transaction
> > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > >
> > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > returned from each pg_global_snapshot_prepare() executions.
> > >
> > > 5. assign global snapshot locally, if the local node also involves the
> > > transaction
> > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> > >
> > > Then, we commit locally (i.g. mark the current transaction as
> > > committed in clog).
> > >
> > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > participants.
> > >
> >
> > As per my current understanding, the overall idea is as follows.  For
> > global transactions, pg_global_snapshot_prepare('gid') will set the
> > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > at the node where that function is executed, it also returns the
> > NodeCSN to the coordinator.  Then the coordinator (the current
> > postgres_fdw node on which write transaction is being executed)
> > computes MaxCSN based on the return value (NodeCSN) of prepare
> > (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> > to each node.  Finally, when Commit Prepared is issued for each node
> > that MaxCSN will be written to each node including the current node.
> > So, with this idea, each node will have the same view of CSN value
> > corresponding to any particular transaction.
> >
> > For Snapshot management, the node which receives the query generates a
> > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> > possible that when we are examining a tuple, the CSN corresponding to
> > xid that has written the tuple has a value as INDOUBT which will
> > indicate that the transaction is yet not committed on all nodes.  And
> > we wait till we get the valid CSN value corresponding to xid and then
> > use it to check if the tuple is visible.
> >
> > Now, one thing to note here is that for global transactions we
> > primarily rely on CSN value corresponding to a transaction for its
> > visibility even though we still maintain CLOG for local transaction
> > status.
> >
> > Leaving aside the incomplete parts and/or flaws of the current patch,
> > does the above match the top-level idea of this patch?
>
> I'm still studying this patch but your understanding seems right to me.
>

Cool. While studying, if you can try to think about whether this approach is
different from the global coordinator based approach, that would be
great.  Here is my initial thought: apart from other reasons, the global
coordinator based design can help us to do the global transaction
management and snapshots.  It can allocate xids for each transaction
and then collect the list of running xacts (or CSN) from each node and
then prepare a global snapshot that can be used to perform any
transaction.

OTOH, in the design

Re: Global snapshots

2020-07-07 Thread Masahiko Sawada
On Tue, 7 Jul 2020 at 15:40, Amit Kapila  wrote:
>
> On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
>  wrote:
> >
> > On Sat, 20 Jun 2020 at 21:21, Amit Kapila  wrote:
> > >
> > > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> > >  wrote:
> > >
> > > >Also, can you let us know if this
> > > > > supports 2PC in some way and if so how is it different from what the
> > > > > other thread on the same topic [1] is trying to achieve?
> > > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > > 2PC machinery. Now I'd not judge which approach is better.
> > > >
> > >
> >
> > Sorry for being late.
> >
>
> No problem; your summarization and comparison of both approaches are
> quite helpful.
>
> >
> > I studied this patch and did a simple comparison between this patch
> > (0002 patch) and my 2PC patch.
> >
> > In terms of atomic commit, the features that are not implemented in
> > this patch but are in the 2PC patch are:
> >
> > * Crash safety.
> > * PREPARE TRANSACTION command support.
> > * Query cancellation while waiting for the commit.
> > * Automatic in-doubt transaction resolution.
> >
> > On the other hand, the feature that is implemented in this patch but
> > not in the 2PC patch is:
> >
> > * Executing PREPARE TRANSACTION (and other commands) in parallel
> >
> > When the 2PC patch was proposed, IIRC it was like this patch (0002
> > patch). I mean, it changed only postgres_fdw to support 2PC. But after
> > discussion, we changed the approach to have the core manage foreign
> > transactions for crash safety. From my perspective, this patch has a
> > minimal implementation of 2PC to make the global snapshot feature work
> > and is missing some features important for supporting crash-safe
> > atomic commit. So I personally think we should consider how to
> > integrate this global snapshot feature with the 2PC patch, rather than
> > improving this patch, if we want crash-safe atomic commit.
> >
>
> Okay, but isn't there some advantage with this approach (managing 2PC
> at the postgres_fdw level) as well, namely that any node will be
> capable of handling global transactions rather than doing them via a
> central coordinator?  I mean, any node can do writes or reads rather
> than routing them (at least writes) via a coordinator node.

The postgres server where the client started the transaction works as
the coordinator node. I think this is true for both this patch and
that 2PC patch. From the perspective of atomic commit, any node will
be capable of handling global transactions in both approaches.

>  Now, I
> agree that even if this advantage is there in the current approach, we
> can't lose the crash-safety aspect of the other approach.  Will you be
> able to summarize what the problem w.r.t. crash safety was and how your
> patch has dealt with it?

Since this patch performs 2PC without any logging, foreign
transactions prepared on foreign servers are left behind without any
clues if the coordinator crashes during commit. Therefore, after
restart, the user will need to find and resolve in-doubt foreign
transactions manually.

In that 2PC patch, the information about foreign transactions is
WAL-logged before PREPARE TRANSACTION. So even if the coordinator
crashes after preparing some foreign transactions, the prepared
foreign transactions are recovered during crash recovery, and then the
transaction resolver resolves them automatically, or the user can
resolve them. The user doesn't need to check the other participant
nodes to resolve in-doubt foreign transactions. Also, since the
foreign transaction information is replicated to physical standbys,
the new master can take over resolving in-doubt transactions.
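
In other words, the crash safety comes from the ordering. A minimal
sketch of it, with hypothetical helper names (log_fdwxact_entry()
standing in for the patch's WAL logging; none of these functions exist
under these names):

#include <stdbool.h>

typedef struct ForeignServer ForeignServer;	/* opaque in this sketch */

/* Hypothetical helpers standing in for the 2PC patch's machinery. */
extern void log_fdwxact_entry(const char *gid, ForeignServer *srv); /* WAL-log + flush */
extern bool send_prepare(ForeignServer *srv, const char *gid);

static bool
precommit_foreign(ForeignServer **servers, int n, const char *gid)
{
	int			i;

	for (i = 0; i < n; i++)
	{
		/*
		 * Durable record first: if we crash right after this point, the
		 * entry recovered from WAL tells the resolver that a prepared
		 * transaction with this gid may exist on this server and must
		 * be committed or rolled back.
		 */
		log_fdwxact_entry(gid, servers[i]);

		if (!send_prepare(servers[i], gid))
			return false;	/* abort; logged entries still get resolved */
	}
	return true;
}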

>
> > Looking at the commit procedure with this patch:
> >
> > When starting a new transaction on a foreign server, postgres_fdw
> > executes pg_global_snapshot_import() to import the global snapshot.
> > After some work, in pre-commit phase we do:
> >
> > 1. generate global transaction id, say 'gid'
> > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > 3. prepare global snapshot locally, if the local node also involves
> > the transaction
> > 4. execute pg_global_snapshot_prepare('gid') for all participants
> >
> > During steps 2 to 4, we calculate the maximum CSN from the CSNs
> > returned by each pg_global_snapshot_prepare() execution.
> >
> > 5. assign global snapshot locally, if the local node also involves the
> > transaction
> > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> >
> > Then, we commit locally (i.e. mark the current transaction as
> > committed in clog).
> >
> > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > participants.
> >
>
> As per my current understanding, the overall idea is as follows.  For
> global transactions, pg_global_snapshot_prepare('gid') will set the
> transaction status to InDoubt and generate a CSN (let's call it
> NodeCSN) at the node where that function is executed; it also returns
> the Nod

Re: Global snapshots

2020-07-06 Thread Amit Kapila
On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
 wrote:
>
> On Sat, 20 Jun 2020 at 21:21, Amit Kapila  wrote:
> >
> > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> >  wrote:
> >
> > >Also, can you let us know if this
> > > > supports 2PC in some way and if so how is it different from what the
> > > > other thread on the same topic [1] is trying to achieve?
> > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > 2PC machinery. Now I'd not judge which approach is better.
> > >
> >
>
> Sorry for being late.
>

No problem; your summarization and comparison of both approaches are
quite helpful.

>
> I studied this patch and did a simple comparison between this patch
> (0002 patch) and my 2PC patch.
>
> In terms of atomic commit, the features that are not implemented in
> this patch but are in the 2PC patch are:
>
> * Crash safety.
> * PREPARE TRANSACTION command support.
> * Query cancellation while waiting for the commit.
> * Automatic in-doubt transaction resolution.
>
> On the other hand, the feature that is implemented in this patch but
> not in the 2PC patch is:
>
> * Executing PREPARE TRANSACTION (and other commands) in parallel
>
> When the 2PC patch was proposed, IIRC it was like this patch (0002
> patch). I mean, it changed only postgres_fdw to support 2PC. But after
> discussion, we changed the approach to have the core manage foreign
> transactions for crash safety. From my perspective, this patch has a
> minimal implementation of 2PC to make the global snapshot feature work
> and is missing some features important for supporting crash-safe
> atomic commit. So I personally think we should consider how to
> integrate this global snapshot feature with the 2PC patch, rather than
> improving this patch, if we want crash-safe atomic commit.
>

Okay, but isn't there some advantage with this approach (managing 2PC
at the postgres_fdw level) as well, namely that any node will be
capable of handling global transactions rather than doing them via a
central coordinator?  I mean, any node can do writes or reads rather
than routing them (at least writes) via a coordinator node.  Now, I
agree that even if this advantage is there in the current approach, we
can't lose the crash-safety aspect of the other approach.  Will you be
able to summarize what the problem w.r.t. crash safety was and how your
patch has dealt with it?

> Looking at the commit procedure with this patch:
>
> When starting a new transaction on a foreign server, postgres_fdw
> executes pg_global_snapshot_import() to import the global snapshot.
> After some work, in pre-commit phase we do:
>
> 1. generate global transaction id, say 'gid'
> 2. execute PREPARE TRANSACTION 'gid' on all participants.
> 3. prepare global snapshot locally, if the local node also involves
> the transaction
> 4. execute pg_global_snapshot_prepare('gid') for all participants
>
> During steps 2 to 4, we calculate the maximum CSN from the CSNs
> returned by each pg_global_snapshot_prepare() execution.
>
> 5. assign global snapshot locally, if the local node also involves the
> transaction
> 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
>
> Then, we commit locally (i.e. mark the current transaction as
> committed in clog).
>
> After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> participants.
>

As per my current understanding, the overall idea is as follows.  For
global transactions, pg_global_snapshot_prepare('gid') will set the
transaction status to InDoubt and generate a CSN (let's call it
NodeCSN) at the node where that function is executed; it also returns
the NodeCSN to the coordinator.  Then the coordinator (the current
postgres_fdw node on which the write transaction is being executed)
computes MaxCSN based on the return value (NodeCSN) of prepare
(pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
to each node.  Finally, when COMMIT PREPARED is issued for each node,
that MaxCSN will be written to each node, including the current node.
So, with this idea, each node will have the same view of the CSN value
corresponding to any particular transaction.
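
To make that coordinator loop concrete, here is a minimal sketch over
plain libpq (an illustration, not code from the patch): connections to
all participants are assumed to be open and in-transaction, the
local-node steps and all error/abort handling are omitted, and only the
two pg_global_snapshot_* functions come from the patch itself.

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

typedef uint64_t GlobalCSN;		/* assumed 64-bit here */

/* Run a query and return a copy of the first field of the first row. */
static char *
run(PGconn *conn, const char *sql)
{
	PGresult   *res = PQexec(conn, sql);
	char	   *val = NULL;

	if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
		val = strdup(PQgetvalue(res, 0, 0));
	PQclear(res);
	return val;
}

static void
global_commit(PGconn **participants, int n, const char *gid)
{
	char		sql[256];
	GlobalCSN	max_csn = 0;
	int			i;

	/* Step 2: PREPARE TRANSACTION 'gid' on every participant. */
	for (i = 0; i < n; i++)
	{
		snprintf(sql, sizeof(sql), "PREPARE TRANSACTION '%s'", gid);
		PQclear(PQexec(participants[i], sql));
	}

	/*
	 * Step 4: snapshot_prepare marks the xact InDoubt and returns the
	 * node-local CSN; the coordinator keeps the maximum (MaxCSN).
	 */
	for (i = 0; i < n; i++)
	{
		char	   *csn;

		snprintf(sql, sizeof(sql),
				 "SELECT pg_global_snapshot_prepare('%s')", gid);
		if ((csn = run(participants[i], sql)) != NULL)
		{
			GlobalCSN	node_csn = strtoull(csn, NULL, 10);

			if (node_csn > max_csn)
				max_csn = node_csn;
			free(csn);
		}
	}

	/* Step 6: stamp every participant with the agreed MaxCSN. */
	for (i = 0; i < n; i++)
	{
		snprintf(sql, sizeof(sql),
				 "SELECT pg_global_snapshot_assign('%s', %" PRIu64 ")",
				 gid, max_csn);
		PQclear(PQexec(participants[i], sql));
	}

	/* Post-commit phase: make the prepared transactions visible. */
	for (i = 0; i < n; i++)
	{
		snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
		PQclear(PQexec(participants[i], sql));
	}
}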

For snapshot management, the node which receives the query generates a
CSN (CurrentCSN) and follows the simple rule that a tuple having an
xid with a CSN less than CurrentCSN is visible.  Now, it is possible
that when we are examining a tuple, the CSN corresponding to the xid
that wrote the tuple is INDOUBT, which indicates that the transaction
is not yet committed on all nodes.  In that case we wait until we get
the valid CSN value corresponding to the xid and then use it to check
if the tuple is visible.
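
As a toy illustration of this rule only (not the patch's actual code),
with a flat array and invented sentinels standing in for the patch's
SLRU-backed CSN log:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

typedef uint32_t TransactionId;
typedef uint64_t GlobalCSN;

#define ABORTED_CSN ((GlobalCSN) 0)	/* toy sentinels, not the patch's */
#define INDOUBT_CSN ((GlobalCSN) 1)
#define MAX_XID     4096

/* Toy stand-in for the patch's SLRU-backed CSN log. */
static volatile GlobalCSN csn_log[MAX_XID];

static bool
tuple_visible(TransactionId xmin, GlobalCSN snapshot_csn)
{
	GlobalCSN	csn = csn_log[xmin];

	/*
	 * INDOUBT: the writing transaction is somewhere between "prepared"
	 * and "stamped with its final CSN", so wait for its fate to be
	 * decided instead of guessing (a real implementation would wait on
	 * a latch or condition variable rather than poll).
	 */
	while (csn == INDOUBT_CSN)
	{
		usleep(1000);
		csn = csn_log[xmin];
	}

	if (csn == ABORTED_CSN)
		return false;

	/* The rule stated above: stamped before our snapshot was taken. */
	return csn < snapshot_csn;
}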

Now, one thing to note here is that for global transactions we
primarily rely on the CSN value corresponding to a transaction for its
visibility, even though we still maintain CLOG for local transaction
status.

Leaving aside the incomplete parts and/or flaws of the current patch,
does the above match the top-level idea of this patch?  I am not s

Re: Global snapshots

2020-07-02 Thread Masahiko Sawada
On Sat, 20 Jun 2020 at 21:21, Amit Kapila  wrote:
>
> On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
>  wrote:
> >
> > On 6/19/20 11:48 AM, Amit Kapila wrote:
> > > On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> > >  wrote:
> > >> On 09.06.2020 11:41, Fujii Masao wrote:
> > >>> The patches seem not to be registered in CommitFest yet.
> > >>> Are you planning to do that?
> > >> Not now. It is a sharding-related feature. I'm not sure that this
> > >> approach is fully consistent with the current approach to sharding.
> > > Can you please explain in detail why you think so?  There is no
> > > commit message explaining what each patch does, so it is difficult to
> > > understand why you said so.
> > For now I used this patch set to provide correct visibility in the
> > case of access to a table with foreign partitions from many nodes in
> > parallel. So I looked at this patch set as a sharding-related feature,
> > but [1] shows another useful application.
> > The CSN-based approach has weak points such as:
> > 1. Dependency on clock synchronization
> > 2. Needs guarantees of monotonic increase of the CSN in the case
> > of an instance restart/crash etc.
> > 3. We need to delay increasing OldestXmin because it can be needed
> > for a transaction snapshot at another node.
> >
>
> So, is anyone working on improving these parts of the patch?  AFAICS
> from what Bruce has shared [1], some people from HighGo are working on
> it, but I don't see any discussion of that yet.
>
> > So I do not have full conviction that it will be better than a single
> > distributed transaction manager.
> >
>
> When you say "single distributed transaction manager", do you mean
> something like pg_dtm, which is inspired by Postgres-XL?
>
> >Also, can you let us know if this
> > > supports 2PC in some way and if so how is it different from what the
> > > other thread on the same topic [1] is trying to achieve?
> > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > 2PC machinery. Now I'd not judge which approach is better.
> >
>

Sorry for being late.

> Yeah, I have studied both approaches a little, and I feel the main
> difference seems to be that in this patch atomicity is tightly coupled
> with how we achieve global visibility; basically, in this patch "all
> running transactions are marked as InDoubt on all nodes in prepare
> phase, and after that, each node commit it and stamps each xid with a
> given GlobalCSN".  There are no separate APIs for
> prepare/commit/rollback exposed by postgres_fdw as we do in the
> approach followed by Sawada-San's patch.  It seems to me that in the
> patch in this email one of the postgres_fdw nodes can be a sort of
> coordinator which prepares and commits the transaction on all other
> nodes, whereas that is not true in Sawada-San's patch (where the
> coordinator is a local Postgres node, am I right Sawada-San?).

Yeah, where to manage foreign transactions is different: postgres_fdw
manages foreign transactions in this patch whereas the PostgreSQL core
does that in that 2PC patch.

>
> I feel that if Sawada-San or someone involved in another patch also
> studies this approach and tries to come up with some form of comparison,
> then we might be able to make a better decision.  It is possible that
> there are a few good things in each approach which we can use.
>

I studied this patch and did a simple comparison between this patch
(0002 patch) and my 2PC patch.

In terms of atomic commit, the features that are not implemented in
this patch but are in the 2PC patch are:

* Crash safety.
* PREPARE TRANSACTION command support.
* Query cancellation while waiting for the commit.
* Automatic in-doubt transaction resolution.

On the other hand, the feature that is implemented in this patch but
not in the 2PC patch is:

* Executing PREPARE TRANSACTION (and other commands) in parallel

When the 2PC patch was proposed, IIRC it was like this patch (0002
patch). I mean, it changed only postgres_fdw to support 2PC. But after
discussion, we changed the approach to have the core manage foreign
transactions for crash safety. From my perspective, this patch has a
minimal implementation of 2PC to make the global snapshot feature work
and is missing some features important for supporting crash-safe
atomic commit. So I personally think we should consider how to
integrate this global snapshot feature with the 2PC patch, rather than
improving this patch, if we want crash-safe atomic commit.

Looking at the commit procedure with this patch:

When starting a new transaction on a foreign server, postgres_fdw
executes pg_global_snapshot_import() to import the global snapshot.
After some work, in pre-commit phase we do:

1. generate global transaction id, say 'gid'
2. execute PREPARE TRANSACTION 'gid' on all participants.
3. prepare global snapshot locally, if the local node also involves
the transaction
4. execute pg_global_snapshot_prepare('gid') for all participants

During steps 2 to 4, we calculate the maximum CSN from 

Re: Global snapshots

2020-06-22 Thread Amit Kapila
On Mon, Jun 22, 2020 at 8:36 PM Bruce Momjian  wrote:
>
> On Sat, Jun 20, 2020 at 05:51:21PM +0530, Amit Kapila wrote:
> > I feel that if Sawada-San or someone involved in another patch also
> > studies this approach and tries to come up with some form of comparison,
> > then we might be able to make a better decision.  It is possible that
> > there are a few good things in each approach which we can use.
>
> Agreed. Postgres-XL code is under the Postgres license:
>
> Postgres-XL is released under the PostgreSQL License, a liberal Open
> Source license, similar to the BSD or MIT licenses.
>
> and even says they want it moved into Postgres core:
>
> https://www.postgres-xl.org/2017/08/postgres-xl-9-5-r1-6-announced/
>
> Postgres-XL is a massively parallel database built on top of,
> and very closely compatible with PostgreSQL 9.5 and its set of 
> advanced
> features. Postgres-XL is fully open source and many parts of it will
> feed back directly or indirectly into later releases of PostgreSQL, as
> we begin to move towards a fully parallel sharded version of core 
> PostgreSQL.
>
> so we should understand what can be used from it.
>

+1.  I think that will be quite useful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-06-22 Thread Bruce Momjian
On Sat, Jun 20, 2020 at 05:51:21PM +0530, Amit Kapila wrote:
> I feel that if Sawada-San or someone involved in another patch also
> studies this approach and tries to come up with some form of comparison,
> then we might be able to make a better decision.  It is possible that
> there are a few good things in each approach which we can use.

Agreed. Postgres-XL code is under the Postgres license:

Postgres-XL is released under the PostgreSQL License, a liberal Open
Source license, similar to the BSD or MIT licenses.

and even says they want it moved into Postgres core:

https://www.postgres-xl.org/2017/08/postgres-xl-9-5-r1-6-announced/

Postgres-XL is a massively parallel database built on top of,
and very closely compatible with PostgreSQL 9.5 and its set of advanced
features. Postgres-XL is fully open source and many parts of it will
feed back directly or indirectly into later releases of PostgreSQL, as
we begin to move towards a fully parallel sharded version of core 
PostgreSQL.

so we should understand what can be used from it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Global snapshots

2020-06-22 Thread Bruce Momjian
On Sat, Jun 20, 2020 at 05:54:18PM +0530, Amit Kapila wrote:
> On Fri, Jun 19, 2020 at 6:33 PM Bruce Momjian  wrote:
> >
> > On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead...@highgo.ca wrote:
> > >
> > > >> would like to know if the patch related to CSN based snapshot [2] is a
> > > >> precursor for this, if not, then is it any way related to this patch
> > > >> because I see the latest reply on that thread [2] which says it is an
> > > >> infrastructure of sharding feature but I don't understand completely
> > > >> whether these patches are related?
> > > >I need some time to study this patch.. At first sight it is different.
> > >
> > This patch [2] is mostly based on [3]; because I think [1] is talking
> > about 2PC and FDW, this patch focuses on CSN only, and I detached the
> > global snapshot part and the FDW part from the [1] patch.
> >
> > I noticed that the CSN will not survive a restart in the [1] patch. I
> > think that may not be the right way; maybe it is what the last mail
> > meant by "needs guarantees of monotonic increase of the CSN in the
> > case of an instance restart/crash etc.", so I tried to add WAL support
> > for CSN in this patch.
> >
> > That's why this thread exists.
> >
> > I was certainly missing how these items fit together.  Sharding needs
> > parallel FDWs, atomic commits, and atomic snapshots.  To get atomic
> > snapshots, we need CSN.  This new sharding wiki page has more details:
> >
> > https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
> >
> 
> Thanks for maintaining this page.  It is quite helpful!

Ahsan Hadi  created that page, and I just made a
few wording edits.  Ahsan is copying information from this older
sharding wiki page:

https://wiki.postgresql.org/wiki/Built-in_Sharding

to the new one you listed above.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Global snapshots

2020-06-20 Thread Amit Kapila
On Sat, Jun 20, 2020 at 5:51 PM Amit Kapila  wrote:
>
>
> So, is anyone working on improving these parts of the patch?  AFAICS
> from what Bruce has shared [1],
>

oops, forgot to share the link [1] -
https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-06-20 Thread Amit Kapila
On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
 wrote:
>
> On 6/19/20 11:48 AM, Amit Kapila wrote:
> > On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> >  wrote:
> >> On 09.06.2020 11:41, Fujii Masao wrote:
> >>> The patches seem not to be registered in CommitFest yet.
> >>> Are you planning to do that?
> >> Not now. It is a sharding-related feature. I'm not sure that this
> >> approach is fully consistent with the current approach to sharding.
> > Can you please explain in detail why you think so?  There is no
> > commit message explaining what each patch does, so it is difficult to
> > understand why you said so.
> For now I used this patch set to provide correct visibility in the
> case of access to a table with foreign partitions from many nodes in
> parallel. So I looked at this patch set as a sharding-related feature,
> but [1] shows another useful application.
> The CSN-based approach has weak points such as:
> 1. Dependency on clock synchronization
> 2. Needs guarantees of monotonic increase of the CSN in the case
> of an instance restart/crash etc.
> 3. We need to delay increasing OldestXmin because it can be needed
> for a transaction snapshot at another node.
>

So, is anyone working on improving these parts of the patch?  AFAICS
from what Bruce has shared [1], some people from HighGo are working on
it, but I don't see any discussion of that yet.

> So I do not have full conviction that it will be better than a single
> distributed transaction manager.
>

When you say "single distributed transaction manager", do you mean
something like pg_dtm, which is inspired by Postgres-XL?

>Also, can you let us know if this
> > supports 2PC in some way and if so how is it different from what the
> > other thread on the same topic [1] is trying to achieve?
> Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> 2PC machinery. Now I'd not judge which approach is better.
>

Yeah, I have studied both approaches a little, and I feel the main
difference seems to be that in this patch atomicity is tightly coupled
with how we achieve global visibility; basically, in this patch "all
running transactions are marked as InDoubt on all nodes in prepare
phase, and after that, each node commit it and stamps each xid with a
given GlobalCSN".  There are no separate APIs for
prepare/commit/rollback exposed by postgres_fdw as we do in the
approach followed by Sawada-San's patch.  It seems to me that in the
patch in this email one of the postgres_fdw nodes can be a sort of
coordinator which prepares and commits the transaction on all other
nodes, whereas that is not true in Sawada-San's patch (where the
coordinator is a local Postgres node, am I right Sawada-San?).  OTOH,
Sawada-San's patch has advanced concepts like a resolver process that
can commit/abort the transactions later.  I still couldn't get a
complete grip on both patches, so it is difficult to say which is the
better approach, but I think at the least we should have some
discussion.
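
For reference, the participant's side of that InDoubt-then-stamp
sequence can be modeled in a few lines. This is a toy, with a flat
array and the wall clock standing in for the patch's SLRU and CSN
machinery; all names here are hypothetical:

#include <stdint.h>
#include <time.h>

typedef uint32_t TransactionId;
typedef uint64_t GlobalCSN;

#define INDOUBT_CSN ((GlobalCSN) 1)
#define MAX_XID     4096			/* toy: flat array, not an SLRU */

static GlobalCSN csn_log[MAX_XID];

/* Toy CSN source: current wall-clock time in nanoseconds. */
static GlobalCSN
node_clock_csn(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (GlobalCSN) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Prepare phase: the xact becomes InDoubt; its node-local CSN is returned. */
GlobalCSN
snapshot_prepare(TransactionId xid)
{
	csn_log[xid] = INDOUBT_CSN;
	return node_clock_csn();
}

/* Assign phase: the coordinator sends back the max over all NodeCSNs. */
void
snapshot_assign(TransactionId xid, GlobalCSN max_csn)
{
	csn_log[xid] = max_csn;		/* readers blocked on InDoubt may proceed */
}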

I feel that if Sawada-San or someone involved in another patch also
studies this approach and tries to come up with some form of comparison,
then we might be able to make a better decision.  It is possible that
there are a few good things in each approach which we can use.

>   Also, I
> > would like to know if the patch related to CSN based snapshot [2] is a
> > precursor for this, if not, then is it any way related to this patch
> > because I see the latest reply on that thread [2] which says it is an
> > infrastructure of sharding feature but I don't understand completely
> > whether these patches are related?
> I need some time to study this patch. At first sight it is different.
>

I feel the opposite.  I think it has extracted some stuff from this
patch series and extended the same.

Thanks for the inputs.  I feel inputs from you and others who were
involved in this project will be really helpful to move this project
forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-06-20 Thread Amit Kapila
On Fri, Jun 19, 2020 at 6:33 PM Bruce Momjian  wrote:
>
> On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead...@highgo.ca wrote:
> >
> > >> would like to know if the patch related to CSN based snapshot [2] is a
> > >> precursor for this, if not, then is it any way related to this patch
> > >> because I see the latest reply on that thread [2] which says it is an
> > >> infrastructure of sharding feature but I don't understand completely
> > >> whether these patches are related?
> > >I need some time to study this patch.. At first sight it is different.
> >
> > This patch [2] is mostly based on [3]; because I think [1] is talking
> > about 2PC and FDW, this patch focuses on CSN only, and I detached the
> > global snapshot part and the FDW part from the [1] patch.
> >
> > I noticed that the CSN will not survive a restart in the [1] patch. I
> > think that may not be the right way; maybe it is what the last mail
> > meant by "needs guarantees of monotonic increase of the CSN in the
> > case of an instance restart/crash etc.", so I tried to add WAL support
> > for CSN in this patch.
> >
> > That's why this thread exists.
>
> I was certainly missing how these items fit together.  Sharding needs
> parallel FDWs, atomic commits, and atomic snapshots.  To get atomic
> snapshots, we need CSN.  This new sharding wiki page has more details:
>
> https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
>

Thanks for maintaining this page.  It is quite helpful!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-06-19 Thread Bruce Momjian
On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead...@highgo.ca wrote:
> 
> >> would like to know if the patch related to CSN based snapshot [2] is a
> >> precursor for this, if not, then is it any way related to this patch
> >> because I see the latest reply on that thread [2] which says it is an
> >> infrastructure of sharding feature but I don't understand completely
> >> whether these patches are related?
> >I need some time to study this patch.. At first sight it is different.
> 
> This patch [2] is mostly based on [3]; because I think [1] is talking
> about 2PC and FDW, this patch focuses on CSN only, and I detached the
> global snapshot part and the FDW part from the [1] patch.
> 
> I noticed that the CSN will not survive a restart in the [1] patch. I
> think that may not be the right way; maybe it is what the last mail
> meant by "needs guarantees of monotonic increase of the CSN in the
> case of an instance restart/crash etc.", so I tried to add WAL support
> for CSN in this patch.
> 
> That's why this thread exists.

I was certainly missing how these items fit together.  Sharding needs
parallel FDWs, atomic commits, and atomic snapshots.  To get atomic
snapshots, we need CSN.  This new sharding wiki page has more details:

https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

After all that is done, we will need optimizer improvements and shard
management tooling.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Global snapshots

2020-06-19 Thread movead...@highgo.ca

>> would like to know if the patch related to CSN based snapshot [2] is a
>> precursor for this, if not, then is it any way related to this patch
>> because I see the latest reply on that thread [2] which says it is an
>> infrastructure of sharding feature but I don't understand completely
>> whether these patches are related?
>I need some time to study this patch. At first sight it is different.

This patch [2] is mostly based on [3]; because I think [1] is talking
about 2PC and FDW, this patch focuses on CSN only, and I detached the
global snapshot part and the FDW part from the [1] patch.

I noticed that the CSN will not survive a restart in the [1] patch. I
think that may not be the right way; maybe it is what the last mail
meant by "needs guarantees of monotonic increase of the CSN in the
case of an instance restart/crash etc.", so I tried to add WAL support
for CSN in this patch.

That's why this thread exists.
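
Such WAL support could be shaped roughly like the following sketch,
which goes through PostgreSQL's generic XLOG insertion API. The
RM_CSNLOG_ID resource manager, the XLOG_CSN_ASSIGNMENT info flag, and
the record layout are invented for illustration and are not taken from
the patch:

#include "postgres.h"
#include "access/xlog.h"
#include "access/xloginsert.h"

/* Hypothetical record: which xid got which CSN. */
typedef struct xl_csn_assignment
{
	TransactionId	xid;
	uint64			csn;
} xl_csn_assignment;

static void
log_csn_assignment(TransactionId xid, uint64 csn)
{
	xl_csn_assignment	xlrec;
	XLogRecPtr			recptr;

	xlrec.xid = xid;
	xlrec.csn = csn;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
	recptr = XLogInsert(RM_CSNLOG_ID, XLOG_CSN_ASSIGNMENT);

	/* Make the assignment durable before anyone may observe it. */
	XLogFlush(recptr);
}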

> [1] - 
> https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca
[3] https://www.postgresql.org/message-id/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
 


Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: Global snapshots

2020-06-19 Thread Andrey V. Lepikhov

On 6/19/20 11:48 AM, Amit Kapila wrote:
> On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
>  wrote:
>> On 09.06.2020 11:41, Fujii Masao wrote:
>>> The patches seem not to be registered in CommitFest yet.
>>> Are you planning to do that?
>> Not now. It is a sharding-related feature. I'm not sure that this
>> approach is fully consistent with the current approach to sharding.
> Can you please explain in detail why you think so?  There is no
> commit message explaining what each patch does, so it is difficult to
> understand why you said so.
For now I used this patch set to provide correct visibility in the
case of access to a table with foreign partitions from many nodes in
parallel. So I looked at this patch set as a sharding-related feature,
but [1] shows another useful application.

The CSN-based approach has weak points such as:
1. Dependency on clock synchronization
2. Needs guarantees of monotonic increase of the CSN in the case
of an instance restart/crash etc.
3. We need to delay increasing OldestXmin because it can be needed
for a transaction snapshot at another node.
So I do not have full conviction that it will be better than a single
distributed transaction manager.
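
For weak point 2, one standard remedy (a minimal sketch of one
possibility, not what this patch set currently does) is to never hand
out a CSN at or below a persisted high-water mark;
persist_high_water_mark() is a hypothetical hook that would have to be
WAL-logged or fsync'd in a real implementation:

#include <stdint.h>
#include <time.h>

typedef uint64_t GlobalCSN;

static GlobalCSN last_csn;		/* restored from the mark at startup */

/* Hypothetical durability hook: WAL-log or fsync the value. */
extern void persist_high_water_mark(GlobalCSN csn);

GlobalCSN
next_csn(void)
{
	struct timespec ts;
	GlobalCSN	csn;

	clock_gettime(CLOCK_REALTIME, &ts);
	csn = (GlobalCSN) ts.tv_sec * 1000000000ULL + ts.tv_nsec;

	/* Never go backwards, even if the clock did (NTP step, VM pause). */
	if (csn <= last_csn)
		csn = last_csn + 1;
	last_csn = csn;

	/*
	 * Persisting (possibly in batches, with a safety margin) lets a
	 * restarted instance resume strictly above everything it ever
	 * handed out.  Single-process sketch: a real one needs locking.
	 */
	persist_high_water_mark(csn);
	return csn;
}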

> Also, can you let us know if this
> supports 2PC in some way and if so how is it different from what the
> other thread on the same topic [1] is trying to achieve?
Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
2PC machinery. Now I'd not judge which approach is better.

> Also, I
> would like to know if the patch related to CSN-based snapshots [2] is a
> precursor for this; if not, is it in any way related to this patch?  I
> ask because I see the latest reply on that thread [2], which says it is
> infrastructure for a sharding feature, but I don't completely
> understand whether these patches are related.
I need some time to study this patch. At first sight it is different.

> Basically, there seem to be three threads: first, this one, and then
> [1] and [2], which seem to be doing the work for the sharding feature,
> but there is no clear explanation anywhere of whether these are in any
> way related or whether, combining all three, we are aiming for a
> solution for atomic commit and atomic visibility.
It can be useful to study all approaches.

> I am not sure if you know the answers to all these questions, so I
> added the people who seem to be working on the other two patches.  I am
> also afraid that there may be duplicate or conflicting work going on in
> these threads, so we should try to find that out as well.
Ok



> [1] -
> https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca



[1] 
https://www.postgresql.org/message-id/flat/20200301083601.ews6hz5dduc3w2se%40alap3.anarazel.de


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com




Re: Global snapshots

2020-06-18 Thread Amit Kapila
On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
 wrote:
>
>
> On 09.06.2020 11:41, Fujii Masao wrote:
> >
> >
> > The patches seem not to be registered in CommitFest yet.
> > Are you planning to do that?
> Not now. It is a sharding-related feature. I'm not sure that this
> approach is fully consistent with the current approach to sharding.
>

Can you please explain in detail why you think so?  There is no
commit message explaining what each patch does, so it is difficult to
understand why you said so.  Also, can you let us know if this
supports 2PC in some way and if so how is it different from what the
other thread on the same topic [1] is trying to achieve?  Also, I
would like to know if the patch related to CSN-based snapshots [2] is a
precursor for this; if not, is it in any way related to this patch?  I
ask because I see the latest reply on that thread [2], which says it is
infrastructure for a sharding feature, but I don't completely
understand whether these patches are related.

Basically, there seem to be three threads: first, this one, and then
[1] and [2], which seem to be doing the work for the sharding feature,
but there is no clear explanation anywhere of whether these are in any
way related or whether, combining all three, we are aiming for a
solution for atomic commit and atomic visibility.

I am not sure if you know the answers to all these questions, so I
added the people who seem to be working on the other two patches.  I am
also afraid that there may be duplicate or conflicting work going on in
these threads, so we should try to find that out as well.


[1] - 
https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Global snapshots

2020-06-09 Thread Andrey V. Lepikhov


On 09.06.2020 11:41, Fujii Masao wrote:
> On 2020/05/12 19:24, Andrey Lepikhov wrote:
>> Rebased onto current master (fb544735f1).
> Thanks for the patches!
>
> These patches no longer apply cleanly and cause a compilation failure.
> So could you rebase and update them?
Rebased onto 57cb806308 (see attachment).

> The patches seem not to be registered in CommitFest yet.
> Are you planning to do that?
Not now. It is a sharding-related feature. I'm not sure that this
approach is fully consistent with the current approach to sharding.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com

>From cd6a8585f9814b7e465abb2649ac84e80e7c726b Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Tue, 9 Jun 2020 14:55:38 +0500
Subject: [PATCH 1/3] GlobalCSNLog-SLRU

---
 src/backend/access/transam/Makefile |   1 +
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c   |   1 +
 src/backend/access/transam/varsup.c |   2 +
 src/backend/access/transam/xlog.c   |  12 +
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/ipc/procarray.c |   3 +
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/tcop/postgres.c |   1 +
 src/backend/utils/misc/guc.c|   9 +
 src/backend/utils/probes.d  |   2 +
 src/bin/initdb/initdb.c |   3 +-
 src/include/access/global_csn_log.h |  30 ++
 src/include/storage/lwlock.h|   1 +
 src/include/utils/snapshot.h|   3 +
 15 files changed, 510 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..60ff8b141e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	global_csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/global_csn_log.c b/src/backend/access/transam/global_csn_log.c
new file mode 100644
index 00..6f7fded350
--- /dev/null
+++ b/src/backend/access/transam/global_csn_log.c
@@ -0,0 +1,439 @@
+/*-
+ *
+ * global_csn_log.c
+ *		Track global commit sequence numbers of finished transactions
+ *
+ * Implementation of cross-node transaction isolation relies on commit sequence
+ * number (CSN) based visibility rules.  This module provides SLRU to store
+ * CSN for each transaction.  This mapping needs to be kept only for xids
+ * greater than oldestXid, but that can require arbitrarily large amounts of
+ * memory in case of long-lived transactions.  Because of the same lifetime and
+ * persistence requirements this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/global_csn_log.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "access/global_csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool track_global_snapshots;
+
+/*
+ * Defines for GlobalCSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * GlobalCSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/GLOBAL_CSN_LOG_XACTS_PER_PAGE, and GlobalCSNLog segment numbering
+ * at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateGlobalCSNLog (see GlobalCSNLogPagePrecedes).
+ */
+
+/* We store the commit GlobalCSN for each xid */
+#define GCSNLOG_XACTS_PER_PAGE (BLCKSZ / sizeof(GlobalCSN))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData GlobalCSNLogCtlData;
+#define GlobalCsnlogCtl (&GlobalCSNLogCtlData)
+
+static int	ZeroGlobalCSNLogPage(int pageno);
+static bool GlobalCSNLogPagePrecedes(int page1, int page2);
+static void GlobalCSNLogSetPageStatus(TransactionId xid, int nsubxids,
+	  TransactionId *subxids,
+	  GlobalCSN csn, int pageno);
+static void GlobalCSNLogSetCSNInSlot(TransactionId xid, GlobalCSN csn,
+	  int slotno);
+
+/*
+ * GlobalCSNLogSetCSN
+ *
+ * Record GlobalCSN of transaction and

Re: Global snapshots

2020-06-08 Thread Fujii Masao




On 2020/05/12 19:24, Andrey Lepikhov wrote:

Rebased onto current master (fb544735f1).


Thanks for the patches!

These patches no longer apply cleanly and cause a compilation failure.
So could you rebase and update them?

The patches seem not to be registered in CommitFest yet.
Are you planning to do that?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Global snapshots

2020-05-12 Thread Andrey Lepikhov

Rebased onto current master (fb544735f1).

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From 29183c42a8ae31b830ab5af0dfcfdaadd6229700 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 12 May 2020 08:29:54 +0500
Subject: [PATCH 1/3] GlobalCSNLog-SLRU-v3

---
 src/backend/access/transam/Makefile |   1 +
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c   |   1 +
 src/backend/access/transam/varsup.c |   2 +
 src/backend/access/transam/xlog.c   |  12 +
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/ipc/procarray.c |   3 +
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/tcop/postgres.c |   1 +
 src/backend/utils/misc/guc.c|   9 +
 src/backend/utils/probes.d  |   2 +
 src/bin/initdb/initdb.c |   3 +-
 src/include/access/global_csn_log.h |  30 ++
 src/include/storage/lwlock.h|   1 +
 src/include/utils/snapshot.h|   3 +
 15 files changed, 510 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..60ff8b141e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	clog.o \
 	commit_ts.o \
+	global_csn_log.o \
 	generic_xlog.o \
 	multixact.o \
 	parallel.o \
diff --git a/src/backend/access/transam/global_csn_log.c b/src/backend/access/transam/global_csn_log.c
new file mode 100644
index 00..6f7fded350
--- /dev/null
+++ b/src/backend/access/transam/global_csn_log.c
@@ -0,0 +1,439 @@
+/*-
+ *
+ * global_csn_log.c
+ *		Track global commit sequence numbers of finished transactions
+ *
+ * Implementation of cross-node transaction isolation relies on commit sequence
+ * number (CSN) based visibility rules.  This module provides SLRU to store
+ * CSN for each transaction.  This mapping needs to be kept only for xids
+ * greater than oldestXid, but that can require arbitrarily large amounts of
+ * memory in case of long-lived transactions.  Because of the same lifetime and
+ * persistence requirements this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/global_csn_log.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "access/global_csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool track_global_snapshots;
+
+/*
+ * Defines for GlobalCSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * GlobalCSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/GLOBAL_CSN_LOG_XACTS_PER_PAGE, and GlobalCSNLog segment numbering
+ * at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateGlobalCSNLog (see GlobalCSNLogPagePrecedes).
+ */
+
+/* We store the commit GlobalCSN for each xid */
+#define GCSNLOG_XACTS_PER_PAGE (BLCKSZ / sizeof(GlobalCSN))
+
+#define TransactionIdToPage(xid)	((xid) / (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for CLOG control
+ */
+static SlruCtlData GlobalCSNLogCtlData;
+#define GlobalCsnlogCtl (&GlobalCSNLogCtlData)
+
+static int	ZeroGlobalCSNLogPage(int pageno);
+static bool GlobalCSNLogPagePrecedes(int page1, int page2);
+static void GlobalCSNLogSetPageStatus(TransactionId xid, int nsubxids,
+	  TransactionId *subxids,
+	  GlobalCSN csn, int pageno);
+static void GlobalCSNLogSetCSNInSlot(TransactionId xid, GlobalCSN csn,
+	  int slotno);
+
+/*
+ * GlobalCSNLogSetCSN
+ *
+ * Record GlobalCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should b

Re: Global snapshots

2019-04-21 Thread Andrey Borodin
Hi!

> On 30 Nov 2018, at 18:00, Stas Kelvich wrote:
> 
> 
> <0001-GlobalCSNLog-SLRU-v3.patch><0002-Global-snapshots-v3.patch><0003-postgres_fdw-support-for-global-snapshots-v3.patch>

In spite of recent backup discussions I realized that we need to back up
clusters even if they provide global snapshot capabilities.

I think we can have pretty elegant Point-in-CSN-Recovery here if we
want a group of clusters to recover to a globally consistent state, right?

Best regards, Andrey Borodin.



Re: Global snapshots

2019-01-31 Thread Stas Kelvich



> On 31 Jan 2019, at 18:42, Andres Freund  wrote:
> 
> Hi,
> 
> On 2018-11-30 16:00:17 +0300, Stas Kelvich wrote:
>>> On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> Is there any resulting patch where the ideas on how to implement this
>>> are outlined?
>> 
>> Not yet. I’m going to continue work on this in January. And probably try to
>> force some of nearby committers to make a line by line review.
> 
> This hasn't happened yet, so I think this ought to be marked as returned
> with feedback?
> 

No objections. I don't think this will realistically go in during the last
CF, so I will open it during the next release cycle.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2019-01-31 Thread Andres Freund
Hi,

On 2018-11-30 16:00:17 +0300, Stas Kelvich wrote:
> > On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Is there any resulting patch where the ideas on how to implement this
> > are outlined?
> 
> Not yet. I’m going to continue work on this in January. And probably try to
> force some of nearby committers to make a line by line review.

This hasn't happened yet, so I think this ought to be marked as returned
with feedback?

- Andres



Re: Global snapshots

2018-11-29 Thread Dmitry Dolgov
> On Wed, Jul 25, 2018 at 1:35 PM Arseny Sher  wrote:
>
> Hello,
>
> I have looked through the patches and found them pretty accurate. I've
> fixed a lot of small issues here and there; an updated patchset is
> attached.

Hi,

Thank you for working on this patch. Unfortunately, the patch has some
conflicts; could you please rebase it? Also, I wonder if you or Stas can
shed some light on this:

> On Wed, May 16, 2018 at 2:02 PM Stas Kelvich  wrote:
> > On 15 May 2018, at 15:53, Robert Haas  wrote:
> >
> > I guess it seems to me that you
> > have some further research to do along the lines you've described:
> >
> > 1. Can we hold back xmin only when necessary and to the extent
> > necessary instead of all the time?
> > 2. Can we use something like an STO analog, maybe as an optional
> > feature, rather than actually holding back xmin?
>
> Yes, to both questions. I'll implement that and share results.

Is there any resulting patch where the ideas on how to implement this are outlined?



Re: Global snapshots

2018-09-26 Thread Arseny Sher
Hello,

Andrey Borodin  writes:

> I like the idea that with this patch set universally all postgres
> instances are bound into single distributed DB, even if they never
> heard about each other before :) This is just amazing. Or do I get
> something wrong?

Yeah, in a sense of xact visibility we can view it like this.

> I've got few questions:
> 1. If we coordinate HA-clusters with replicas, can replicas
> participate if their part of transaction is read-only?

Ok, there are several things to consider. Clock-SI as described in the
paper technically boils down to three things. The first two assume a
CSN-based implementation of MVCC where local time acts as the
CSN/snapshot source, and they impose the following additional rules:

1) When an xact expands to some node and imports its snapshot, it must
  be blocked until the clocks on this node show a time >= the snapshot
  being imported: a node never processes xacts with a snapshot 'from
  the future'.
2) When we choose the CSN to commit an xact with, we must read the
  clocks on all the nodes that participated in it and set the CSN to
  the max among the read values.

These rules ensure *integrity* of the snapshot in the face of clock
skew regardless of which node we access: that is, snapshots are stable
(no non-repeatable reads) and no xact is considered half committed; they
prevent the situation where the snapshot sees some xact on one node as
committed and on another node as still running.
(Actually, this is only true under the assumption that any distributed
xact is committed at all nodes instantly at the same time; this is
obviously not true, see the 3rd point below.)
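
In toy form, the two rules look like this (an illustration of Clock-SI
only, not code from the patchset; the nanosecond wall clock as CSN
source is my own simplification):

#include <stdint.h>
#include <time.h>
#include <unistd.h>

typedef uint64_t GlobalCSN;

static GlobalCSN
node_clock_csn(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (GlobalCSN) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Rule 1: never act on a snapshot "from the future" of the local clock. */
void
import_snapshot(GlobalCSN snapshot_csn)
{
	while (node_clock_csn() < snapshot_csn)
		usleep(10);				/* wait out the skew */
}

/* Rule 2: the commit CSN is the max over the participants' clocks. */
GlobalCSN
choose_commit_csn(const GlobalCSN *node_clocks, int n)
{
	GlobalCSN	max = 0;
	int			i;

	for (i = 0; i < n; i++)
		if (node_clocks[i] > max)
			max = node_clocks[i];
	return max;
}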

If we are speaking about e.g. traditional usage of hot standby, where a
client in one xact accesses either the primary or some standby, but not
several nodes at once, we just don't need this stuff, because the usual
MVCC in Postgres already provides a consistent snapshot. The same is
true for multimaster-like setups, where each node accepts writes but a
client still accesses a single node; if there is a write conflict (e.g.
the same row updated on different nodes), one of the xacts must be
aborted; the snapshot is still good.

However, if you really intend to read data from multiple nodes in one
xact (e.g. read the primary and then a replica), then yes, these
problems arise and Clock-SI helps with them. However, honestly it is
hard for me to come up with a reason why you would want to do that:
reading local data is always more efficient than visiting several
nodes. It would make sense if we could read the primary and a replica
in parallel, but that is currently impossible in core Postgres. A more
straightforward application of the patchset is sharding, where data is
split and you might need to go to several nodes in one xact to collect
the needed data.

Also, this patchset adds core algorithm and makes use of it only in
postgres_fdw; you would need to adapt replication (import/export global
snapshot API) to make it work there.

3) The third rule of Clock-SI deals with the following problem.
  Distributed (writing to several nodes) xact doesn't commit (i.e.
  becomes visible) instantly at all nodes. That means that there is a
  time hole in which we can see xact as committed on some node and still
  running on another. To mitigate this, Clock-SI adds kind of two-phase
  commit on visibility: additional state InDoubt which blocks all
  attempts to read this xact changes until xact's fate (commit/abort) is
  determined.

Unlike the previous problem, this issue exists in all replicated
setups. Even if we just have primary streaming data to one hot standby,
xacts are not committed on them instantly and we might observe xact as
committed on primary, then quickly switch to standby and find the data
we have just seen disappeared. remote_apply mode partially alleviates
this problem (apparently to the degree comfortable for most application
developers) by switching positions: with it xact always commits on
replicas earlier than on master. At least this guarantees that the guy
who wrote the xact will definitely see it on replica unless it drops the
connection to master before commit ack. Still the problem is not fully
solved: only addition of InDoubt state can fix this.

While Clock-SI (and this patchset) certainly addresses the issue, as it
becomes even more serious in sharded setups (it makes it possible to see
/parts/ of transactions), there is nothing CSN- or clock-specific
here. In theory, we could implement the same two-phase commit on
visibility without switching to timestamp-based CSN MVCC.

Aside from the paper, you can have a look at Clock-SI explanation in
these slides [1] from PGCon.

> 2. How does InDoubt transaction behave when we add or subtract leap seconds?

Good question! In Clock-SI, time can be arbitrarily desynchronized and
might go forward at arbitrary speed (e.g. clocks can be stopped), but
it must never go backwards. So if leap second correction is implemented
by doubling the duration of a certain second (as it usually seems to
be), we are fine.

> Also, I could not understand some notes

Re: Global snapshots

2018-09-24 Thread Andrey Borodin
Hi!

I want to review this patch set, though I understand that it will
probably be quite a long process.

I like the idea that with this patch set all postgres instances are
universally bound into a single distributed DB, even if they never heard
about each other before :) This is just amazing. Or do I get something
wrong?

I've got few questions:
1. If we coordinate HA-clusters with replicas, can replicas participate if 
their part of transaction is read-only?
2. How does InDoubt transaction behave when we add or subtract leap seconds?

Also, I could not understand some notes from Arseny:

> On 25 Jul 2018, at 16:35, Arseny Sher wrote:
> 
> * One drawback of these patches is that only REPEATABLE READ is
>   supported. For READ COMMITTED, we must export every new snapshot
>   generated on coordinator to all nodes, which is fairly easy to
>   do. SERIALIZABLE will definitely require chattering between nodes,
>   but that's much less demanded isolevel (e.g. we still don't support
>   it on replicas).

If all shards are executing transactions in SERIALIZABLE, what anomalies
does it permit?

If you have transactions on servers A and B, there can be transactions 1
and 2 such that transaction A1 is serialized before A2, but B1 is after
B2, right?

Maybe we can somehow abort 1 or 2? 

> 
> * Another somewhat serious issue is that there is a risk of recency
>   guarantee violation. If client starts transaction at node with
>   lagging clocks, its snapshot might not include some recently
>   committed transactions; if client works with different nodes, she
>   might not even see her own changes. CockroachDB describes at [1] how
>   they and Google Spanner overcome this problem. In short, both set
>   hard limit on maximum allowed clock skew.  Spanner uses atomic
>   clocks, so this skew is small and they just wait it at the end of
>   each transaction before acknowledging the client. In CockroachDB, if
>   tuple is not visible but we are unsure whether it is truly invisible
>   or it's just the skew (the difference between snapshot and tuple's
>   csn is less than the skew), transaction is restarted with advanced
>   snapshot. This process is not infinite because the upper border
>   (initial snapshot + max skew) stays the same; this is correct as we
>   just want to ensure that our xact sees all the committed ones before
>   it started. We can implement the same thing.
I think that this situation is also covered in Clock-SI since transactions will 
not exit InDoubt state before we can see them. But I'm not sure, chances are 
that I get something wrong, I'll think more about it. I'd be happy to hear 
comments from Stas about this.
> 
> 
> * 003_bank_shared.pl test is removed. In current shape (loading one
>   node) it is useless, and if we bombard both nodes, deadlock surely
>   appears. In general, global snaphots are not needed for such
>   multimaster-like setup -- either there are no conflicts and we are
>   fine, or there is a conflict, in which case we get a deadlock.
Can we do something about this deadlock? Would placing an upper limit on
the time spent in the InDoubt state fix the issue? I understand that
aborting automatically is kind of dangerous...

Also, a hanging 2PC transaction can currently cause a lot of headaches
for a DBA. Can we have some kind of protection for the case when one
node is gone permanently during a transaction?

Thanks!

Best regards, Andrey Borodin.


Re: Global snapshots

2018-07-25 Thread Arseny Sher
Hello,

I have looked through the patches and found them pretty accurate. I've
fixed a lot of small issues here and there; an updated patchset is
attached. But first, some high-level notes:

 * I agree that it would be cool to implement functionality like the
   current "snapshot too old": that is, abort a transaction with an old
   global snapshot only if it really attempted to touch modified data.

 * I also agree with Stas that any attempts to trade oldestxmin in
   gossip between the nodes would drastically complicate this patch and
   make it discussion-prone; it would be nice first to get some feedback
   on general approach, especially from people trying to distribute
   Postgres.

 * One drawback of these patches is that only REPEATABLE READ is
   supported. For READ COMMITTED, we must export every new snapshot
   generated on coordinator to all nodes, which is fairly easy to
   do. SERIALIZABLE will definitely require chattering between nodes,
   but that's much less demanded isolevel (e.g. we still don't support
   it on replicas).

 * Another somewhat serious issue is that there is a risk of recency
   guarantee violation. If a client starts a transaction at a node with
   lagging clocks, its snapshot might not include some recently
   committed transactions; if a client works with different nodes, she
   might not even see her own changes. CockroachDB describes at [1] how
   they and Google Spanner overcome this problem. In short, both set a
   hard limit on the maximum allowed clock skew.  Spanner uses atomic
   clocks, so this skew is small, and they just wait it out at the end
   of each transaction before acknowledging the client. In CockroachDB,
   if a tuple is not visible but we are unsure whether it is truly
   invisible or it's just the skew (the difference between the snapshot
   and the tuple's csn is less than the skew), the transaction is
   restarted with an advanced snapshot. This process is not infinite
   because the upper border (initial snapshot + max skew) stays the
   same; this is correct as we just want to ensure that our xact sees
   all the transactions committed before it started. We can implement
   the same thing.
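
A toy version of that restart rule, for illustration only (the names
and the struct are mine, not CockroachDB's or this patchset's):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t GlobalCSN;

typedef struct
{
	GlobalCSN	snapshot_csn;	/* may advance on restart */
	GlobalCSN	upper_bound;	/* initial snapshot + max allowed skew */
} UncertainSnapshot;

typedef enum { VISIBLE, INVISIBLE, RESTART } Verdict;

static Verdict
check_tuple(UncertainSnapshot *snap, GlobalCSN tuple_csn)
{
	if (tuple_csn < snap->snapshot_csn)
		return VISIBLE;

	/*
	 * Not visible, but within the clock-skew window: we cannot tell
	 * whether it truly committed after our snapshot, so restart with
	 * the snapshot advanced past the tuple.  The fixed upper bound
	 * guarantees termination: we only ever chase commits that could
	 * predate the start of our xact.
	 */
	if (tuple_csn <= snap->upper_bound)
	{
		snap->snapshot_csn = tuple_csn + 1;
		return RESTART;
	}

	return INVISIBLE;
}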

Now, the description of my mostly cosmetic changes:

 * Don't ERROR in BroadcastStmt to allow us to handle failure manually;
 * Check global_snapshot_defer_time in ImportGlobalSnapshot instead of
   tripping an assert;
 * (Arguably) improved comments around locking in circular buffer
   maintenance; also, don't lock procarray during global_snapshot_xmin
   bump.
 * s/snaphot/snapshot, other typos.
 * Don't enable track_global_snapshots by default -- while handy for
   testing, it doesn't look like a good default.
 * Set track_global_snapshots = true in tests everywhere.
 * GUC renamed from postgres_fdw.use_tsdtm to
   postgres_fdw.use_global_snapshots for consistency.
 * 003_bank_shared.pl test is removed. In its current shape (loading one
   node) it is useless, and if we bombard both nodes, a deadlock surely
   appears. In general, global snapshots are not needed for such a
   multimaster-like setup -- either there are no conflicts and we are
   fine, or there is a conflict, in which case we get a deadlock.
 * Fix initdb failure with non-zero global_snapshot_defer_time.
 * Enforce REPEATABLE READ, since currently we export the snapshot only
   once per xact.
 * Remove assertion that circular buffer entries are monotonic, as
   GetOldestXmin *can* go backwards.


[1] https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From 21687e75366df03b92e48c6125bb2e90d01bb70a Mon Sep 17 00:00:00 2001
From: Stas Kelvich 
Date: Wed, 25 Apr 2018 16:05:46 +0300
Subject: [PATCH 1/3] GlobalCSNLog SLRU

---
 src/backend/access/transam/Makefile |   3 +-
 src/backend/access/transam/global_csn_log.c | 439 
 src/backend/access/transam/twophase.c   |   1 +
 src/backend/access/transam/varsup.c |   2 +
 src/backend/access/transam/xlog.c   |  12 +
 src/backend/storage/ipc/ipci.c  |   3 +
 src/backend/storage/ipc/procarray.c |   3 +
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/utils/misc/guc.c|   9 +
 src/backend/utils/probes.d  |   2 +
 src/bin/initdb/initdb.c |   3 +-
 src/include/access/global_csn_log.h |  30 ++
 src/include/storage/lwlock.h|   1 +
 src/include/utils/snapshot.h|   3 +
 14 files changed, 510 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47269..03aa360ea3 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile

Re: Global snapshots

2018-05-16 Thread Stas Kelvich


> On 15 May 2018, at 15:53, Robert Haas  wrote:
> 
> Actually, I think if we're going to pursue that approach, we ought to
> back off a bit from thinking about global snapshots and think about
> what kind of general mechanism we want.  For example, maybe you can
> imagine it like a message bus, where there are a bunch of named
> channels on which the server publishes messages and you can listen to
> the ones you care about.  There could, for example, be a channel that
> publishes the new system-wide globalxmin every time it changes, and
> another channel that publishes the wait graph every time the deadlock
> detector runs, and so on.  In fact, perhaps we should consider
> implementing it using the existing LISTEN/NOTIFY framework: have a
> bunch of channels that are predefined by PostgreSQL itself, and set
> things up so that the server automatically begins publishing to those
> channels as soon as anybody starts listening to them.  I have to
> imagine that if we had a good mechanism for this, we'd get all sorts
> of proposals for things to publish.  As long as they don't impose
> overhead when nobody's listening, we should be able to be fairly
> accommodating of such requests.
> 
> Or maybe that model is too limiting, either because we don't want to
> broadcast to everyone but rather send specific messages to specific
> connections, or else because we need a request-and-response mechanism
> rather than what is in some sense a one-way communication channel.
> Regardless, we should start by coming up with the right model for the
> protocol first, bearing in mind how it's going to be used and other
> things for which somebody might want to use it (deadlock detection,
> failover, leader election), and then implement whatever we need for
> global snapshots on top of it.  I don't think that writing the code
> here is going to be hugely difficult, but coming up with a good design
> is going to require some thought and discussion.

Well, it would be cool to have some general mechanism to (unreliably) send
messages between Postgres instances. I was thinking about the same thing
mostly in the context of our multimaster, where we have an arbiter bgworker
which collects 2PC responses and heartbeats from other nodes on a different
TCP port. It used to have some logic inside but evolved into just sending
messages from a shared-memory out-queue and waking backends upon message
arrival. But the necessity to manage a second port is painful and
error-prone, at least from a configuration point of view. So it would be
nice to have a more general mechanism to exchange messages via the postgres
port. Ideally with an interface like the one in shm_mq: send messages into
one queue, subscribe to responses in a different one. Among the other
things that were mentioned (xmin, deadlocks, elections/heartbeats), I am
especially interested in some multiplexing for postgres_fdw, to save on
context switches of individual backends while sending statements.

Talking about the model, I think it would be cool to have some primitives
like the ones provided by ZeroMQ (message push/subscribe/pop) and then
implement some more complex ones, like scatter/gather, on top of them.

However, that's probably a topic for a very important, but different,
thread. For the needs of global snapshots something less ambitious will do.

> And, for that matter, I think the same thing is true for global
> snapshots.  The coding is a lot harder for that than it is for some
> new subprotocol, I'd imagine, but it's still easier than coming up
> with a good design.

Sure. This whole global snapshot thing went through several internal
redesigns before becoming satisfactory from our standpoint. However,
nothing is keeping us from further iterations. In this regard, it would be
interesting to also hear comments from the Postgres-XL team -- from my
experience with the XL code, these patches in core can help XL drop a lot
of visibility-related ifdefs and seriously offload the GTM. But maybe I'm
missing something.

> I guess it seems to me that you
> have some further research to do along the lines you've described:
> 
> 1. Can we hold back xmin only when necessary and to the extent
> necessary instead of all the time?
> 2. Can we use something like an STO analog, maybe as an optional
> feature, rather than actually holding back xmin?

Yes, to both questions. I'll implement that and share results.

> And I'd add:
> 
> 3. Is there another approach altogether that doesn't rely on holding
> back xmin at all?

And for that question I believe the answer is no. If we want to keep
MVCC-like behaviour where read transactions aren't randomly aborted, we
will need to keep old versions, regardless of whether it is a local or a
global transaction. And to keep old versions we need to hold back xmin to
defuse HOT, microvacuum, macrovacuum, visibility maps, etc. At some point
we can switch to STO-like behaviour, but that should probably be used as
protection from unusually long transactions rather than as the standard
behavior.

> For example, if you constructed the happens-

Re: Global snapshots

2018-05-15 Thread Robert Haas
On Mon, May 14, 2018 at 7:20 AM, Stas Kelvich  wrote:
> Summarising, I think that introducing permanent connections to the
> postgres_fdw node would put too much burden on this patch set and that
> it will be possible to address that later (in the long run such a
> connection will be needed anyway, at least for deadlock detection).
> However, if you think that the current behavior + STO analog isn't good
> enough, then I'm ready to pursue that track.

I don't think I'd be willing to commit to a particular approach at
this point.  I think the STO analog is an interesting idea and worth
more investigation, and I think the idea of a permanent connection
with chatter that can be used to resolve deadlocks, coordinate shared
state, etc. is also an interesting idea.  But there are probably lots
of ideas that somebody could come up with in this area that would
sound interesting but ultimately not work out.  Also, an awful lot
depends on quality of implementation.  If you come up with an
implementation of a permanent connection for coordination "chatter",
and the patch gets rejected, it's almost certainly not a sign that we
don't want that thing in general.  It means we don't want yours.  :-)

Actually, I think if we're going to pursue that approach, we ought to
back off a bit from thinking about global snapshots and think about
what kind of general mechanism we want.  For example, maybe you can
imagine it like a message bus, where there are a bunch of named
channels on which the server publishes messages and you can listen to
the ones you care about.  There could, for example, be a channel that
publishes the new system-wide globalxmin every time it changes, and
another channel that publishes the wait graph every time the deadlock
detector runs, and so on.  In fact, perhaps we should consider
implementing it using the existing LISTEN/NOTIFY framework: have a
bunch of channels that are predefined by PostgreSQL itself, and set
things up so that the server automatically begins publishing to those
channels as soon as anybody starts listening to them.  I have to
imagine that if we had a good mechanism for this, we'd get all sorts
of proposals for things to publish.  As long as they don't impose
overhead when nobody's listening, we should be able to be fairly
accommodating of such requests.
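
To make the model concrete, here is what consuming one of those
predefined channels could look like from a client, with ordinary libpq.
The channel name "pg_globalxmin" is invented for illustration -- no such
channel exists today -- but the libpq calls are real:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn   *conn = PQconnectdb("dbname=postgres");
        PGnotify *note;

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* Listening would be the cue for the server to start publishing. */
        PQclear(PQexec(conn, "LISTEN pg_globalxmin"));

        for (;;)    /* a real client would select() on PQsocket(conn) */
        {
            PQconsumeInput(conn);
            while ((note = PQnotifies(conn)) != NULL)
            {
                printf("system-wide globalxmin is now: %s\n", note->extra);
                PQfreemem(note);
            }
        }
    }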

Or maybe that model is too limiting, either because we don't want to
broadcast to everyone but rather send specific messages to specific
connections, or else because we need a request-and-response mechanism
rather than what is in some sense a one-way communication channel.
Regardless, we should start by coming up with the right model for the
protocol first, bearing in mind how it's going to be used and other
things for which somebody might want to use it (deadlock detection,
failover, leader election), and then implement whatever we need for
global snapshots on top of it.  I don't think that writing the code
here is going to be hugely difficult, but coming up with a good design
is going to require some thought and discussion.

And, for that matter, I think the same thing is true for global
snapshots.  The coding is a lot harder for that than it is for some
new subprotocol, I'd imagine, but it's still easier than coming up
with a good design.  As far as I can see, and everybody can decide for
themselves how far they think that is, the proposal you're making now
sounds like a significant improvement over the XTM proposal.  In
particular, the provisioning and deprovisioning issues sound like they
have been thought through a lot more.  I'm happy to call that
progress.  At the same time, progress on a journey is not synonymous
with arrival at the destination, and I guess it seems to me that you
have some further research to do along the lines you've described:

1. Can we hold back xmin only when necessary and to the extent
necessary instead of all the time?
2. Can we use something like an STO analog, maybe as an optional
feature, rather than actually holding back xmin?

And I'd add:

3. Is there another approach altogether that doesn't rely on holding
back xmin at all?

For example, if you constructed the happens-after graph between
transactions in shared memory, including actions on all nodes, and
looked for cycles, you could abort transactions that would complete a
cycle.  (We say A happens-after B if A reads or writes data previously
written by B.)  If no cycle exists then all is well.  I'm pretty sure
it's been well-established that a naive implementation of this
algorithm is terribly unperformant, but for example SSI works on this
principle.  It reduces the bookkeeping involved by being willing to
abort transactions that aren't really creating a cycle if they look
like they *might* create a cycle.  Now that's an implementation *on
top of* snapshots for the purpose of getting true serializability
rather than a way of getting global snapshots per se, so it's not
suitable for what you're trying to do here, but I think it shows that
algorithms b

Re: Global snapshots

2018-05-14 Thread Stas Kelvich


> On 11 May 2018, at 04:05, Masahiko Sawada  wrote:
> 
> If I understand correctly, simple writes with ordinary 2PC don't
> block readers who read those writes. For example, an insertion on a node
> doesn't block readers who read the inserted tuple. But in this global
> transaction, reads will be blocked while the global transaction is in
> the InDoubt state. Is that right? The InDoubt state will be short-lived
> if it's a local transaction, but I'm not sure about a global
> transaction. Because during the InDoubt state the coordinator has to
> prepare on all participant nodes and to assign the global csn to them
> (and end the global transaction), the global transaction could be in the
> InDoubt state for a relatively long time.

What I meant by "short-lived" is that InDoubt is set after the transaction
is prepared, so it doesn't depend on the size of the transaction, only on
network/commit latency. So you can have a transaction that did a bulk load
for several hours, and still the InDoubt state will last only for a network
round-trip and possibly an fsync that can happen during logging of the
commit record.

> Also, it could be even longer if the
> commit/rollback prepared is never performed due to a failure of any
> of the nodes.

In this case it definitely can. An individual node cannot know whether
that InDoubt transaction was committed or aborted somewhere, so it should
wait until somebody finishes this tx. The particular time to recover
depends on how failures are handled.

Speaking more generally, in the presence of failures we can unlock tuples
only when consensus on the transaction commit is reached. And the FLP
theorem states that this can take an indefinitely long time in a fully
asynchronous network. However, from a more practical point of view, the
probability of such behaviour in a real network becomes negligible after
several iterations of the voting process (some evaluations can be found in
https://ieeexplore.ieee.org/document/1352999/).
So several round-trips can be a decent approximation of how long it should
take to recover from the InDoubt state in case of failure.

> With this patch, can we start a remote transaction at READ COMMITTED
> with an imported global snapshot if the local transaction started at
> READ COMMITTED?

In theory it is possible; one just needs to send a new snapshot before each
statement. With some amount of careful work it is possible to achieve
READ COMMITTED with postgres_fdw using global snapshots.


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-14 Thread Stas Kelvich


> On 9 May 2018, at 17:51, Robert Haas  wrote:
> 
> Ouch.  That's not so bad at READ COMMITTED, but at higher isolation
> levels failure becomes extremely likely.  Any multi-statement
> transaction that lasts longer than global_snapshot_defer_time is
> pretty much doomed.

Ouch indeed. The current xmin holding scheme has two major drawbacks: it
introduces a timeout between the export and import of a snapshot, and it
holds xmin in a pessimistic way, so old versions will be preserved even
when there were no global transactions. On the positive side is simplicity:
that is the only way I can think of that doesn't require a distributed
calculation of the global xmin, which, in turn, will probably require a
permanent connection to the remote postgres_fdw node. It is not hard to add
some background worker to postgres_fdw that will hold a permanent
connection, but I'm afraid that is a very discussion-prone topic and that's
why I tried to avoid it.

> I don't think holding back xmin is a very good strategy.  Maybe it
> won't be so bad if and when we get zheap, since only the undo log will
> bloat rather than the table.  But as it stands, holding back xmin
> means everything bloats and you have to CLUSTER or VACUUM FULL the
> table in order to fix it.

Well, an open local transaction in Postgres holds back globalXmin for the
whole Postgres instance (with the exception of STO). Also, an active global
transaction should hold back the globalXmin of participating nodes to be
able to read the right versions (again, with the exception of STO).
  However, the xmin holding scheme itself can be different. For example, we
can periodically check (let's say every 1-2 seconds) the oldest GlobalCSN
on each node and delay globalXmin advancement only if some long transaction
really exists. Then the period of bloat will be limited by these 1-2
seconds, and no timeout between export/import will be imposed (see the
sketch below).
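
A rough sketch of that polling scheme (all names are invented; the RPC
and the barrier setter are assumed helpers, not existing functions):

    #include <stdint.h>
    #include <unistd.h>

    typedef uint64_t GlobalCSN;
    #define CSN_INFINITY UINT64_MAX

    extern GlobalCSN remote_oldest_global_csn(int node);  /* assumed RPC */
    /* maps the CSN back to a local xmin and installs the barrier: */
    extern void      set_global_xmin_barrier(GlobalCSN oldest_csn);

    static void
    global_xmin_poller(const int *nodes, int nnodes)
    {
        for (;;)
        {
            GlobalCSN oldest = CSN_INFINITY;

            for (int i = 0; i < nnodes; i++)
            {
                GlobalCSN csn = remote_oldest_global_csn(nodes[i]);

                if (csn < oldest)
                    oldest = csn;
            }

            /*
             * CSN_INFINITY means no active global transaction anywhere,
             * so the barrier is dropped and xmin advances immediately.
             */
            set_global_xmin_barrier(oldest);

            sleep(1);   /* the bloat window is bounded by this period */
        }
    }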
  Also, I want to note that global_snapshot_defer_time values of about
tens of seconds don't change much in terms of bloat compared to logical
replication. An active logical slot holds back globalXmin by setting
replication_slot_xmin, which is advanced on every RunningXacts record,
which in turn is logged every 15 seconds (hardcoded in
LOG_SNAPSHOT_INTERVAL_MS).

> If the behavior were really analogous to our existing "snapshot too
> old" feature, it wouldn't be so bad.  Old snapshots continue to work
> without error so long as they only read unmodified data, and only
> error out if they hit modified pages.

That is actually a good idea that I missed, thanks. Indeed, since all the
logic for checking modified pages is already present, it is possible to
reuse it and not raise a "Global STO" error right when an old snapshot is
imported, but only when the global transaction reads a modified page (a
sketch is below). I will implement that and update the patch set.
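
Something along these lines, mirroring TestForOldSnapshot(); the
SnapshotIsGlobal() test and the global_horizon_lsn field are invented
here, only the surrounding page-LSN machinery exists today:

    /* Called wherever TestForOldSnapshot() is called today. */
    static void
    TestForOldGlobalSnapshot(Snapshot snapshot, Page page)
    {
        if (SnapshotIsGlobal(snapshot) &&
            PageGetLSN(page) > snapshot->global_horizon_lsn)
            ereport(ERROR,
                    (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                     errmsg("global snapshot too old")));
    }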

Summarising, I think that introducing permanent connections to the
postgres_fdw node would put too much burden on this patch set and that it
will be possible to address that later (in the long run such a connection
will be needed anyway, at least for deadlock detection). However, if you
think that the current behavior + STO analog isn't good enough, then I'm
ready to pursue that track.


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-10 Thread Masahiko Sawada
On Fri, May 4, 2018 at 2:11 AM, Stas Kelvich  wrote:
>
>
>> On 3 May 2018, at 18:28, Masahiko Sawada  wrote:
>>
>> On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich  wrote:
>>> 1) To achieve commit atomicity of different nodes an intermediate step is
>>>   introduced: at first the running transaction is marked as InDoubt on all
>>>   nodes, and only after that each node commits it and stamps it with a given
>>>   GlobalCSN. All readers who run into tuples of an InDoubt transaction
>>>   should wait until it ends and recheck visibility.
>>
>> I'm concerned that a long-running transaction could keep other
>> transactions waiting and then the system gets stuck. Can this happen?
>> And is there any workaround?
>
> The InDoubt state is set just before commit, so it is a short-lived state.
> During transaction execution a global tx looks like an ordinary running
> transaction. A failure during 2PC, with the coordinator not being able to
> commit/abort this prepared transaction, can result in a situation where
> InDoubt tuples will be locked for reading, but in such a situation the
> coordinator should be blamed. The same problems will arise if a prepared
> transaction holds locks, for example.

Thank you for the explanation! I understood the algorithm. I have two questions.

If I understand correctly, simple writes with ordinary 2PC don't
block readers who read those writes. For example, an insertion on a node
doesn't block readers who read the inserted tuple. But in this global
transaction, reads will be blocked while the global transaction is in the
InDoubt state. Is that right? The InDoubt state will be short-lived
if it's a local transaction, but I'm not sure about a global transaction.
Because during the InDoubt state the coordinator has to prepare on all
participant nodes and to assign the global csn to them (and end the global
transaction), the global transaction could be in the InDoubt state for a
relatively long time. Also, it could be even longer if the
commit/rollback prepared is never performed due to a failure of any
of the nodes.

With this patch, can we start a remote transaction at READ COMMITTED
with an imported global snapshot if the local transaction started at
READ COMMITTED?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Global snapshots

2018-05-09 Thread Robert Haas
On Tue, May 8, 2018 at 4:51 PM, Stas Kelvich  wrote:
>> On 7 May 2018, at 20:04, Robert Haas  wrote:
>> But what happens if a transaction starts on node A at time T0 but
>> first touches node B at a much later time T1, such that T1 - T0 >
>> global_snapshot_defer_time?
>
> Such a transaction will get a "global snapshot too old" error.

Ouch.  That's not so bad at READ COMMITTED, but at higher isolation
levels failure becomes extremely likely.  Any multi-statement
transaction that lasts longer than global_snapshot_defer_time is
pretty much doomed.

> In principle such behaviour can be avoided by calculating the oldest
> global csn among all cluster nodes, so that the oldest xmin on a particular
> node will be held back only when there is some open old transaction on
> another node. It's easy to do from the global snapshot point of view,
> but it's not obvious how to integrate that into postgres_fdw. Probably
> that will require a bi-directional connection between postgres_fdw nodes
> (distributed deadlock detection will also be easy with such a connection).

I don't think holding back xmin is a very good strategy.  Maybe it
won't be so bad if and when we get zheap, since only the undo log will
bloat rather than the table.  But as it stands, holding back xmin
means everything bloats and you have to CLUSTER or VACUUM FULL the
table in order to fix it.

If the behavior were really analogous to our existing "snapshot too
old" feature, it wouldn't be so bad.  Old snapshots continue to work
without error so long as they only read unmodified data, and only
error out if they hit modified pages.  SERIALIZABLE works according to
a similar principle: it worries about data that is written by one
transaction and read by another, but if there's a portion of the data
that is only read and not written, or at least not written by any
transactions that were active around the same time, then it's fine.
While the details aren't really clear to me, I'm inclined to think
that any solution we adopt for global snapshots ought to leverage this
same principle in some way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Global snapshots

2018-05-08 Thread Stas Kelvich


> On 7 May 2018, at 20:04, Robert Haas  wrote:
> 
> But what happens if a transaction starts on node A at time T0 but
> first touches node B at a much later time T1, such that T1 - T0 >
> global_snapshot_defer_time?
> 

Such a transaction will get a "global snapshot too old" error.

In principle such behaviour can be avoided by calculating the oldest
global csn among all cluster nodes, so that the oldest xmin on a particular
node will be held back only when there is some open old transaction on
another node. It's easy to do from the global snapshot point of view,
but it's not obvious how to integrate that into postgres_fdw. Probably
that will require a bi-directional connection between postgres_fdw nodes
(distributed deadlock detection will also be easy with such a connection).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-07 Thread Robert Haas
On Sun, May 6, 2018 at 6:22 AM, Stas Kelvich  wrote:
> Also, each second GetSnapshotData writes globalxmin (as it was before
> procArray->global_snapshot_xmin was taken into account) into a circular
> buffer with a size equal to the global_snapshot_defer_time value. That is
> more or less the same thing as with the Snapshot Too Old feature, but
> with a bucket size of 1 second instead of 1 minute.
> procArray->global_snapshot_xmin is always set to the oldest
> value in the circular buffer.
>
> This way the xmin calculation always gives the value that globalxmin had
> global_snapshot_defer_time seconds ago, and we have a mapping from time
> (or GlobalCSN) to globalxmin for each second in this range. So when
> some backend imports a global snapshot with some GlobalCSN, that
> GlobalCSN is mapped to an xmin and this xmin is set as Proc->xmin.

But what happens if a transaction starts on node A at time T0 but
first touches node B at a much later time T1, such that T1 - T0 >
global_snapshot_defer_time?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Global snapshots

2018-05-06 Thread Stas Kelvich


> On 4 May 2018, at 22:09, Robert Haas  wrote:
> 
> So, is the idea that we'll definitely find out about any remote
> transactions within 30 seconds, and then after we know about remote
> transactions, we'll hold back OldestXmin some other way?

Yes, kind of. There is a procArray->global_snapshot_xmin variable which
acts as a barrier to xmin calculations in GetOldestXmin and
GetSnapshotData, when set.

Also, each second GetSnapshotData writes globalxmin (as it was before
procArray->global_snapshot_xmin was taken into account) into a circular
buffer with a size equal to the global_snapshot_defer_time value. That is
more or less the same thing as with the Snapshot Too Old feature, but with
a bucket size of 1 second instead of 1 minute.
procArray->global_snapshot_xmin is always set to the oldest
value in the circular buffer.

This way the xmin calculation always gives the value that globalxmin had
global_snapshot_defer_time seconds ago, and we have a mapping from time (or
GlobalCSN) to globalxmin for each second in this range. So when
some backend imports a global snapshot with some GlobalCSN, that
GlobalCSN is mapped to an xmin and this xmin is set as Proc->xmin (a
sketch of this buffer is below).
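
In pseudo-C the buffer looks about like this (illustrative names, not
the actual patch code):

    #define DEFER_TIME 30   /* global_snapshot_defer_time, in seconds */

    static TransactionId deferred_xmin[DEFER_TIME]; /* one slot per second */
    static int           cur_slot;

    /* Called about once a second from GetSnapshotData(). */
    static void
    record_global_xmin(TransactionId globalxmin)
    {
        cur_slot = (cur_slot + 1) % DEFER_TIME;
        deferred_xmin[cur_slot] = globalxmin;
        /* the barrier is pinned to the oldest slot, i.e.
         * procArray->global_snapshot_xmin =
         *     deferred_xmin[(cur_slot + 1) % DEFER_TIME] */
    }

    /* Importing a snapshot: map its CSN (a timestamp) back to an xmin. */
    static TransactionId
    map_csn_to_xmin(int seconds_ago)
    {
        if (seconds_ago >= DEFER_TIME)
            return InvalidTransactionId;   /* "global snapshot too old" */
        return deferred_xmin[(cur_slot - seconds_ago + DEFER_TIME) % DEFER_TIME];
    }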

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-04 Thread Robert Haas
On Tue, May 1, 2018 at 5:02 PM, Stas Kelvich  wrote:
> Yes, that is totally possible. On both systems you need to:

Cool.

> * set track_global_snapshots='on' -- this will start writing each
>   transaction's commit sequence number to the SLRU.
> * set global_snapshot_defer_time to 30 seconds, for example -- this
>   will delay oldestXmin advancement for the specified amount of time,
>   preserving old tuples.

So, is the idea that we'll definitely find out about any remote
transactions within 30 seconds, and then after we know about remote
transactions, we'll hold back OldestXmin some other way?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Global snapshots

2018-05-03 Thread Stas Kelvich


> On 3 May 2018, at 18:28, Masahiko Sawada  wrote:
> 
> On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich  wrote:
>> 1) To achieve commit atomicity of different nodes an intermediate step is
>>   introduced: at first the running transaction is marked as InDoubt on all
>>   nodes, and only after that each node commits it and stamps it with a given
>>   GlobalCSN. All readers who run into tuples of an InDoubt transaction
>>   should wait until it ends and recheck visibility.
> 
> I'm concerned that a long-running transaction could keep other
> transactions waiting and then the system gets stuck. Can this happen?
> And is there any workaround?

The InDoubt state is set just before commit, so it is a short-lived state.
During transaction execution a global tx looks like an ordinary running
transaction. A failure during 2PC, with the coordinator not being able to
commit/abort this prepared transaction, can result in a situation where
InDoubt tuples will be locked for reading, but in such a situation the
coordinator should be blamed. The same problems will arise if a prepared
transaction holds locks, for example.
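
For clarity, the whole commit path in pseudo-C (all names are invented
for illustration; the real flow lives in 2PC plus the patch's SQL
functions):

    typedef uint64_t GlobalCSN;
    typedef struct GlobalTx GlobalTx;   /* handle for the distributed tx */

    extern void      broadcast_prepare(GlobalTx *tx);
    extern void      broadcast_mark_indoubt(GlobalTx *tx);
    extern GlobalCSN max_of_node_prepare_timestamps(GlobalTx *tx);
    extern void      broadcast_assign_csn(GlobalTx *tx, GlobalCSN csn);
    extern void      broadcast_commit_prepared(GlobalTx *tx);

    static void
    commit_global(GlobalTx *tx)
    {
        GlobalCSN csn;

        broadcast_prepare(tx);          /* readers still see a running tx */

        /*
         * Mark InDoubt on all nodes: from here a reader that hits our
         * tuples waits instead of deciding visibility.  The window covers
         * only a network round-trip plus the commit-record fsync, however
         * large the transaction body was.
         */
        broadcast_mark_indoubt(tx);

        /* Clock-SI style: the commit timestamp is the max over the nodes. */
        csn = max_of_node_prepare_timestamps(tx);

        broadcast_assign_csn(tx, csn);
        broadcast_commit_prepared(tx);  /* InDoubt ends; waiters recheck */
    }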

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-03 Thread Masahiko Sawada
On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich  wrote:
> 1) To achieve commit atomicity of different nodes an intermediate step is
>    introduced: at first the running transaction is marked as InDoubt on all
>    nodes, and only after that each node commits it and stamps it with a given
>    GlobalCSN. All readers who run into tuples of an InDoubt transaction should
>    wait until it ends and recheck visibility.

I'm concerned that a long-running transaction could keep other
transactions waiting and then the system gets stuck. Can this happen?
And is there any workaround?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Global snapshots

2018-05-02 Thread Stas Kelvich


> On 2 May 2018, at 05:58, Peter Eisentraut  wrote:
> 
> On 5/1/18 12:27, Stas Kelvich wrote:
>> Clock-SI is described in [5] and here I provide a small overview, which
>> supposedly should be enough to catch the idea. Assume that each node runs
>> Commit Sequence Number (CSN) based visibility: the database tracks one
>> counter for each transaction start (xid) and another for each transaction
>> commit (csn). In such a setting, a snapshot is just a single number -- a
>> copy of the current CSN at the moment when the snapshot was taken.
>> Visibility rules boil down to checking whether the current tuple's CSN is
>> less than our snapshot's csn. It is also worth mentioning that for the
>> last 5 years there has been an active proposal to switch Postgres to
>> CSN-based visibility [6].
> 
> But that proposal has so far not succeeded.  How are you overcoming the
> reasons for that?

Well, the CSN proposal aims to replace all of the Postgres visibility
machinery with CSNs. That proposal is far more ambitious; here, the
original Postgres visibility, with snapshots being arrays of XIDs, is
preserved. In this patch CSNs are written to the SLRU during commit (in a
way like commit_ts does) but will be read in two cases:

1) When a local transaction faces an XID that is in progress according to
the XIP-based snapshot, its CSN needs to be checked, as it may already be
InDoubt. XIDs that are viewed as committed don't need that check (in [6]
they also need to be checked through the SLRU).

2) If we are in a backend that imported a global snapshot, then only
CSN-based visibility can be used. But that happens only for global
transactions.

So I hope that local transaction performance will be affected only by the
in-progress check, and there are ways to circumvent this check. A sketch
of the resulting check is below.
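
Very roughly, with invented names except XidInMVCCSnapshot() and
TransactionIdDidCommit(), which exist today (the Snapshot fields are
schematic too):

    static bool
    XidVisibleInSnapshot(TransactionId xid, Snapshot snap)
    {
        if (SnapshotIsGlobal(snap))
        {
            /* Case 2: imported global snapshot -- CSN-only rules. */
            GlobalCSN csn = CSNLogGetCSN(xid);

            while (csn == InDoubtGlobalCSN)
                csn = WaitForXactEndAndRereadCSN(xid);  /* wait, recheck */
            return GlobalCSNIsNormal(csn) && csn <= snap->global_csn;
        }

        /* Local transaction: the usual XIP-based check first. */
        if (!XidInMVCCSnapshot(xid, snap))
            return TransactionIdDidCommit(xid);  /* no CSN lookup at all */

        /*
         * Case 1: "in progress" according to the XIP snapshot, but it may
         * already be InDoubt; wait for the outcome and recheck instead of
         * deciding right away.
         */
        if (CSNLogGetCSN(xid) == InDoubtGlobalCSN)
        {
            WaitForXactEndAndRereadCSN(xid);
            return XidVisibleInSnapshot(xid, snap);  /* recheck once */
        }
        return false;
    }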

Also, all this behaviour is optional and can be switched off by not
enabling track_global_snapshots.


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-01 Thread Peter Eisentraut
On 5/1/18 12:27, Stas Kelvich wrote:
> Clock-SI is described in [5] and here I provide a small overview, which
> supposedly should be enough to catch the idea. Assume that each node runs
> Commit Sequence Number (CSN) based visibility: the database tracks one
> counter for each transaction start (xid) and another for each transaction
> commit (csn). In such a setting, a snapshot is just a single number -- a
> copy of the current CSN at the moment when the snapshot was taken.
> Visibility rules boil down to checking whether the current tuple's CSN is
> less than our snapshot's csn. It is also worth mentioning that for the
> last 5 years there has been an active proposal to switch Postgres to
> CSN-based visibility [6].

But that proposal has so far not succeeded.  How are you overcoming the
reasons for that?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Global snapshots

2018-05-01 Thread Stas Kelvich


> On 1 May 2018, at 22:43, Robert Haas  wrote:
> 
> I'm concerned about the provisioning aspect of this problem.  Suppose
> I have two existing database systems with, perhaps, wildly different
> XID counters.  On a certain date, I want to start using this system.

Yes, that is totally possible. On both systems you need to:

* set track_global_snapshots='on' -- this will start writing each
  transaction's commit sequence number to the SLRU.
* set global_snapshot_defer_time to 30 seconds, for example -- this
  will delay oldestXmin advancement for the specified amount of time,
  preserving old tuples.
* restart the database
* optionally enable NTPd if it wasn't enabled.

It is also possible to avoid the restart, but that will require some
careful work: after enabling track_global_snapshots it will be safe to
start global transactions only when all concurrently running transactions
have finished. A more or less equivalent thing happens during logical slot
creation.

> Or conversely, I have two systems that are bonded together using this
> system from the beginning, and then, as of a certain date, I want to
> break them apart into two standalone systems.  In your proposed
> design, are things like this possible?  Can you describe the setup
> involved?

Well, they are not actually "bonded" in any persistent way. If there are
no distributed transactions, there will not be any logical or physical
connection between those nodes.

And returning to your original concern about "wildly different XID
counters", I want to emphasise that the only thing that floats between
nodes is GlobalCSNs, during the start and commit of a distributed
transaction. And a GlobalCSN is actually a timestamp of commit, a real
one, from clock_gettime() (see the sketch below). And clock time is
supposedly more or less the same on different nodes under normal
conditions. But correctness here will not depend on the degree of clock
synchronisation, only the performance of global transactions will.
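
A minimal sketch of that, with GlobalCSN as nanoseconds from the
realtime clock (illustrative, not the patch code):

    #include <stdint.h>
    #include <time.h>

    typedef uint64_t GlobalCSN;

    static GlobalCSN
    generate_global_csn(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_REALTIME, &ts);
        /*
         * Nanoseconds since the epoch: comparable across nodes as long as
         * clocks are roughly in sync (e.g. via NTP); skew hurts only the
         * latency of global transactions, not correctness.
         */
        return (GlobalCSN) ts.tv_sec * 1000000000ULL + (GlobalCSN) ts.tv_nsec;
    }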

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Global snapshots

2018-05-01 Thread Robert Haas
On Tue, May 1, 2018 at 12:27 PM, Stas Kelvich  wrote:
> Proposed here is a set of patches that allows achieving proper snapshot
> isolation semantics in the case of cross-node transactions. Provided is
> infrastructure to synchronize snapshots between different Postgres
> instances and a way to atomically commit such a transaction with respect
> to other concurrent global and local transactions. Such global
> transactions can be coordinated outside of Postgres by using the provided
> SQL functions or through postgres_fdw, which makes use of these functions
> on remote nodes transparently.

I'm concerned about the provisioning aspect of this problem.  Suppose
I have two existing database systems with, perhaps, wildly different
XID counters.  On a certain date, I want to start using this system.
Or conversely, I have two systems that are bonded together using this
system from the beginning, and then, as of a certain date, I want to
break them apart into two standalone systems.  In your proposed
design, are things like this possible?  Can you describe the setup
involved?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company