On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:
Hi All,

Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised in this mail thread before. Let me know, if I have left something.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to make it FDW-independent.

Wow, this is going to be a lot of new infrastructure. It's going to need good documentation, explaining how two-phase commit works in general, how it's implemented, how to monitor it, etc. It's important to explain all the possible failure scenarios where you're left with in-doubt transactions, and how the DBA can resolve them.

Since we're building a Transaction Manager into PostgreSQL, please put a lot of thought on what kind of APIs it provides to the rest of the system. APIs for monitoring it, configuring it, etc. And how an extension could participate in a transaction, without necessarily being an FDW.

Regarding the configuration, there are many different behaviours that an FDW could implement:

1. The FDW is read-only. Commit/abort behaviour is moot.
2. Transactions are not supported. All updates happen immediately regardless of the local transaction.
3. Transactions are supported, but two-phase commit is not. There are three different ways we can use the remote transactions in that case:
3.1. Commit the remote transaction before the local transaction.
3.2. Commit the remote transaction after the local transaction.
3.3. As long as there is only one such FDW involved, we can still do safe two-phase commit using so-called Last Resource Optimization.
4. Full two-phase commit support

We don't necessarily have to support all of that, but let's keep all these cases in mind when we design how FDWs are configured. There's more to it than "does it support 2PC".
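To keep all of those behaviours in view, the configuration could eventually be modelled as a capability level, something like the sketch below. The enum name and values are purely illustrative, not anything from the patch:

/*
 * Hypothetical capability levels corresponding to the cases above.
 */
typedef enum FdwXactCapability
{
    FDW_XACT_READ_ONLY,         /* 1. no writes; commit/abort is moot */
    FDW_XACT_NONE,              /* 2. updates apply immediately */
    FDW_XACT_ONE_PHASE_BEFORE,  /* 3.1. commit remote before local */
    FDW_XACT_ONE_PHASE_AFTER,   /* 3.2. commit remote after local */
    FDW_XACT_LAST_RESOURCE,     /* 3.3. one-phase server ordered last (LRO) */
    FDW_XACT_TWO_PHASE          /* 4. full PREPARE TRANSACTION support */
} FdwXactCapability;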

A. Steps during transaction processing
------------------------------------------------

1. When an FDW connects to a foreign server and starts a transaction, it
registers that server with a boolean flag indicating whether that server is
capable of participating in a two-phase commit. In the patch this is
implemented by the function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if there is at least one foreign
server incapable of 2PC in a multi-server transaction. This error is thrown
as early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just updates the information. As of now,
the function is a stub in the patch.
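For concreteness, here is a hypothetical sketch of what that registration step could look like. The signature and the per-backend bookkeeping are my assumptions; the patch's actual function is a stub.

#include "postgres.h"
#include "nodes/pg_list.h"

/* hypothetical per-backend state tracking servers used in this xact */
static List *xact_foreign_servers = NIL;
static bool  xact_all_servers_2pc = true;

void
RegisterXactForeignServer(Oid serverid, bool two_phase_capable)
{
    if (!two_phase_capable)
        xact_all_servers_2pc = false;

    xact_foreign_servers = list_append_unique_oid(xact_foreign_servers,
                                                  serverid);

    /*
     * Fail as early as possible: as soon as a second participant shows
     * up, every participant must be capable of two-phase commit.
     */
    if (!xact_all_servers_2pc && list_length(xact_foreign_servers) > 1)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot commit transaction across foreign servers: "
                        "a participating server does not support two-phase commit")));
}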

Whether a foreign server is capable of 2PC can be
a. an FDW-level decision, e.g. file_fdw is incapable of 2PC as of now, but
it could build that capability, which would then apply to all servers using
file_fdw
b. a decision based on server version, type etc., so the FDW can decide it
by looking at the properties of each server
c. a user decision, where the FDW allows the user to specify it in the form
of a CREATE/ALTER SERVER option. Implemented in the patch.
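As a sketch of option (c): the FDW could read a boolean server option at registration time. The option name "two_phase_commit" below is just an assumption for illustration, not necessarily what the patch uses.

#include "postgres.h"
#include "commands/defrem.h"
#include "foreign/foreign.h"

/* e.g. CREATE SERVER loopback ... OPTIONS (two_phase_commit 'true') */
static bool
server_supports_two_phase(Oid serverid)
{
    ForeignServer *server = GetForeignServer(serverid);
    ListCell   *lc;

    foreach(lc, server->options)
    {
        DefElem    *def = (DefElem *) lfirst(lc);

        if (strcmp(def->defname, "two_phase_commit") == 0)
            return defGetBoolean(def);
    }
    return false;               /* be conservative if the option is absent */
}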

For a transaction involving only a single foreign server, the current code
remains unaltered, as two-phase commit is not needed.

Just to be clear: you also need two-phase commit if the transaction updated anything on the local server and on even one foreign server.
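So the check is roughly the following (a sketch; both arguments are bookkeeping the transaction manager would have to track):

static bool
transaction_needs_two_phase(int n_foreign_write_servers,
                            bool local_xact_wrote_anything)
{
    /* the local server counts as a participant if it wrote anything */
    int nparticipants = n_foreign_write_servers +
                        (local_xact_wrote_anything ? 1 : 0);

    return nparticipants > 1;
}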

D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered the following options for persistent storage:
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to a file whenever
persistence is necessary, e.g. while registering the foreign transaction in
step A.2. Requirements C.1 and C.2 need some SQL interface in the form of
built-in functions or SQL commands.

The patch implements the in-memory foreign transaction table as a fixed
size array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on the number of foreign
prepared transactions that can be maintained at a time. We need separate
locks to synchronize access to the shared memory; the patch uses only a
single LWLock. There is a restriction on the length of the prepared
transaction id (or, more generally, of whatever prepared transaction
information the FDW saves), since everything is saved in fixed-size memory.
We may be able to overcome that restriction by writing this information to
separate files (one file per foreign prepared transaction). We need to take
the same route as 2PC for C.5.
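For reference, the shared-memory layout described above could look roughly like this, modelled on the prepared-transaction table in twophase.c. The names and the id-length limit are illustrative, not the patch's:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/lwlock.h"

#define FDW_XACT_ID_LEN  200    /* fixed size => limit on FDW-supplied id */

typedef struct FdwXactEntry
{
    TransactionId local_xid;    /* local transaction being distributed */
    Oid         serverid;       /* participating foreign server */
    Oid         userid;         /* user mapping used for the connection */
    XLogRecPtr  lsn;            /* WAL record that persisted this entry */
    bool        valid;          /* slot in use? */
    char        fdw_xact_id[FDW_XACT_ID_LEN];   /* FDW's prepared-xact id */
} FdwXactEntry;

typedef struct FdwXactCtlData
{
    LWLock     *lock;           /* the single lock guarding the array */
    int         num_entries;
    FdwXactEntry entries[FLEXIBLE_ARRAY_MEMBER];
} FdwXactCtlData;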

Your current approach with a file that's flushed to disk on every update has a few problems. Firstly, it's not crash safe. Secondly, if you make it crash-safe with fsync(), performance will suffer. You're going to need several fsyncs per commit with 2PC anyway; there's no way around that. But the scalable way to do it is to use the WAL, so that one fsync() can flush more than one update in one operation.

So I think you'll need to do something similar to the pg_twophase files. WAL-log each update, and only flush the file/files to disk on a checkpoint. Perhaps you could use the pg_twophase infrastructure for this directly, by essentially treating every local transaction as a two-phase transaction, with some extra flag to indicate that it's an internally-created one.
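A sketch of what that write path could look like, using the 9.5-style WAL insertion API and the FdwXactEntry struct from the earlier sketch. RM_FDWXACT_ID and XLOG_FDWXACT_INSERT are made-up names; a real implementation would need a new resource manager in rmgrlist.h:

#include "postgres.h"
#include "access/xloginsert.h"

#define XLOG_FDWXACT_INSERT 0x00    /* hypothetical record type */

static void
fdwxact_log_insert(FdwXactEntry *entry)
{
    XLogRecPtr  lsn;

    /* WAL-log the new entry; the state file is only written at checkpoint */
    XLogBeginInsert();
    XLogRegisterData((char *) entry, sizeof(FdwXactEntry));
    lsn = XLogInsert(RM_FDWXACT_ID, XLOG_FDWXACT_INSERT);

    /*
     * A single XLogFlush() rides on WAL group commit, so one fsync() can
     * cover updates from many concurrent backends.
     */
    XLogFlush(lsn);
}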

2. New catalog - This method removes the need for separate mechanisms for
C.1, C.5 and even C.2; synchronization will be taken care of by row locks,
and there will be no limit on the number of foreign transactions or on the
size of the foreign prepared transaction information. But the big problem
with this approach is that the changes to the catalogs are atomic with the
local transaction. If a foreign prepared transaction cannot be aborted
while the local transaction is rolled back, that entry needs to be retained.
But since the local transaction is aborting, the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this
by simulating an autonomous transaction through a background worker (which
could also act as the resolver). But the amount of communication and
synchronization required might affect performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid, ignoring MVCC. Not sure how well that would work.
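If that route were taken, it could perhaps use the existing HEAP_INSERT_FROZEN option, as in the sketch below; the row is then visible to everyone and survives an abort of the inserting transaction. fdwxact_rel is a hypothetical catalog relation, and whether this interacts safely with rollbacks and VACUUM is exactly the open question:

#include "postgres.h"
#include "access/heapam.h"
#include "access/xact.h"
#include "catalog/indexing.h"

static void
fdwxact_insert_frozen(Relation fdwxact_rel, HeapTuple tup)
{
    /* insert with xmin pre-frozen, ignoring MVCC for this row */
    heap_insert(fdwxact_rel, tup, GetCurrentCommandId(true),
                HEAP_INSERT_FROZEN, NULL);
    CatalogUpdateIndexes(fdwxact_rel, tup);
}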

3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records cannot be used for the repeated scans required by the foreign
transaction resolver. Also, WAL replay starts from the last checkpoint, so
not all WAL records are replayed: if a checkpoint happens while a foreign
prepared transaction remains unresolved, the corresponding WAL records will
never be replayed, causing the foreign prepared transaction to remain
unresolved forever without a clue. So WAL alone doesn't seem to be a fit
here.

Right. The pg_twophase files solve that exact same issue.
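The interlock in twophase.c works roughly like this: each entry remembers the LSN of its WAL record, and at every checkpoint, entries older than the redo horizon are written out as fsync'd state files before the checkpoint completes. A hypothetical equivalent, reusing the earlier sketches (CheckPointFdwXacts, FdwXactCtl and fdwxact_write_file are all made-up names):

#include "postgres.h"
#include "access/xlogdefs.h"

extern FdwXactCtlData *FdwXactCtl;                  /* shared table, above */
extern void fdwxact_write_file(FdwXactEntry *entry);   /* write + fsync */

void
CheckPointFdwXacts(XLogRecPtr redo_horizon)
{
    int         i;

    for (i = 0; i < FdwXactCtl->num_entries; i++)
    {
        FdwXactEntry *entry = &FdwXactCtl->entries[i];

        /*
         * Anything WAL-logged before the redo horizon won't be replayed
         * after this checkpoint, so it must exist as a file on disk,
         * like the pg_twophase state files.
         */
        if (entry->valid && entry->lsn <= redo_horizon)
            fdwxact_write_file(entry);
    }
}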

There is clearly a lot of work to do here. I'm marking this as Returned with Feedback in the commitfest; I don't think more review is going to be helpful at this point.

- Heikki


