Added to 2015-06 commitfest to attract some reviews and comments. On Tue, Feb 17, 2015 at 2:56 PM, Ashutosh Bapat < ashutosh.ba...@enterprisedb.com> wrote:
> Hi All, > > Here are the steps and infrastructure for achieving atomic commits across > multiple foreign servers. I have tried to address most of the concerns > raised in this mail thread before. Let me know, if I have left something. > Attached is a WIP patch implementing the same for postgres_fdw. I have > tried to make it FDW-independent. > > A. Steps during transaction processing > ------------------------------------------------ > > 1. When an FDW connects to a foreign server and starts a transaction, it > registers that server with a boolean flag indicating whether that server is > capable of participating in a two phase commit. In the patch this is > implemented using function RegisterXactForeignServer(), which raises an > error, thus aborting the transaction, if there is at least one foreign > server incapable of 2PC in a multiserver transaction. This error thrown as > early as possible. If all the foreign servers involved in the transaction > are capable of 2PC, the function just updates the information. As of now, > in the patch the function is in the form of a stub. > > Whether a foreign server is capable of 2PC, can be > a. FDW level decision e.g. file_fdw as of now, is incapable of 2PC but it > can build the capabilities which can be used for all the servers using > file_fdw > b. a decision based on server version type etc. thus FDW can decide that > by looking at the server properties for each server > c. a user decision where the FDW can allow a user to specify it in the > form of CREATE/ALTER SERVER option. Implemented in the patch. > > For a transaction involving only a single foreign server, the current code > remains unaltered as two phase commit is not needed. Rest of the discussion > pertains to a transaction involving more than one foreign servers. > At the commit or abort time, the FDW receives call backs with the > appropriate events. FDW then takes following actions on each event. > > 2. On XACT_EVENT_PRE_COMMIT event, the FDW coins one prepared transaction > id per foreign server involved and saves it along with xid, dbid, foreign > server id and user mapping and foreign transaction status = PREPARING > in-memory. The prepared transaction id can be anything represented as byte > string. Same information is flushed to the disk to survive crashes. This is > implemented in the patch as prepare_foreign_xact(). Persistent and > in-memory storages and their usages are discussed later in the mail. FDW > then prepares the transaction on the foreign server. If this step is > successful, the foreign transaction status is changed to PREPARED. If the > step is unsuccessful, the local transaction is aborted and each FDW will > receive XACT_EVENT_ABORT (discussed later). The updates to the foreign > transaction status need not be flushed to the disk, as they can be inferred > from the status of local transaction. > > 3. If the local transaction is committed, the FDW callback will get > XACT_EVENT_COMMIT event. Foreign transaction status is changed to > COMMITTING. FDW tries to commit the foreign transaction with the prepared > transaction id. If the commit is successful, the foreign transaction entry > is removed. If the commit is unsuccessful because of local/foreign server > crash or network failure, the foreign prepared transaction resolver takes > care of the committing it at later point of time. > > 4. If the local transaction is aborted, the FDW callback will get > XACT_EVENT_ABORT event. At this point, the FDW may or may not have prepared > a transaction on foreign server as per step 1 above. If it has not prepared > the transaction, it simply aborts the transaction on foreign server; a > server crash or network failure doesn't alter the ultimate result in this > case. If FDW has prepared the foreign transaction, it updates the foreign > transaction status as ABORTING and tries to rollback the prepared > transaction. If the rollback is successful, the foreign transaction entry > is removed. If the rollback is not successful, the foreign prepared > transaction resolver takes care of aborting it at later point of time. > > B. Foreign prepared transaction resolver > --------------------------------------------------- > In the patch this is implemented as a built-in function pg_fdw_resolve(). > Ideally the functionality should be run by a background worker process > frequently. > > The resolver looks at each entry and invokes the FDW routine to resolve > the transaction. The FDW routine returns boolean status: true if the > prepared transaction was resolved (committed/aborted), false otherwise. > The resolution is as follows - > 1. If foreign transaction status is COMMITTING or ABORTING, commits or > aborts the prepared transaction resp through the FDW routine. If the > transaction is successfully resolved, it removes the foreign transaction > entry. > 2. Else, it checks if the local transaction was committed or aborted, it > update the foreign transaction status accordingly and takes the action > according to above step 1. > 3. The resolver doesn't touch entries created by in-progress local > transactions. > > If server/backend crashes after it has registered the foreign transaction > entry (during step A.1), we will be left with a prepared transaction id, > which was never prepared on the foreign server. Similarly the > server/backend crashes after it has resolved the foreign prepared > transaction but before removing the entry, same situation can arise. FDW > should detect these situations, when foreign server complains about > non-existing prepared transaction ids and consider such foreign > transactions as resolved. > > After looking at all the entries the resolver flushes the entries to the > disk, so as to retain the latest status across shutdown and crash. > > C. Other methods and infrastructure > ------------------------------------------------ > 1. Method to show the current foreign transaction entries (in progress or > waiting to be resolved). Implemented as function pg_fdw_xact() in the patch. > 2. Method to drop foreign transaction entries in case they are resolved by > user/DBA themselves. Not implemented in the patch. > 3. Method to prevent altering or dropping foreign server and user mapping > used to prepare the foreign transaction till the later gets resolved. Not > implemented in the patch. While altering or dropping the foreign server or > user mapping, that portion of the code needs to check if there exists an > foreign transaction entry depending upon the foreign server or user mapping > and should error out. > 4. The information about the xid needs to be available till it is decided > whether to commit or abort the foreign transaction and that decision is > persisted. That should put some constraint on the xid wraparound or oldest > active transaction. Not implemented in the patch. > 5. Method to propagate the foreign transaction information to the slave. > > D. Persistent and in-memory storage considerations > -------------------------------------------------------------------- > I considered following options for persistent storage > 1. in-memory table and file(s) - The foreign transaction entries are saved > and manipulated in shared memory. They are written to file whenever > persistence is necessary e.g. while registering the foreign transaction in > step A.2. Requirements C.1, C.2 need some SQL interface in the form of > built-in functions or SQL commands. > > The patch implements the in-memory foreign transaction table as a fixed > size array of foreign transaction entries (similar to prepared transaction > entries in twophase.c). This puts a restriction on number of foreign > prepared transactions that need to be maintained at a time. We need > separate locks to syncronize the access to the shared memory; the patch > uses only a single LW lock. There is restriction on the length of prepared > transaction id (or prepared transaction information saved by FDW to be > general), since everything is being saved in fixed size memory. We may be > able to overcome that restriction by writing this information to separate > files (one file per foreign prepared transaction). We need to take the same > route as 2PC for C.5. > > 2. New catalog - This method takes out the need to have separate method > for C1, C5 and even C2, also the synchronization will be taken care of by > row locks, there will be no limit on the number of foreign transactions as > well as the size of foreign prepared transaction information. But big > problem with this approach is that, the changes to the catalogs are atomic > with the local transaction. If a foreign prepared transaction can not be > aborted while the local transaction is rolled back, that entry needs to > retained. But since the local transaction is aborting the corresponding > catalog entry would become invisible and thus unavailable to the resolver > (alas! we do not have autonomous transaction support). We may be able to > overcome this, by simulating autonomous transaction through a background > worker (which can also act as a resolver). But the amount of communication > and synchronization, might affect the performance. > > A mixed approach where the backend shifts the entries from storage in > approach 1 to catalog, thus lifting the constraints on size is possible, > but is very complicated. > > Any other ideas to use catalog table as the persistent storage here? Does > anybody think, catalog table is a viable option? > > 3. WAL records - Since the algorithm follows "write ahead of action", WAL > seems to be a possible way to persist the foreign transaction entries. But > WAL records can not be used for repeated scan as is required by the foreign > transaction resolver. Also, replaying WALs is controlled by checkpoint, so > not all WALs are replayed. If a checkpoint happens after a foreign prepared > transaction remains resolved, corresponding WALs will never be replayed, > thus causing the foreign prepared transaction to remain unresolved forever > without a clue. So, WALs alone don't seem to be a fit here. > > The algorithms rely on the FDWs to take right steps to the large extent, > rather than controlling each step explicitly. It expects the FDWs to take > the right steps for each event and call the right functions to manipulate > foreign transaction entries. It does not ensure the correctness of these > steps, by say examining the foreign transaction entries in response to each > event or by making the callback return the information and manipulate the > entries within the core. I am willing to go the stricter but more intrusive > route if the others also think that way. Otherwise, the current approach is > less intrusive and I am fine with that too. > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company