On Sat, 18 Jun 2005, Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
Can we figure out another way to solve the race condition? Would it
in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
considering that it usually wouldn't be held for long, since usually the
checkpoint would have very little work to do?

If you're concerned about throughput of 2PC xacts then we can't sit on
the TwoPhaseStateLock while doing I/O; that will block both preparation
and committal of all 2PC xacts for a pretty long period in CPU terms.

Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE
record, which we will know before we are ready to mark the gxact
"valid".

2. When CheckPointTwoPhase runs (which we'll put near the end of the
checkpoint sequence), the only gxacts that need to be fsync'd are those
that are marked valid and have a PREPARE WAL location older than the
checkpoint's redo horizon (anything newer will be replayed from WAL on
crash, so it doesn't need fsync to complete the checkpoint).  If you're
right that the lifespan of a state file is often shorter than the time
needed for a checkpoint, this wins big.  In any case we'll never have to
fsync state files that disappear before the next checkpoint.

3. One way to handle CheckPointTwoPhase is:

* At start, take TwoPhaseStateLock (can be in shared mode) for just long
enough to scan the gxact list and make a list of the XID of things that
need fsync per above rule.

* Without the lock, try to open and fsync each item in the list.
        Success: remove from list
        ENOENT failure on open: add to list of not-there failures
        Any other failure: ereport(ERROR)

* If the failure list is not empty, again take TwoPhaseStateLock in
shared mode, and check that each of the failures is now gone (or at
least marked invalid); if so it's OK, otherwise ereport the ENOENT
error.
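The three bullets of step 3 can be modelled roughly as follows. This is a Python sketch for illustration only, not PostgreSQL code: the `GXact` class, `state_file_path`, the one-file-per-XID naming scheme, and the use of a plain mutex in place of a shared-mode TwoPhaseStateLock are all assumptions made for the example. An unexpected `OSError` propagating out of the function stands in for ereport(ERROR).

```python
import os
import tempfile
import threading
from dataclasses import dataclass


@dataclass
class GXact:
    xid: int
    valid: bool
    prepare_lsn: int  # WAL offset of the PREPARE record (step 1)


def state_file_path(state_dir: str, xid: int) -> str:
    # Hypothetical naming scheme: one state file per XID, 8 hex digits.
    return os.path.join(state_dir, f"{xid:08X}")


def checkpoint_two_phase(gxacts: dict, lock: threading.Lock,
                         redo_lsn: int, state_dir: str):
    """Model of step 3; returns (fsynced XIDs, XIDs whose file had vanished)."""
    # Bullet 1: hold the lock only long enough to list the XIDs that are
    # valid and whose PREPARE record is older than the redo horizon.
    with lock:
        todo = [g.xid for g in gxacts.values()
                if g.valid and g.prepare_lsn < redo_lsn]

    # Bullet 2: do the I/O without holding the lock.
    synced, missing = [], []
    for xid in todo:
        try:
            fd = os.open(state_file_path(state_dir, xid), os.O_RDWR)
        except FileNotFoundError:
            missing.append(xid)  # ENOENT: decide in the recheck below
            continue
        try:
            os.fsync(fd)  # any other OSError propagates to the caller
        finally:
            os.close(fd)
        synced.append(xid)

    # Bullet 3: a missing file is fine only if the gxact is now gone or
    # no longer marked valid; otherwise report the ENOENT.
    if missing:
        with lock:
            for xid in missing:
                g = gxacts.get(xid)
                if g is not None and g.valid:
                    raise FileNotFoundError(state_file_path(state_dir, xid))
    return synced, missing
```

The point of the structure is that the lock is taken twice for short in-memory scans, and never across open or fsync.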

In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is marked as valid only after the PREPARE record is written to WAL. If a checkpoint runs after the WAL record is written but before the gxact is marked as valid, the state file doesn't get fsynced. Right?

Otherwise, looks good to me.

Another possibility is to further extend the locking protocol for gxacts
so that the checkpointer can lock just the item it is fsyncing (which is
not possible at the moment because the checkpointer hasn't got an XID,
but probably we could think of another approach).  But that would
certainly delay attempts to commit the item being fsync'd, whereas the
above approach might not have to do so, depending on the filesystem
implementation.

The above sketch is much better.

Now there's a small problem with this approach, which is that we cannot
store the PREPARE WAL record location in the state files, since the
state file has to be completely computed before writing the WAL record.
However, we don't really need to do that: during recovery of a prepared
xact we know the thing has been fsynced (either originally, or when we
rewrote it during the WAL recovery sequence --- we can force an
immediate fsync in that one case).  So we can just put zero, or maybe
better the current end-of-WAL location, into the reconstructed gxact in
memory.

This reminds me of something. What should we do about XID wraparound and prepared transactions? Should we have some mechanism to freeze prepared transactions, like heap tuples? At a minimum, I think we should issue a warning if the XID counter approaches the oldest prepared transaction.

A transaction shouldn't live that long in normal use, but I can imagine an orphaned transaction sitting there for years if it doesn't hold any locks, etc., that bother other applications.
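As a rough illustration of the kind of warning check meant here: XIDs are 32-bit counters compared modulo 2^32, so the distance from the oldest prepared XID to the next XID to be assigned can be computed with modular arithmetic. The sketch below is Python rather than backend code; the function names and the `WARN_AGE` threshold are hypothetical, and it ignores the reserved special XIDs.

```python
XID_MODULUS = 1 << 32   # XIDs are 32-bit and wrap around
WARN_AGE = 1 << 30      # hypothetical threshold, well short of 2^31


def xid_age(next_xid: int, oldest_prepared_xid: int) -> int:
    """Number of XIDs assigned since the oldest prepared transaction."""
    return (next_xid - oldest_prepared_xid) % XID_MODULUS


def should_warn(next_xid: int, oldest_prepared_xid: int,
                warn_age: int = WARN_AGE) -> bool:
    # Warn once the counter has advanced warn_age XIDs past the oldest
    # prepared transaction, i.e. before its XID could appear to be
    # "in the future" at a distance of 2^31.
    return xid_age(next_xid, oldest_prepared_xid) >= warn_age
```

For example, `xid_age(5, 4294967290)` is 11: the counter has wrapped past zero, but only eleven XIDs have been consumed since the prepared transaction started.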

I don't think we should implement heuristic commit/rollback, though. That creates a whole new class of problems.

- Heikki
