Patch 0001: + errmsg("could not find free replication slot"),
Suggest: all replication slots are in use + elog(ERROR, "cannot aquire a slot while another slot has been acquired"); Suggest: this backend has already acquired a replication slot Or demote it to Assert(). I'm not really sure why this needs to be checked in non-assert builds. I also wonder if we should use the terminology "attach" instead of "acquire"; that pairs more naturally with "release". Then the message, if we want more than an assert, might be "this backend is already attached to a replication slot". + if (slot == NULL) + { + LWLockRelease(ReplicationSlotCtlLock); + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_OBJECT), + errmsg("could not find replication slot \"%s\"", name))); + } The error will release the LWLock anyway; I'd get rid of the manual LWLockRelease, and the braces. Similarly in ReplicationSlotDrop. + /* acquire spinlock so we can test and set ->active safely */ + SpinLockAcquire(&slot->mutex); + + if (slot->active) + { + SpinLockRelease(&slot->mutex); + LWLockRelease(ReplicationSlotCtlLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("slot \"%s\" already active", name))); + } + + /* we definitely have the slot, no errors possible anymore */ + slot->active = true; + MyReplicationSlot = slot; + SpinLockRelease(&slot->mutex); This doesn't need the LWLockRelease either. It does need the SpinLockRelease, but as I think I noted previously, the right way to write this is probably: SpinLockAcquire(&slot->mutex); was_active = slot->active; slot->active = true; SpinLockRelease(&slot->mutex); if (was_active) ereport(). MyReplicatonSlot = slot. ReplicationSlotsComputeRequiredXmin still acquires ProcArrayLock, and the comment "Provide interlock against concurrent recomputations" doesn't seem adequate to me. I guess the idea here is that we regard ProcArrayLock as protecting ReplicationSlotCtl->catalog_xmin and ReplicationSlotCtl->data_xmin, but if that's the idea then we only need to hold the lock during the time when we actually update those values, not the loop where we compute them. Also, if that's the design, maybe they should be part of PROC_HDR *ProcGlobal rather than here. It seems weird to have some of the values protected by ProcArrayLock live in a completely different data structure managed almost entirely by some other part of the system. It's pretty evident that what's currently patch #4 (only peg the xmin horizon for catalog tables during logical decoding) needs to become patch #1, because it doesn't make any sense to apply this before we do that. I'm still not 100% confident in that approach, but maybe I'd better try to look at it RSN and get confident, because too much of the rest of what you've got here hangs on that to proceed without it. Or to put all that another way, if for any reason we decide that the separate catalog xmin stuff is not viable, the rest of this is going to need a lot of rework, so we'd better sort that now rather than later. With respect to the synchronize-slots-to-disk stuff we're arguing about on the other thread, I think the basic design problem here is that you assume that you can change stuff in memory and then change stuff on disk, without either set of changes being atomic. What I think you need to do is making atomic actions on disk correspond to atomic state changes in memory. IOW, suppose that creating a slot consists of two steps: mkdir() + fsync(). Then I think what you need to do is - do the mkdir(). If it errors out, fine. If it succeeds, the mark the slot half-created. This is just an assignment so it can done immediately after you learn that mkdir() worked with no risk of an intervening failure. Then, try to fsync(). If it errors out, the slot will get left in the half-created state. If it works, then immediately mark the slot as fully created. Now, when the next guy comes along and looks at the slot, he can tell what he needs to do. Specifically, if the slot is half-created, and he wants to do anything other than remove it, he's got to fsync() it first, and if that errors out, so be it. The next access to the slot will merely find it still half-created and simply try the fsync() yet again. Alternatively, since nearly everything we're trying to do here is a two-step operation - do something and then fsync - maybe we have a more generic fsync-pending flag, and each slot operation checks that and retries the fsync() if it's set. But it might be situation dependent which thing we need to fsync, since there are multiple files involved. Broadly, what you're trying to accomplish here is to have something that is crash-safe but without relying on WAL, so that it can work on standbys. If making things crash-safe without WAL were easy to do, we probably wouldn't have WAL at all, so it stands to reason that there are going to be some difficulties here. Making it work reliably is going to require either inventing some special-purpose type of write-ahead logging specific to this particular need, or some analogue of shadow paging, or making sure that every intermediate step is well-defined and recoverable. Right now, you're on that last path, and it's by no means obvious to me that that's the wrong place to be, but I think there's some work left to be done to get it there. Calling a slot "old" or "new" looks liable to cause problems. Maybe change those names to contain a character not allowed in a slot name, if we're going to keep doing it that way. I wonder if it wouldn't be better to get rid of the subdirectories for the individual slots, and just have a file pg_replslot/$SLOTNAME, or not. I know there are later patches that need subdirectories for their own private data, but they could just create pg_replslot/$SLOTNAME.dir and put whatever in it they like, without really implicating this code that much. The advantage of that is that there would be fewer intermediate states. The slot exists if the file exists, and not if it doesn't. You still need half-alive and half-dead until the fsync finishes, but you don't need to worry about tracking both the state of the directory and the state of the file. On startup we fsync the containing directory and all of the slot files we find inside it and refuse to start up if that fails, but once running filesystem failures only prevent changes; they don't kill the system. Patch 0004: I'm not very confident that PROC_IN_LOGICAL_DECODING is the right way to go here. It seems to me that excluding the xmins of backends with slots from globalxmin consideration so that we can fold the same xmin in by some other mechanism is kind of strange. How about letting the xmins of such backends affect the computation as normal, and then having one extra xmin that gets folded in that represents the minima of the xmin of unconnected slots? When a backend with a slot disconnects, an on_shmem_exit hook must move that value backwards if it follows MyPgXact->xmin. That's all for now... ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers