On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2...@gmail.com> wrote: > > > > FYI, I was able to reproduce this case in debugger. PSA logs showing > > details. > > > > Thanks for reproducing as I was worried about exactly this case. I > have one question related to logs: > > ## > ## ALTER SUBSCRIPTION to REFRESH the publication > > ## This blocks on some latch until the tablesync worker dies, then it > continues > ## > > Did you check which exact latch or lock blocks this? >
I have checked this myself and the command is waiting on the drop of origin till the tablesync worker is finished because replorigin_drop() requires state->acquired_by to be 0 which will only be true once the tablesync worker exits. I think this is the reason you might have noticed that the command can't be finished until the tablesync worker died. So this can't be an interlock between ALTER SUBSCRIPTION .. REFRESH command and tablesync worker and to that end it seems you have below Fixme's in the patch: + * FIXME - Usually this cleanup would be OK, but will not + * always be OK because the logicalrep_worker_stop_at_commit + * only "flags" the worker to be stopped in the near future + * but meanwhile it may still be running. In this case there + * could be a race between the tablesync worker and this code + * to see who will succeed with the tablesync drop (and the + * loser will ERROR). + * + * FIXME - Also, checking the state is also not guaranteed + * correct because state might be TCOPYDONE when we checked + * but has since progressed to SYNDONE + */ + + if (state == SUBREL_STATE_TCOPYDONE) + { I feel this was okay for an earlier code but now we need to stop the tablesync workers before trying to drop the slot as we do in DropSubscription. Now, if we do that then that would fix the race conditions mentioned in Fixme but still, there are few more things I am worried about: (a) What if the launcher again starts the tablesync worker? One idea could be to acquire AccessExclusiveLock on SubscriptionRelationId as we do in DropSubscription which is not a very good idea but I can't think of any other good way. (b) the patch is just checking SUBREL_STATE_TCOPYDONE before dropping the replication slot but the slot could be created even before that (in SUBREL_STATE_DATASYNC state). One idea could be we can try to drop the slot and if we are not able to drop then we can simply continue assuming it didn't exist. One minor comment: 1. + SpinLockAcquire(&MyLogicalRepWorker->relmutex); MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE; MyLogicalRepWorker->relstate_lsn = current_lsn; - Spurious line removal. -- With Regards, Amit Kapila.