On 07/16/2018 08:09 PM, Robert Haas wrote:
> On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> I'm not sure I understand. Are you suggesting the process might get killed
>> or something, thanks to the CHECK_FOR_INTERRUPTS() call?
> 
> Yes.  CHECK_FOR_INTERRUPTS() can certainly lead to a non-local
> transfer of control.
> 
>> But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's
>> true there are a few places where we do != NULL checks on the result without
>> holding any lock, but I don't see why that would be a problem? And before
>> actually inspecting the contents, the code always does
>> LockHashPartitionLockByProc.
> 
> I think at least some of those cases are a problem.  See below...
> 
>> I don't follow. How could one process put another process into a decoding
>> group? I don't think that's possible.
> 
> Isn't that exactly what AssignDecodeGroupLeader() is doing?  It looks
> up the process that currently has that XID, then turns that process
> into a decode group leader.  Then after that function returns, the
> caller adds itself to the decode group as well.  So it seems entirely
> possible for somebody to swing the decodeGroupLeader pointer for a
> PGPROC from NULL to some other value at an arbitrary point in time.
> 

Oh, right, I forgot the patch also adds the leader into the group, for
some reason (I agree it's unclear why that would be necessary, as you
pointed out later).

But all this is happening while holding the partition lock (in exclusive
mode). And the decoding backends do synchronize on the lock correctly
(although, man, the rechecks are confusing ...).

But now I see ProcKill accesses decodeGroupLeader in multiple places,
and only the first one is protected by the lock, for some reason
(interestingly enough, the one in the lockGroupLeader block). Is that
what you mean?
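
To be concrete, what I'd expect every such access to look like is
roughly this (just a sketch of the pattern the lockGroupLeader block
uses, not the patch's actual code):

    PGPROC     *leader = proc->decodeGroupLeader;

    if (leader != NULL)
    {
        LWLock     *leader_lwlock = LockHashPartitionLockByProc(leader);

        LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);

        /* recheck, the pointer may have changed before we got the lock */
        if (proc->decodeGroupLeader == leader)
        {
            /* ... detach the process from the decode group ... */
        }

        LWLockRelease(leader_lwlock);
    }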

FWIW I suspect the ProcKill part is borked due to an incorrectly
resolved merge conflict or something, per my initial response from today.

>> I'm not sure about the 'unsalvageable' part, but it needs more work, that's
>> for sure. Unfortunately, all previous attempts to make this work in various
>> other ways failed (see past discussions in this thread), so this is the only
>> approach left :-( So let's see if we can make it work.
> 
> I think that's probably not going to work out, but of course it's up 
> to you how you want to spend your time!
> 

Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
also have patches that require the capability to decode in-progress
transactions.

> After thinking about it a bit more, if you want to try to stick with
> this design, I don't think that this decode group leader/members thing
> has much to recommend it.  In the case of parallel query, the point of
> the lock group stuff is to treat all of those processes as one for
> purposes of heavyweight lock acquisition.  There's no similar need
> here, so the design that makes sure the "leader" is in the list of
> processes that are members of the "group" is, AFAICS, just wasted
> code.  All you really need is a list of processes hung off of the
> PGPROC that must abort before the leader is allowed to abort; the
> leader itself doesn't need to be in the list, and there's no need to
> consider it as a "group".  It's just a list of waiters.
> 

But the way I understand it, it pretty much *is* a list of waiters,
along with a couple of flags to allow the processes to notify the other
side about lock/unlock/abort. It does resemble the lock groups, but I
don't think it has the same goals.

The thing is that the lock/unlock happens for each decoded change
independently, and it'd be silly to modify the list all the time, so
instead it just sets the decodeLocked flag to true/false. Similarly,
when the leader decides to abort, it marks decodeAbortPending and waits
for the decoding backends to complete.
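
Roughly like this, if I read the patch right (a simplified sketch of my
interpretation, assuming the flags are protected by the same partition
lock):

    /* decoding backend, around each decoded change */
    LWLock     *leader_lwlock
        = LockHashPartitionLockByProc(MyProc->decodeGroupLeader);

    LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
    MyProc->decodeLocked = true;
    LWLockRelease(leader_lwlock);

    /* ... process the decoded change ... */

    LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
    MyProc->decodeLocked = false;
    LWLockRelease(leader_lwlock);

and on the other side:

    /* leader, deciding to abort */
    LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
    MyProc->decodeAbortPending = true;
    LWLockRelease(leader_lwlock);

    /* ... then wait until no member has decodeLocked == true ... */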

Of course, that's my understanding/interpretation, and perhaps Nikhil,
as the patch author, has a better explanation.


> That having been said, I still don't see how that's really going to
> work.  Just to take one example, suppose that the leader is trying to
> ERROR out, and the decoding workers are blocked waiting for a lock
> held by the leader.  The system has no way of detecting this deadlock
> and resolving it automatically, which certainly seems unacceptable.
> The only way that's going to work is if the leader waits for the
> worker by trying to acquire a lock held by the worker.  Then the
> deadlock detector would know to abort some transaction.  But that
> doesn't really work either - the deadlock was created by the
> foreground process trying to abort, and if the deadlock detector
> chooses that process as its victim, what then?  We're already trying
> to abort, and the abort code isn't supposed to throw further errors,
> or fail in any way, lest we break all kinds of other things.  Not to
> mention the fact that running the deadlock detector in the abort path
> isn't really safe to begin with, again because we can't throw errors
> when we're already in an abort path.
> 

Fair point, not sure. I'll leave this up to Nikhil.

> If we're only ever talking about decoding prepared transactions, we
> could probably work around all of these problems: have the decoding
> process take a heavyweight lock before it begins decoding.  Have a
> process that wants to execute ROLLBACK PREPARED take a conflicting
> heavyweight lock on the same object.  The net effect would be that
> ROLLBACK PREPARED would simply wait for decoding to finish.  That
> might be rather lousy from a latency point of view since the
> transaction could take an arbitrarily long time to decode, but it
> seems safe enough.  Possibly you could also design a mechanism for the
> ROLLBACK PREPARED command to SIGTERM the processes that are blocking
> its lock acquisition, if they are decoding processes.  The difference
> between this and what you the current patch is doing is that nothing
> complex or fragile is happening in the abort pathway itself.  The
> complicated stuff in both the worker and in the main backend happens
> while the transaction is still good and can still be rolled back at
> need.  This kind of approach won't work if you want to decode
> transactions that aren't yet prepared, so if that is the long term
> goal then we need to think harder.  I'm honestly not sure that problem
> has any reasonable solution.  The assumption that a running process
> can abort at any time is deeply baked into many parts of the system
> and for good reasons.  Trying to undo that is going to be like trying
> to push water up a hill.  I think we need to install interlocks in
> such a way that any waiting happens before we enter the abort path,
> not while we're actually trying to perform the abort.  But I don't
> know how to do that for a foreground task that's still actively doing
> stuff.
> 
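
If it was only about prepared transactions, the interlock you describe
does seem simple enough. I suppose it'd look roughly like this (a
sketch only -- SET_LOCKTAG_DECODE_PREPARED is a made-up lock tag, as
the regular XID lock presumably won't do, the prepared xact itself
still holds that one):

    /* decoding process, before starting to decode the prepared xact */
    LOCKTAG     tag;

    SET_LOCKTAG_DECODE_PREPARED(tag, MyDatabaseId, xid);
    LockAcquire(&tag, ShareLock, false, false);

    /* ... decode the transaction ... */

    LockRelease(&tag, ShareLock, false);

and ROLLBACK PREPARED would grab a conflicting lock on the same tag, so
it simply waits until the decoding completes:

    SET_LOCKTAG_DECODE_PREPARED(tag, MyDatabaseId, xid);
    LockAcquire(&tag, AccessExclusiveLock, false, false);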

Unfortunately it's not just for prepared transactions :-( The reason I'm
interested in this capability (decoding in-progress xacts) is that I'd
like to use it to stream large transactions before commit, to reduce
the replication lag caused by limited network bandwidth etc. It's also
needed for things like speculative apply (starting to apply changes
before commit).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
