On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> You'd need some kind of API that says "pretend I'm waiting for this
>> lock, but don't really wait for it", and you'd need to be darn sure
>> that you removed yourself from the wait queue again before doing any
>> other heavyweight lock manipulation. Do you have specific thoughts on
>> how to implement this?
>
> I've thought some about this, and I think it's a bit easier to not do it
> on the actual lock waitqueues, but teach deadlock.c about that kind of
> blocking.
>
> deadlock.c is far from simple, and at least I don't find the control
> flow to be particularly clear. So it's not easy. It'd be advantageous
> to tackle things at that level because it'd avoid the need to acquire
> locks on some lock's waitqueue when blocking; we're going to do that a
> lot.
>
> But it seems to me that it should be possible to succeed: in
> FindLockCycleRecurse(), in the case that we're not waiting for an actual
> lock (checkProc->links.next == NULL), we can add a case that considers
> the 'blocking parallelism' case. ISTM that that's just a
> FindLockCycleRecurse() call on the process that we're waiting for. We'd
> either have to find the other side's locktag for DEADLOCK_INFO or invent
> another field in there; but that seems like a solvable problem.
I (finally) had some time to look at this today. Initially, it looked
good. Unfortunately, the longer I looked at it, the less convinced I
got that we could solve the problem this way.

The core of the issue is that the Funnel node in the parallel group
leader basically does this:

    while (!local_scan_done || !remote_scan_done)
    {
        attempt a read from each remaining worker's tuple queue,
            blocking only if local_scan_done;
        if (we got a tuple)
            return it;
        else if (there are no remaining workers)
            remote_scan_done = true;

        attempt to produce a tuple just as if we were a worker ourselves;
        if (we got a tuple)
            return it;
        else
            local_scan_done = true;
    }

Imagine that we're doing a parallel sequential scan; each worker claims
one page but goes into the tank before it has returned all of the
tuples on that page. The master reads all the remaining pages but must
now wait for the workers to finish returning the tuples on the pages
they claimed. So what this means is:

1. The master doesn't actually *wait* until the very end of the
parallel phase.

2. When the master does wait, it waits for all of the parallel workers
at once, not for each one individually.

So I don't think anything as simplistic as teaching a blocking
shm_mq_receive() to tip off the deadlock detector that we're waiting
for the process on the other end of that particular queue can ever
work. Because of point #2, that situation never arises.

When I first started thinking about how to fix this, I said, well,
that's no big deal: we can just advertise the whole list of processes
that we're waiting for in shared memory, rather than just the one.
This is a bit tricky, though. Any general API that lets any backend
advertise that it's waiting for an arbitrary subset of the other
backends would require O(n^2) shared memory state. That wouldn't be
completely insane, but it's not great, either.
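To make the shape of the problem concrete, here's a toy Python model of
the funnel loop above. This is emphatically not PostgreSQL code: the
function name, the use of iterators for workers, and the convention
that a worker yielding None is momentarily "in the tank" are all
invented for illustration. The point it demonstrates is exactly #1 and
#2: the leader only reaches a true wait after its own scan is
exhausted, and at that moment it is waiting on the whole set of
remaining workers, not on any one process it could name.

```python
def funnel_scan(workers, local_tuples):
    """Toy model of the Funnel node's read loop (illustrative only).

    workers: list of iterators; yielding None models a worker whose
    tuple queue is momentarily empty. local_tuples: what the leader can
    produce by acting as a worker itself."""
    remaining = {i: w for i, w in enumerate(workers)}
    local = iter(local_tuples)
    local_done = False
    out = []
    while not local_done or remaining:
        progressed = False
        # Non-blocking pass over every remaining worker's queue.
        for i in list(remaining):
            try:
                t = next(remaining[i])
            except StopIteration:
                del remaining[i]          # this worker is finished
                continue
            if t is not None:
                out.append(t)
                progressed = True
        # Fall back to producing a tuple just as if we were a worker.
        if not local_done:
            try:
                out.append(next(local))
                progressed = True
            except StopIteration:
                local_done = True
        if local_done and remaining and not progressed:
            # This is where the real leader would block: on the set of
            # remaining workers collectively, with no single process to
            # advertise as the one it waits for. The toy model just
            # stops instead of blocking.
            break
    return out
```

Running it with one temporarily-stalled worker shows the interleaving:
`funnel_scan([iter([1, None, 2]), iter([3])], [10, 11])` drains both
queues and the local scan without ever blocking, while
`funnel_scan([iter([None, 5])], [])` hits the collective-wait case
immediately because the leader has nothing local left to do.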
For this particular case, we could optimize that down to O(n) by
chaining all of the children of a given parallel group leader into a
linked list whose nodes are inlined in their PGPROCs, but that doesn't
feel very general, because it forecloses the possibility of the
children ever using that API, and I think they might need to. If
nothing else, they might need to advertise that they're blocked on the
master if they are trying to send data, the queue is full, and they
have to wait for the master to drain some of it before they can
proceed.

After thinking about it a bit more, I realized that even if we settle
on some solution to that problem, there's another issue: the wait-edges
created by this system don't really have the same semantics as regular
lock waits. Suppose A is waiting on a lock held by B and B is waiting
on a lock held by A; that's a deadlock. But if A is waiting for B to
write to a tuple queue and B is waiting for A to read from a tuple
queue, that's not a deadlock if the queues in question are one and the
same. If they are different queues, it might be a deadlock, but then
again maybe not. It may be that A is prepared to accept B's message
from one queue, and that upon fully receiving it, it will do some
further work that will lead it to write a tuple into the other queue.
If so, we're OK; if not, it's a deadlock.

I'm not sure whether you'll want to argue that that is an implausible
scenario, but I'm not too sure it is. The worker could be saying "hey,
I need some additional piece of your backend-local state in order to
finish this computation", and the master could then provide it. I
don't have any plans like that, but it's been suggested previously by
others, so it's not an obviously nonsensical thing to want to do.
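The same-queue exemption can be sketched as a tiny wait-graph checker.
Again, this is a toy of my own devising, not deadlock.c: each process
advertises at most one wait, described as (operation, queue, peer), and
a read/write pair on the same queue is discounted before looking for
cycles, because a reader blocked on an empty queue and a writer blocked
on the same (full) queue cannot actually both be stuck.

```python
def find_deadlock(waits):
    """waits: dict mapping proc -> (op, queue, peer), op in {"read", "write"}.

    Builds wait-for edges, discounting a complementary read/write pair
    on the same queue (those two processes can unblock each other), then
    reports whether any wait-for cycle remains. Illustrative sketch only."""
    edges = {}
    for p, (op, q, peer) in waits.items():
        pw = waits.get(peer)
        if pw and pw[1] == q and pw[0] != op and pw[2] == p:
            # Complementary ops on the same queue: not a deadlock edge.
            continue
        edges[p] = peer
    # Each proc waits on at most one peer, so a simple walk finds cycles.
    for start in edges:
        seen, p = set(), start
        while p in edges and p not in seen:
            seen.add(p)
            p = edges[p]
        if p == start:
            return True
    return False
```

With this model, A reading from the very queue B is writing is cleared
as harmless, while A reading queue Q1 from B while B is blocked writing
a different, full queue Q2 back to A leaves a genuine cycle. The "maybe
a deadlock, maybe not" different-queues case in the paragraph above is
precisely what a static checker like this gets wrong: it would report a
deadlock even when A intends to service Q2 after finishing with Q1.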
A further difference in the semantics of these wait-edges is that if
process A is awaiting AccessExclusiveLock on a resource held by B, C,
and D at AccessShareLock, it needs to wait for all of those processes
to release their locks before it can do anything else. On the other
hand, if process A is awaiting tuples from B, C, and D, it needs only
ONE of those processes to emit tuples in order to make progress. Now
maybe that doesn't make any difference in practice, because even if two
of those processes are making lively progress and A is receiving tuples
from them and processing them like gangbusters, that's probably not
going to help the third one get unstuck.

If we adopt the approach of discounting that possibility, then as long
as the parallel leader can generate tuples locally (that is,
local_scan_done = false) we don't report the deadlock, but as soon as
it can no longer do that (local_scan_done = true) we do, even though we
could still theoretically read more tuples from the non-stuck workers.
So then you have to wonder why we're not solving problem #1, because
the deadlock was just as certain before we generated the maximum
possible number of tuples locally as it was afterwards.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers