On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenre...@gmx.de> > wrote: >> Hi, >> >> today's testing with master as of d253b0f6e3 yielded two clusters that >> stopped processing queries. Symptoms: >> >> [...] > > Thanks Andreas. Investigating.
First, the hanging stems from reentering dsm_backend_shutdown and trying to acquire DynamicSharedMemoryControlLock which we already acquired further down the stack in dsm_unpin_segment when it raised an error. That's obviously not great, but the real question is how we reached this this-cannot-happen error condition. I reproduced this by inserting a sleep before dsa_attach_in_place, inserting a call to dsa_allocate into ExecInitParallelPlan so that the executor's DSA area owns at least one segment, and then cancelling a parallel query before the sleepy worker has managed to attach. The DSA area is destroyed before the worker attaches, but the worker doesn't know this, and goes on to destroy it again after it learns that the query has been cancelled. In an earlier version of DSA, attaching should have failed in this scenario because the handle would be invalid. Based on complaints about creating an extra DSM segment all the time even if we don't turn out to need it, I implemented "in place" DSA areas where the control object is in user-supplied shared memory, in this case in the parallel query main DSM segment. But that created a new hazard: if you try to attach to a piece of memory that contains the remains of a already-destroyed DSA area, then we don't do anything to detect that. Oops. The attached patch fixes that one way: it detects refcnt == 0 as a defunct DSA area and raises an error when you try to attach. Another approach which I could explore would be to "reset" the DSA area instead of destroying it when the last backend detaches. I'm not sure if that would ever be useful as a feature in its own right, but it would at least allow a very late worker to attach and then detach in an orderly fashion in this query-cancelled case, so you wouldn't get a different error message in the worker in this rare case. Thoughts? -- Thomas Munro http://www.enterprisedb.com
detect-late-dsa-attach-in-place.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers