On Mon, Sep 17, 2018 at 10:42 AM Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
> > While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
> > repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
> > what exactly are the triggering conditions, but the symptoms are these:
> >
> > ...
>
> Urgh. Thanks Tomas. I will investigate.

Thanks very much to Tomas for giving me access to his benchmarking machine where this could be reproduced. Tomas was doing performance testing with no assertions, but with a cassert build I was able to hit an assertion failure after a while and eventually figure out what was going wrong.

The problem is that the 'segment bins' (linked lists that group segments by the largest contiguous run of free pages) can become corrupted when a segment becomes completely free, is returned to the operating system, and then the same segment slot (index number) is recycled, given the right sequence of allocations and frees and the right timing. There is an LWLock that protects segment slot and bin manipulations, but there is a kind of ABA problem where one backend can finish up looking at the defunct former inhabitant of a slot that another backend has recently created a new segment in. There is handling for that in the form of freed_segment_counter, a kind of generation/invalidation signalling, but there are a couple of paths that fail to check it at the right times.

With the attached draft patch, Tomas's benchmark script runs happily for long periods. A bit more study required with fresh eyes, tomorrow.

-- 
Thomas Munro
http://www.enterprisedb.com
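
For readers unfamiliar with this kind of ABA problem, here is a minimal single-process sketch of slot recycling and a generation-counter check. It is not the dsa.c code and not the attached patch: only the name freed_segment_counter comes from the description above, while SharedArea, BackendView, lookup_checked and the other names are hypothetical, and the locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 4

/* Shared state; a real allocator would protect this with a lock. */
typedef struct SharedArea {
    unsigned freed_segment_counter; /* bumped each time a segment is freed */
    int      segment_id[NSLOTS];    /* current inhabitant of each slot, 0 = empty */
    int      next_id;               /* source of unique segment ids */
} SharedArea;

/* One backend's private, cached view of the slots (hypothetical). */
typedef struct BackendView {
    unsigned seen_counter;          /* counter value when the cache was last validated */
    int      mapped_id[NSLOTS];     /* cached inhabitant of each slot, 0 = not mapped */
} BackendView;

static int create_segment(SharedArea *area, int slot)
{
    assert(slot >= 0 && slot < NSLOTS && area->segment_id[slot] == 0);
    area->segment_id[slot] = ++area->next_id;
    return area->segment_id[slot];
}

static void free_segment(SharedArea *area, int slot)
{
    assert(slot >= 0 && slot < NSLOTS && area->segment_id[slot] != 0);
    area->segment_id[slot] = 0;
    area->freed_segment_counter++;  /* signal other backends to revalidate */
}

/* Buggy path: trusts the cached mapping without checking the counter. */
static int lookup_unchecked(const SharedArea *area, BackendView *view, int slot)
{
    if (view->mapped_id[slot] == 0)
        view->mapped_id[slot] = area->segment_id[slot];
    return view->mapped_id[slot];
}

/* Safe path: conservatively drops every cached mapping if anything was freed. */
static int lookup_checked(const SharedArea *area, BackendView *view, int slot)
{
    if (view->seen_counter != area->freed_segment_counter) {
        for (int i = 0; i < NSLOTS; i++)
            view->mapped_id[i] = 0;
        view->seen_counter = area->freed_segment_counter;
    }
    return lookup_unchecked(area, view, slot);
}

/* Returns true if the unchecked path sees the defunct segment while the
 * checked path sees the new inhabitant of the recycled slot. */
bool aba_demo(void)
{
    SharedArea  area = {0};
    BackendView careless = {0}, careful = {0};

    int old_id = create_segment(&area, 1);
    lookup_unchecked(&area, &careless, 1);  /* both views cache the old segment */
    lookup_checked(&area, &careful, 1);

    free_segment(&area, 1);                 /* slot 1 is emptied ... */
    int new_id = create_segment(&area, 1);  /* ... and recycled */

    return lookup_unchecked(&area, &careless, 1) == old_id
        && lookup_checked(&area, &careful, 1) == new_id;
}
```

Calling aba_demo() walks through the sequence described above: a slot is freed and reused, the view that skips the counter check keeps referring to the defunct segment, and the view that checks the counter first picks up the new one. The sketch invalidates every cached mapping when the counter changes; whether the real code is that coarse is not stated here.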
fix-dsa-segment-free-bug.patch
Description: Binary data