On Sun, Sep 11, 2016 at 7:40 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> > On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> >
> >> I plan to do testing using my own testing harness after changing it to
> >> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
> >> column, which are never queried by the core part of the harness) and
> >> deleting them at random intervals. I think that none of pgbench's built in
> >> tests are likely to give the bucket splitting and squeezing code very much
> >> exercise.
> >
> > I've implemented this, by adding lines 197 through 202 to the count.pl
> > script. (I'm reattaching the test case)
> >
> > Within a few minutes of testing, I start getting Errors like these:
> >
> > 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not owned by resource owner Portal
> > 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set count=count+1 where index=$1
> >
> > In one test, I also got an error from my test harness itself indicating
> > tuples are transiently missing from the index, starting an hour into a test:
> >
> > child abnormal exit update did not update 1 row: key 9555 updated 0E0 at count.pl line 194.\n at count.pl line 208.
> > child abnormal exit update did not update 1 row: key 8870 updated 0E0 at count.pl line 194.\n at count.pl line 208.
> > child abnormal exit update did not update 1 row: key 8453 updated 0E0 at count.pl line 194.\n at count.pl line 208.
> >
> > Those key values should always find exactly one row to update.
> >
> > If the tuples were permanently missing from the index, I would keep getting
> > errors on the same key values very frequently. But I don't get that, the
> > errors remain infrequent and are on different value each time, so I think
> > the tuples are in the index but the scan somehow misses them, either while
> > the bucket is being split or while it is being squeezed.
> >
> > This on a build without enable-asserts.
> >
> > Any ideas on how best to go about investigating this?
>
> I think these symptoms indicate the bug in concurrent hash index
> patch, but it could be that the problem can be only revealed with WAL
> patch. Is it possible to just try this with concurrent hash index
> patch? In any case, thanks for testing it, I will look into these
> issues.

My test program (as posted) injects crashes and then checks the
post-crash-recovery system for consistency, so it cannot be run as-is
without the WAL patch. I also ran the test with crashing turned off (just
change the JJ* variables at the top of do.sh to all be set to the empty
string), and in that case I didn't see either problem, but it could just
be that I didn't run it long enough.

It should have been long enough to detect the rather common "buffer <x> is
not owned by resource owner Portal" problem, so I think that one is
specific to the WAL patch (probably the part which tries to complete bucket
splits when it detects one was started but not completed?)

Cheers,

Jeff
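
The reasoning quoted above (permanently missing tuples would produce repeated errors on the same keys, while transient misses land on different keys each time) can be checked mechanically against the harness output. What follows is only a minimal sketch in Python, not part of count.pl: the function names and the repeat threshold are hypothetical, and the only thing taken from this thread is the format of the "did not update 1 row" lines.

```python
import re
from collections import Counter

# Matches harness errors of the form quoted in this thread, e.g.
# "child abnormal exit update did not update 1 row: key 9555 updated 0E0 at ..."
MISSED_KEY = re.compile(r"did not update 1 row: key (-?\d+) updated")


def missed_key_counts(log_lines):
    """Count how many times each key failed to update exactly one row."""
    counts = Counter()
    for line in log_lines:
        match = MISSED_KEY.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def repeated_keys(counts, threshold=2):
    """Return keys that failed at least `threshold` times, sorted.

    The threshold is an arbitrary illustrative cutoff (an assumption),
    not a value taken from the test harness.
    """
    return sorted(key for key, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    sample = [
        "child abnormal exit update did not update 1 row: key 9555 updated 0E0 at count.pl line 194.",
        "child abnormal exit update did not update 1 row: key 8870 updated 0E0 at count.pl line 194.",
        "child abnormal exit update did not update 1 row: key 8453 updated 0E0 at count.pl line 194.",
    ]
    counts = missed_key_counts(sample)
    print("distinct keys missed:", len(counts))
    print("keys missed repeatedly:", repeated_keys(counts))
```

An empty list of repeated keys over a long run would be consistent with the scan transiently missing tuples; the same keys showing up again and again would instead point toward tuples that are really gone from the index.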