On 2017-12-21 01:29:40 -0800, Andres Freund wrote:
> On 2017-12-21 08:49:46 +0000, Andres Freund wrote:
> > Add parallel-aware hash joins.
>
> There's to relatively mundane failures:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2017-12-21%2008%3A48%3A12
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2017-12-21%2008%3A50%3A08
>
> but also one that's a lot more interesting:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=capybara&dt=2017-12-21%2008%3A50%3A08
>
> which shows an assert failure:
>
> #2 0x00000000008687d1 in ExceptionalCondition
> (conditionName=conditionName@entry=0xa76a98
> "!(!accessor->sts->participants[i].writing)",
> errorType=errorType@entry=0x8b2c49 "FailedAssertion",
> fileName=fileName@entry=0xa76991 "sharedtuplestore.c",
> lineNumber=lineNumber@entry=273) at assert.c:54
> #3 0x000000000089883e in sts_begin_parallel_scan (accessor=0xfaf780) at
> sharedtuplestore.c:273
> #4 0x0000000000634de4 in ExecParallelHashRepartitionRest
> (hashtable=0xfaec18) at nodeHash.c:1369
> #5 ExecParallelHashIncreaseNumBatches (hashtable=0xfaec18) at nodeHash.c:1198
> #6 0x000000000063546b in ExecParallelHashTupleAlloc
> (hashtable=hashtable@entry=0xfaec18, size=40,
> shared=shared@entry=0x7ffee26a8868) at nodeHash.c:2778
> #7 0x00000000006357c8 in ExecParallelHashTableInsert
> (hashtable=hashtable@entry=0xfaec18, slot=slot@entry=0xfa76f8,
> hashvalue=<optimized out>) at nodeHash.c:1696
> #8 0x0000000000635b5f in MultiExecParallelHash (node=0xf7ebc8) at
> nodeHash.c:288
> #9 MultiExecHash (node=node@entry=0xf7ebc8) at nodeHash.c:112
>
> which seems to suggest that something in the state machine logic is
> borked. ExecParallelHashIncreaseNumBatches() should've ensured that
> everyone has called sts_end_write()...

Thomas, I wonder if the problem is that PHJ_GROW_BATCHES_ELECTING
updates HashJoinTable->nbatch, via ExecParallelHashJoinSetUpBatches(),
while other backends also access ->nbatch in
ExecParallelHashCloseBatchAccessors(). Both happen after waiting for
the WAIT_EVENT_HASH_GROW_BATCHES_ELECTING phase.

That'd lead to ExecParallelHashCloseBatchAccessors() likely not
finishing writing all batches (because nbatch < nbatch_old), which
seems like it'd explain this?

Greetings,

Andres Freund
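
To make the hypothesis above a bit more concrete, here is a toy, single-process model of it. It is only a sketch and not PostgreSQL code: ToyBatches, TOY_MAX_BATCHES and toy_close_with_count() are made-up names, and the per-batch "writing" flag merely stands in for the participant state that the failed assertion checks. It assumes, as suggested above, that the backend closing its accessors loops over a batch count that is smaller than the number of batches it actually opened for writing.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_BATCHES 8		/* hypothetical bound for this toy model */

/* Toy stand-in for per-batch writer state; not the real SharedTuplestore. */
typedef struct ToyBatches
{
	bool		writing[TOY_MAX_BATCHES];
} ToyBatches;

/*
 * Open nbatch_opened batches for writing, then close only the first
 * nbatch_seen of them, the way a loop bounded by a concurrently changed
 * batch count would.  Returns how many batches are still marked as writing.
 */
static int
toy_close_with_count(int nbatch_opened, int nbatch_seen)
{
	ToyBatches	b = {{false}};
	int			still_writing = 0;
	int			i;

	assert(nbatch_opened >= 0 && nbatch_opened <= TOY_MAX_BATCHES);
	assert(nbatch_seen >= 0 && nbatch_seen <= TOY_MAX_BATCHES);

	/* Every opened batch starts out in the "writing" state. */
	for (i = 0; i < nbatch_opened; i++)
		b.writing[i] = true;

	/* The close loop is bounded by the count this backend happens to see. */
	for (i = 0; i < nbatch_seen; i++)
		b.writing[i] = false;

	/* Anything left here would trip a "!writing" style assertion later. */
	for (i = 0; i < TOY_MAX_BATCHES; i++)
	{
		if (b.writing[i])
			still_writing++;
	}

	return still_writing;
}
```

In this model toy_close_with_count(4, 2) returns 2, i.e. two batches are left in the writing state, while toy_close_with_count(4, 4) returns 0. Bounding the close loop by a count captured before any other backend can change it would correspond to the second call.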