On Fri, Aug 18, 2017 at 3:50 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Ah-hah, I see my dromedary box is one of the ones failing, so I'll
>> have a look there. I can't reproduce it on my other machines.
>
> OK, so this is a whole lot more broken than I thought :-(.
>
> Bear in mind that the plan for this (omitting uninteresting detail) is
>
>  Nested Loop Left Join
>    ->  Values Scan on "*VALUES*"
>    ->  Finalize GroupAggregate
>          ->  Gather Merge
>                ->  Partial GroupAggregate
>                      ->  Sort
>                            ->  Parallel Seq Scan on tenk1
>
> What seems to be happening is that:
>
> 1. On the first pass, the parallel seqscan work gets doled out to several
> workers, plus the leader, as expected.
>
> 2. When the nestloop rescans its right input, ExecReScanGatherMerge
> supposes that this is good enough to handle rescanning its subnodes:
>
>     ExecReScan(node->ps.lefttree);
>
> Leaving aside the question of why that doesn't look like nearly every
> other child rescan call, what happens is that that invokes ExecReScanAgg,
> which does the more usual thing:
>
>     if (outerPlan->chgParam == NULL)
>         ExecReScan(outerPlan);
>
> and so that invokes ExecReScanSort, and then behold what ExecReScanSort
> thinks it can optimize away:
>
>  * If subnode is to be rescanned then we forget previous sort results; we
>  * have to re-read the subplan and re-sort.  Also must re-sort if the
>  * bounded-sort parameters changed or we didn't select randomAccess.
>  *
>  * Otherwise we can just rewind and rescan the sorted output.
>
> So we never get to ExecReScanSeqScan, and thus not to heap_rescan,
> with the effect that parallel_scan->phs_nallocated never gets reset.
>
> 3. On the next pass, we fire up all the workers as expected, but they all
> perceive phs_nallocated >= rs_nblocks and conclude they have nothing to
> do. Meanwhile, in the leader, nodeSort just re-emits the sorted data it
> had last time. Net effect is that the GatherMerge node returns only the
> fraction of the data that was scanned by the leader in the first pass.
>
> 4. The fact that the test succeeds on many machines implies that the
> leader process is usually doing *all* of the work. This is in itself not
> very good. Even on the machines where it fails, the fact that the tuple
> counts are usually a pretty large fraction of the expected values
> indicates that the leader usually did most of the work. We need to take
> a closer look at why that is.
>
> However, the bottom line here is that parallel scan is completely broken
> for rescans, and it's not (solely) the fault of nodeGatherMerge; rather,
> the issue is that nobody bothered to wire up parallelism to the rescan
> parameterization mechanism.
>
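[For readers following along: the rewind optimization being quoted is the one
in ExecReScanSort() (src/backend/executor/nodeSort.c). From memory, and
simplified rather than verbatim, its decision looks about like this:

    /* simplified sketch of the rescan decision in ExecReScanSort() */
    if (outerPlan->chgParam != NULL ||
        node->bounded != node->bounded_Done ||
        node->bound != node->bound_Done ||
        !node->randomAccess)
    {
        /* forget previous results; re-read the subplan and re-sort */
        node->sort_Done = false;
        tuplesort_end((Tuplesortstate *) node->tuplesortstate);
        node->tuplesortstate = NULL;

        /* if chgParam is set, ExecProcNode will rescan the subplan itself */
        if (outerPlan->chgParam == NULL)
            ExecReScan(outerPlan);
    }
    else
    {
        /* nothing changed, so just rewind the already-sorted output */
        tuplesort_rescan((Tuplesortstate *) node->tuplesortstate);
    }

In the failing case the Sort node's chgParam is empty and nothing else has
changed, so it takes the tuplesort_rescan() branch, and ExecReScanSeqScan()
and heap_rescan() are never reached, which is why phs_nallocated is never
reset.]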
I think we don't generate parallel plans for parameterized paths, so I am
not sure whether any work is required in that area.

> I imagine that related bugs can be demonstrated in 9.6 with little
> effort.
>
> I think that the correct fix probably involves marking each parallel scan
> plan node as dependent on a pseudo executor parameter, which the parent
> Gather or GatherMerge node would flag as being changed on each rescan.
> This would cue the plan layers in between that they cannot optimize on the
> assumption that the leader's instance of the parallel scan will produce
> exactly the same rows as it did last time, even when "nothing else
> changed". The "wtParam" pseudo parameter that's used for communication
> between RecursiveUnion and its descendant WorkTableScan node is a good
> model for what needs to happen.
>

Yeah, that seems like a good idea. I think another way could be to *not*
optimize rescanning when we are in parallel mode (IsInParallelMode()); that
might be more restrictive than what you are suggesting, but it would be
somewhat simpler. (A rough illustration of this idea is sketched at the end
of this mail.)

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
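[For illustration only: the "don't optimize rescans while in parallel mode"
idea could be as small as adding one condition to the ExecReScanSort() test
sketched earlier. The IsInParallelMode() check is the only new part, and
whether ExecReScanSort() would be the only place needing such a check is an
open question:

    if (outerPlan->chgParam != NULL ||
        IsInParallelMode() ||   /* hypothetical: never reuse sorted output
                                 * while parallel mode is active */
        node->bounded != node->bounded_Done ||
        node->bound != node->bound_Done ||
        !node->randomAccess)
    {
        /* forget previous results; re-read the subplan and re-sort */
        node->sort_Done = false;
        tuplesort_end((Tuplesortstate *) node->tuplesortstate);
        node->tuplesortstate = NULL;
        if (outerPlan->chgParam == NULL)
            ExecReScan(outerPlan);
    }
    else
        tuplesort_rescan((Tuplesortstate *) node->tuplesortstate);

That would force a fresh rescan of the underlying parallel seqscan on every
rescan whenever parallelism is active, giving up the rewind optimization even
where it would still be safe -- the "more restrictive but simpler" trade-off
compared with the wtParam-style pseudo parameter described above.]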