On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > I have implemented this idea (note that I have to expose a new API > shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and > we sum_mq* to call shm_mq_detach) and apart this I have fixed other > problems reported on this thread: > > 1. Execution of initPlan by master backend and then pass the > required PARAM_EXEC parameter values to workers. > 2. Avoid consuming dsm's by freeing the parallel context after > the last tuple is fetched. > 3. Allow execution of Result node in worker backend as that can > be added as a gating filter on top of PartialSeqScan. > 4. Merged parallel heap scan descriptor patch > > To apply the patch, please follow below sequence: > > HEAD Commit-Id: 4d930eee > parallel-mode-v9.patch [1] > assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in > the patch) > parallel_seqscan_v14.patch (Attached with this mail)
Thanks, this version looks like an improvement. However, I still see some problems: - I believe the separation of concerns between ExecFunnel() and ExecEndFunnel() is not quite right. If the scan is shut down before it runs to completion (e.g. because of LIMIT), then I think we'll call ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I think you probably need to create a static subroutine that is called both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each case cleaning up whatever resources remain. - InitializeParallelWorkers() still mixes together general parallel executor concerns with concerns specific to parallel sequential scan (e.g. EstimatePartialSeqScanSpace). We have to eliminate everything that assumes that what's under a funnel will be, specifically, a partial sequential scan. To make this work properly, I think we should introduce a new function that recurses over the plan tree and invokes some callback for each plan node. I think this could be modeled on this code from ExplainNode(), beginning around line 1593: /* initPlan-s */ if (planstate->initPlan) ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es); /* lefttree */ if (outerPlanState(planstate)) ExplainNode(outerPlanState(planstate), ancestors, "Outer", NULL, es); /* righttree */ if (innerPlanState(planstate)) ExplainNode(innerPlanState(planstate), ancestors, "Inner", NULL, es); /* special child plans */ switch (nodeTag(plan)) { /* a bunch of special cases */ } /* subPlan-s */ if (planstate->subPlan) ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es); The new function would do the same sort of thing, but instead of explaining each node, it would invoke a callback for each node. Possibly explain.c could use it instead of having hard-coded logic. Possibly it should use the same sort of return-true convention as expression_tree_walker, query_tree_walker, and friends. So let's call it planstate_tree_walker. Now, instead of directly invoking logic specific to parallel sequential scan, it should call planstate_tree_walker() on its lefttree and pass a new function ExecParallelEstimate() as the callback. That function ignores any node that's not parallel aware, but when it sees a partial sequential scan (or, in the future, some a parallel bitmap scan, parallel sort, or what have you) it does the appropriate estimation work. When ExecParallelEstimate() finishes, we InitializeParallelDSM(). Then, we call planstate_tree_walker() on the lefttree again, and this time we pass another new function ExecParallelInitializeDSM(). Like the previous one, that ignores the callbacks from non-parallel nodes, but if it hits a parallel node, then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a partial sequential scan). - shm_mq_from_handle() is probably reasonable, but can we rename it shm_mq_get_queue()? - It's hard to believe this is right: + if (parallelstmt->inst_options) + receiver = None_Receiver; Really? Flush the tuples if there are *any instrumentation options whatsoever*? At the very least, that doesn't look too future-proof, but I'm suspicious that it's outright incorrect. - I think ParallelStmt probably shouldn't be defined in parsenodes.h. That file is included in a lot of places, and adding all of those extra #includes there doesn't seem like a good idea for modularity reasons even if you don't care about partial rebuilds. Something that includes a shm_mq obviously isn't a "parse" node in any meaningful sense anyway. - I don't think you need both setup cost and startup cost. Starting up more workers isn't particularly more expensive than starting up fewer of them, because most of the overhead is in waiting for them to actually start, and the number of workers is reasonable, then they're all be doing that in parallel with each other. I suggest removing parallel_startup_cost and keeping parallel_setup_cost. - In cost_funnel(), I don't think it's right to divide the run cost by nWorkers + 1. Suppose we've got a plan that looks like this: Funnel -> Hash Join -> Partial Seq Scan on a -> Hash -> Seq Scan on b The sequential scan on b is going to get executed once per worker, whereas the effort for the sequential scan on a is going to be divided over all the workers. So the right way to cost this is as follows: (a) The cost of the partial sequential scan on a is equal to the cost of a regular sequential scan, plus a little bit of overhead to account for communication via the ParallelHeapScanDesc, divided by the number of workers + 1. (b) The cost of the remaining nodes under the funnel works normally. (c) The cost of the funnel is equal to the cost of the hash join plus number of tuples multiplied by per-tuple communication overhead plus a large fixed overhead reflecting the time it takes the workers to start. - While create_parallelscan_paths() is quite right to limit the number of workers to no more than the number of pages, it's pretty obvious that in practice that's way too conservative. I suggest we get significantly more aggressive about that, like limiting ourselves to one worker per thousand pages. We don't really know exactly what the costing factors should be here just yet, but we certainly know that spinning up lots of workers to read a handful of pages each must be dumb. And we can save a significant amount of planning time here by not bothering to generate parallel paths for little tiny relations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers