Re: [HACKERS] [DESIGN] ParallelAppend
On Mon, Nov 23, 2015 at 10:39 PM, Robert Haas wrote:
>
> On Mon, Nov 23, 2015 at 7:45 AM, Amit Kapila wrote:
> > Without this patch, that 0.5 (or 50% of the leader's effort) is considered
> > for the Gather node irrespective of the number of workers or other factors,
> > but I think with the patch that is no longer true, and that's what I am
> > worrying about.
>
> Nope, that patch does not change that at all. We probably should, but
> this patch does not.
>

I have taken some performance data with this patch.

- Select data from an inheritance hierarchy with very few tuples.

Create table parent_rel(c1 int, c2 text);
Create table child1_rel () Inherits (parent_rel);
Create table child2_rel () Inherits (parent_rel);
insert into parent_rel values(generate_series(1,15), '');
insert into child1_rel values(generate_series(10,20),'aaa');
insert into child2_rel values(generate_series(20,30),'aaa');
Analyze parent_rel;
Analyze child1_rel;
Analyze child2_rel;

set max_parallel_degree=4;
set parallel_setup_cost=0;
set parallel_tuple_cost=0.01;

postgres=# explain select count(*) from parent_rel;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=2.71..2.72 rows=1 width=0)
   ->  Gather  (cost=0.00..2.62 rows=37 width=0)
         Number of Workers: 1
         ->  Append  (cost=0.00..2.25 rows=37 width=0)
               ->  Parallel Seq Scan on parent_rel  (cost=0.00..0.77 rows=15 width=0)
               ->  Parallel Seq Scan on child1_rel  (cost=0.00..0.74 rows=11 width=0)
               ->  Parallel Seq Scan on child2_rel  (cost=0.00..0.74 rows=11 width=0)

I have changed parallel_setup_cost and parallel_tuple_cost, so it selects the Gather path even for a small relation. However, the same won't be true for a non-inheritance relation: if the number of pages in the relation is below the threshold (1000), it won't select a parallel path. We might want a similar restriction for an Append relation as well: if the child subpaths combined don't have more than the threshold number of pages, then don't try to build the parallel path.
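The restriction suggested above can be sketched as follows. This is an illustrative fragment, not planner code; the names ChildRel, PARALLEL_THRESHOLD, and append_rel_worth_parallelizing are invented for the example:

```c
#include <stdbool.h>

/*
 * Illustrative sketch (hypothetical names, not PostgreSQL code): build a
 * parallel path for an append relation only if the combined page count of
 * all child relations crosses the same threshold applied to plain relations.
 */
#define PARALLEL_THRESHOLD 1000

typedef struct ChildRel
{
    double      pages;          /* estimated pages in this child relation */
} ChildRel;

static bool
append_rel_worth_parallelizing(const ChildRel *children, int nchildren)
{
    double      total_pages = 0.0;

    for (int i = 0; i < nchildren; i++)
        total_pages += children[i].pages;

    /* Combined children must be at least as big as one parallel-worthy rel */
    return total_pages >= PARALLEL_THRESHOLD;
}
```

Under this rule, three 400-page children together (1200 pages) would still qualify for a parallel Append even though no single child crosses the threshold on its own.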
- Choose a data set that fits in shared_buffers and then run statements with different selectivity and max_parallel_degree.

Test setup
1. Use pgbench -i -s 100 to create the initial data.
2. Use the attached pgbench_partitions.sql to create 10 partitions with equal data.
3. Use parallel_append.sh to execute statements with different selectivity and max_parallel_degree (changed parallel_tuple_cost to 0.001).

Selection criteria: 1% of rows are selected, with a costly function evaluated for each row.

Head
max_parallel_degree  exec_time (ms)  workers_used
 0                   76202           0
 2                   28556           2
 4                   21620           3
 8                   21693           3
16                   21654           3
32                   21579           3
64                   21474           3

Patch
max_parallel_degree  exec_time (ms)  workers_used
 0                   77027           0
 2                   27088           2
 4                   16648           4
 8                   13730           5
16                   13787           5
32                   13794           5
64                   13872           5

So here we can see that performance is better with the patch, but I think that is mainly due to the number of workers working on the plan. It is not clear whether allowing more workers at a higher max_parallel_degree would have given any substantial benefit, but anyway I think that's a generic worker-allocation improvement which is not directly related to this patch. The data at different selectivities can be found in the attached document; it shows more or less the same trend. Apart from this, I have tried a data set which doesn't fit in shared buffers but fits in RAM; that also shows a similar trend.

The patch looks good, apart from the worker-allocation stuff, but I think we can deal with that separately.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgbench_partitions.sql — Description: Binary data
parallel_append.sh — Description: Bourne shell script
parallel_append_data.ods — Description: application/vnd.oasis.opendocument.spreadsheet

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Nov 20, 2015 at 7:06 PM, Robert Haas wrote:
>
> On Fri, Nov 20, 2015 at 12:45 AM, Amit Kapila wrote:
> > Okay, but I think that's not what I am talking about. I am talking about
> > the below code in cost_seqscan:
> >
> > - if (nworkers > 0)
> > -     run_cost = run_cost / (nworkers + 0.5);
> > + if (path->parallel_degree > 0)
> > +     run_cost = run_cost / (path->parallel_degree + 0.5);
> >
> > It will consider 50% of the master backend's effort for the scan of each
> > child relation; does that look correct to you? Wouldn't 50% of the master
> > backend's effort be considered to scan all the child relations?
>
> In the code you originally wrote, you were adding 1 there rather than
> 0.5. That meant you were expecting the leader to do as much work as
> each of its workers, which is clearly a bad estimate, because the
> leader also has to do the work of gathering tuples from the workers.
> 0.5 might not be the right value, but it's surely better than 1.
>

Without this patch, that 0.5 (or 50% of the leader's effort) is considered for the Gather node irrespective of the number of workers or other factors, but I think with the patch that is no longer true, and that's what I am worrying about.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Mon, Nov 23, 2015 at 7:45 AM, Amit Kapila wrote:
> Without this patch, that 0.5 (or 50% of the leader's effort) is considered
> for the Gather node irrespective of the number of workers or other factors,
> but I think with the patch that is no longer true and that's what I am
> worrying about.

Nope, that patch does not change that at all. We probably should, but this patch does not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Nov 20, 2015 at 12:45 AM, Amit Kapila wrote:
> Okay, but I think that's not what I am talking about. I am talking about
> the below code in cost_seqscan:
>
> - if (nworkers > 0)
> -     run_cost = run_cost / (nworkers + 0.5);
> + if (path->parallel_degree > 0)
> +     run_cost = run_cost / (path->parallel_degree + 0.5);
>
> It will consider 50% of the master backend's effort for the scan of each
> child relation; does that look correct to you? Wouldn't 50% of the master
> backend's effort be considered to scan all the child relations?

In the code you originally wrote, you were adding 1 there rather than 0.5. That meant you were expecting the leader to do as much work as each of its workers, which is clearly a bad estimate, because the leader also has to do the work of gathering tuples from the workers. 0.5 might not be the right value, but it's surely better than 1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
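To make the difference concrete, here is a toy version of the divisor under discussion (the function name, parameter names, and numbers are invented for this sketch, not taken from the planner):

```c
/*
 * Toy sketch of the run-cost divisor discussed above (not actual planner
 * code).  leader_share = 1.0 models the leader as a full worker; 0.5 models
 * it as half a worker, since it also spends time gathering tuples.
 */
static double
parallel_run_cost(double run_cost, int parallel_degree, double leader_share)
{
    return run_cost / (parallel_degree + leader_share);
}
```

With a run cost of 300 and two workers, dividing by (2 + 1) yields 100, while dividing by (2 + 0.5) yields 120; the higher estimate reflects that the leader contributes less than a full worker's share of scanning.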
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas wrote:
>
> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila wrote:
> > Don't we need the startup cost in case we need to build partial paths for
> > joinpaths like mergepath?
> > Also, I think there are other cases for single relation scan where startup
> > cost can matter, like when there are pseudoconstants in the qualification
> > (refer cost_qual_eval_walker()) or, let us say, if someone has disabled
> > seq scan (disable_cost is considered as startup cost).
>
> I'm not saying that we don't need to compute it. I'm saying we don't
> need to take it into consideration when deciding which paths have
> merit. Note that consider_startup is set this way:
>
> rel->consider_startup = (root->tuple_fraction > 0);
>

Even when consider_startup is false, startup_cost is still used for the cost calculation. Maybe ignoring it is okay for partial paths, but it still seems worth thinking about why leaving it out is okay for partial paths even though it is used in add_path().

+ * We don't generate parameterized partial paths because they seem unlikely
+ * ever to be worthwhile.  The only way we could ever use such a path is
+ * by executing a nested loop with a complete path on the outer side - thus,
+ * each worker would scan the entire outer relation - and the partial path
+ * on the inner side - thus, each worker would scan only part of the inner
+ * relation.  This is silly: a parameterized path is generally going to be
+ * based on an index scan, and we can't generate a partial path for that.

Won't it be useful to consider parameterized paths for the below kind of plan, where we can push the join tree to the worker so that each worker scans the complete outer relation A and the rest of the work is divided among workers (of course there can be other ways to parallelize such joins, but the way described also seems possible)?
NestLoop
-> Seq Scan on A
   Hash Join
     Join Condition: B.Y = C.W
     -> Seq Scan on B
     -> Index Scan using C_Z_IDX on C
          Index Condition: C.Z = A.X

- Is the main reason to have add_partial_path() that it has fewer checks, or is it that the current add_path() would give wrong answers in some case? If there is no case where add_path() can't work, then there is some advantage in retaining add_path(), at least in terms of maintaining the code.

+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
{
..
+	/* Unless pathkeys are incompable, keep just one of the two paths. */
..

typo - 'incompable'

> > A.
> > This means that for inheritance child relations for which rel pages are
> > less than parallel_threshold, it will always consider the cost shared
> > between 1 worker and the leader, as per the below calc in cost_seqscan:
> > if (path->parallel_degree > 0)
> >     run_cost = run_cost / (path->parallel_degree + 0.5);
> >
> > I think this might not be the appropriate cost model even for
> > non-inheritance relations which have more pages than parallel_threshold,
> > but it seems to be even worse for inheritance children which have
> > fewer pages than parallel_threshold
>
> Why?

Because, the way the code is written, it assumes that for each inheritance-child relation which has fewer pages than the threshold, half the work will be done by the master backend, which doesn't seem to be the right distribution. Consider a case where there are three such children, each having cost 100 to scan: now it will cost them as 100/1.5 + 100/1.5 + 100/1.5, which means that per worker it is considering 0.5 of the master backend's work, which seems wrong. I think for the Append case, we should consider this cost during Append path creation in create_append_path().
Basically we can make cost_seqscan ignore the cost reduction due to parallel_degree for inheritance relations, and then during Append path creation we can consider it, and also consider the master backend's work unit as 0.5 with respect to the overall work.

--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costly them accurately) is likely to keep us
+busy for a long time to come.

Seems there is a typo in the above text: /costly/cost

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Nov 19, 2015 at 1:29 PM, Amit Kapila wrote:
>
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas wrote:
>
> - Is the main reason to have add_partial_path() that it has fewer
> checks, or is it that the current add_path() would give wrong answers
> in some case?
>
> If there is no case where add_path() can't work, then there is some
> advantage in retaining add_path(), at least in terms of maintaining
> the code.

To be specific, I mean the logic of add_path().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila wrote:
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas wrote:
>>
>> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila wrote:
>> > Don't we need the startup cost in case we need to build partial paths
>> > for joinpaths like mergepath?
>> > Also, I think there are other cases for single relation scan where
>> > startup cost can matter, like when there are pseudoconstants in the
>> > qualification (refer cost_qual_eval_walker()) or, let us say, if someone
>> > has disabled seq scan (disable_cost is considered as startup cost).
>>
>> I'm not saying that we don't need to compute it. I'm saying we don't
>> need to take it into consideration when deciding which paths have
>> merit. Note that consider_startup is set this way:
>>
>> rel->consider_startup = (root->tuple_fraction > 0);
>>
>
> Even when consider_startup is false, startup_cost is still used for the
> cost calculation. Maybe ignoring that is okay for partial paths, but it
> still seems worth thinking about why leaving it out is okay for partial
> paths even though it is used in add_path().

That is explained in the comments, and I just explained it again in my previous email. I'm not sure how much clearer I can make it. For a regular path, it might sometimes be useful to pick a path with a higher total cost if it has a lower startup cost. The reason this could be useful is that we might not run the resulting plan to completion. However, parallel queries are always run to completion, so a lower startup cost isn't useful. We just want the path with the lowest total cost. I don't know what else to say here unless you can ask a more specific question.

> Won't it be useful to consider parameterized paths for the below kind of
> plan, where we can push the join tree to the worker so that each worker can
> scan the complete outer relation A and the rest of the work is divided
> among workers (of course there can be other ways to parallelize such joins,
> but the way described also seems possible)?
>
> NestLoop
> -> Seq Scan on A
>    Hash Join
>      Join Condition: B.Y = C.W
>      -> Seq Scan on B
>      -> Index Scan using C_Z_IDX on C
>           Index Condition: C.Z = A.X

I had thought that this sort of plan wouldn't actually occur in real life, but it seems that it does. What you've written here is a little muddled - the hash join has no hash underneath, for example, and there'd have to be some sort of join order restriction in order to consider a plan of this type. However, somewhat to my surprise, I was able to get a plan much like this by doing this:

rhaas=# create table a (x int);
CREATE TABLE
rhaas=# insert into a values (1);
INSERT 0 1
rhaas=# create table b (y int, filler text);
CREATE TABLE
rhaas=# insert into b select g, random()::text||random()::text||random()::text||random()::text from generate_series(1,100) g;
INSERT 0 100
rhaas=# create table c (z int, w int, filler text);
CREATE TABLE
rhaas=# insert into c select g, g, random()::text||random()::text||random()::text||random()::text from generate_series(1,100) g;
INSERT 0 100
rhaas=# create index c_z_idx on c (z);
CREATE INDEX
rhaas=# vacuum analyze;
VACUUM
rhaas=# explain analyze select * from A LEFT JOIN (B INNER JOIN C ON B.Y = C.W) ON C.Z = A.x;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=8.46..26810.48 rows=1 width=152) (actual time=0.076..166.946 rows=1 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=4) (actual time=0.015..0.016 rows=1 loops=1)
   ->  Hash Join  (cost=8.46..26809.47 rows=1 width=148) (actual time=0.057..166.925 rows=1 loops=1)
         Hash Cond: (b.y = c.w)
         ->  Seq Scan on b  (cost=0.00..23051.00 rows=100 width=72) (actual time=0.012..89.013 rows=100 loops=1)
         ->  Hash  (cost=8.44..8.44 rows=1 width=76) (actual time=0.035..0.035 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Scan using c_z_idx on c  (cost=0.42..8.44 rows=1 width=76) (actual time=0.031..0.032 rows=1 loops=1)
                     Index Cond: (z = a.x)
 Planning time: 0.394 ms
 Execution time: 167.015 ms
(11 rows)

This is extremely narrow.
If you have more than one row in A, the planner doesn't pick a nested loop. And if you're actually going to be running queries like this frequently, then you should create an index on B (Y), which causes you to get an execution time of ~ 0.2 ms instead of 167 ms, because we generate a parameterized nestloop over two index scans. But if you have no index on B (Y) and a does contain precisely one row, then you can get this sort of plan, and hey, the runtime is even long enough for parallelism to potentially be useful. But after thinking about it for a while, I realized that even if you think
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Nov 20, 2015 at 1:25 AM, Robert Haas wrote:
>
> On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila wrote:
> > Won't it be useful to consider parameterized paths for the below kind of
> > plan, where we can push the join tree to the worker so that each worker
> > can scan the complete outer relation A and the rest of the work is divided
> > among workers (of course there can be other ways to parallelize such
> > joins, but the way described also seems possible)?
> >
> > NestLoop
> > -> Seq Scan on A
> >    Hash Join
> >      Join Condition: B.Y = C.W
> >      -> Seq Scan on B
> >      -> Index Scan using C_Z_IDX on C
> >           Index Condition: C.Z = A.X
>
> I had thought that this sort of plan wouldn't actually occur in real
> life, but it seems that it does. What you've written here is a little
> muddled - the hash join has no hash underneath, for example, and
> there'd have to be some sort of join order restriction in order to
> consider a plan of this type. However, somewhat to my surprise, I was
> able to get a plan much like this by doing this:
> ..
>
> So, all in all, I think this isn't a very promising type of plan -
> both because we haven't got the infrastructure to make it safe to
> execute today, and because even if we did have that infrastructure it
> wouldn't be the right choice except in narrow circumstances.
>

I think not only the above type of plan, but it would also be helpful to parallelize some other forms of joins (refer to the "Parameterized Paths" section in optimizer/README) where the parameterized-path concept will be required. I am not sure we can say that such cases will be narrow, so let's leave them, but surely we don't have enough infrastructure at the moment to parallelize them.

> We can
> of course revise that decision in the future if things look different
> then.
>

No issues. The main reason I brought up this discussion is to see the possibility of keeping the logic of add_partial_path() and add_path() the same, so that it is easy to maintain.
There is no correctness issue here, so I defer it to you.

>
> > Because, the way the code is written, it assumes that for each
> > inheritance-child relation which has fewer pages than the threshold, half
> > the work will be done by the master backend, which doesn't seem to be the
> > right distribution. Consider a case where there are three such children,
> > each having cost 100 to scan: now it will cost them as
> > 100/1.5 + 100/1.5 + 100/1.5, which means that per worker it is
> > considering 0.5 of the master backend's work, which seems wrong.
> >
> > I think for the Append case, we should consider this cost during Append
> > path creation in create_append_path(). Basically we can make cost_seqscan
> > ignore the cost reduction due to parallel_degree for inheritance
> > relations, and then during Append path creation we can consider it and
> > also consider the master backend's work unit as 0.5 with respect to the
> > overall work.
>
> No, I don't think that's right. It's true that the way we're
> calculating parallel_degree for each relation is unprincipled right
> now, and we need to improve that. But if it were correct, then what
> we're doing here would also be correct. If the number of workers
> chosen for each child plan reflected the maximum number that could be
> used effectively by that child plan, then any extras wouldn't speed
> things up even if they were present,
>

Okay, but I think that's not what I am talking about. I am talking about the below code in cost_seqscan:

- if (nworkers > 0)
-     run_cost = run_cost / (nworkers + 0.5);
+ if (path->parallel_degree > 0)
+     run_cost = run_cost / (path->parallel_degree + 0.5);

It will consider 50% of the master backend's effort for the scan of each child relation; does that look correct to you? Wouldn't 50% of the master backend's effort be considered to scan all the child relations?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
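The two ways of charging the leader's share can be contrasted with a toy calculation (the helper names and numbers are invented for this sketch; it is not planner code):

```c
/*
 * Illustrative sketch of the two costing schemes discussed above (made-up
 * helpers, not PostgreSQL code).  Per-child division charges 0.5 of the
 * leader's effort for every child; Append-level division charges it once
 * for the whole scan.
 */
static double
per_child_costing(const double *costs, int n, int child_degree)
{
    double      total = 0.0;

    for (int i = 0; i < n; i++)
        total += costs[i] / (child_degree + 0.5);
    return total;
}

static double
append_level_costing(const double *costs, int n, int append_degree)
{
    double      total = 0.0;

    for (int i = 0; i < n; i++)
        total += costs[i];
    return total / (append_degree + 0.5);
}
```

Three children of cost 100 each, with one worker per child, come out as 100/1.5 three times, i.e. 200; costing the same 300 units once at the Append level with three workers gives 300/3.5, roughly 85.7. Which estimate better matches execution depends on how workers are actually distributed across the children.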
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Nov 17, 2015 at 4:59 PM, Thom Brown wrote:
> On 17 November 2015 at 20:08, Robert Haas wrote:
>> On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown wrote:
>>
>>> However, the first parallel seq scan shows it getting 170314 rows.
>>> Another run shows it getting 194165 rows. The final result is
>>> correct, but as you can see from the rows on the Append node (59094295
>>> rows), it doesn't match the number of rows on the Gather node
>>> (3000).
>>
>> Is this the same issue reported in
>> http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
>> and not yet fixed? I am inclined to think it probably is.
>
> Yes, that seems to be the same issue.

I've committed a fix for that issue now, so you shouldn't see it any more if you retest this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] [DESIGN] ParallelAppend
On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila wrote:
> Don't we need the startup cost in case we need to build partial paths for
> joinpaths like mergepath?
> Also, I think there are other cases for single relation scan where startup
> cost can matter, like when there are pseudoconstants in the qualification
> (refer cost_qual_eval_walker()) or, let us say, if someone has disabled
> seq scan (disable_cost is considered as startup cost).

I'm not saying that we don't need to compute it. I'm saying we don't need to take it into consideration when deciding which paths have merit. Note that consider_startup is set this way:

rel->consider_startup = (root->tuple_fraction > 0);

And for a joinrel:

joinrel->consider_startup = (root->tuple_fraction > 0);

root->tuple_fraction is 0 when we expect all tuples to be retrieved, and parallel query can currently only be used when all tuples will be retrieved.

> B. I think partial path is an important concept and deserves some
> explanation in src/backend/optimizer/README.
> There is already a good explanation about Paths, so it seems
> better to add an explanation about partial paths.

Good idea. In the attached, revised version of the patch, I've added a large new section to that README.

> 2.
> + * costs: parallelism is only used for plans that will be run to completion.
> + *        Therefore, this routine is much simpler than add_path: it needs to
> + *        consider only pathkeys and total cost.
>
> There seems to be some spacing issue in the last two lines.

Fixed.

> A.
> This means that for inheritance child relations for which rel pages are
> less than parallel_threshold, it will always consider the cost shared
> between 1 worker and the leader, as per the below calc in cost_seqscan:
> if (path->parallel_degree > 0)
>     run_cost = run_cost / (path->parallel_degree + 0.5);
>
> I think this might not be the appropriate cost model even for
> non-inheritance relations which have more pages than parallel_threshold,
> but it seems to be even worse for inheritance children which have
> fewer pages than parallel_threshold

Why? I'm certainly open to patches to improve the cost model, but I don't see why this patch needs to do that.

> B.
> If none of the inheritance child rels (or very few of them) are big
> enough for a parallel scan, is considering an Append node for parallelism
> of any use? In other words, won't it be better to generate the plan as it
> is done now, without this patch, for such cases, considering the current
> execution model of the Gather node?

I think we should instead extend the Append node as suggested by KaiGai, so that it can have both partial and non-partial children. I think we can leave that to another patch, though. Aside from questions of what the fastest plan is, it's very bad to have Gather nodes under the Append, because the Append could have many children and we could end up with a ton of Gather nodes, each using a DSM and a bunch of workers. That's important to avoid.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 012c14b..fe07176 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1764,7 +1765,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
@@ -1887,6 +1887,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)

 --- bjm & tgl

+Parallel Query and Partial Paths
+
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+that parallel
Re: [HACKERS] [DESIGN] ParallelAppend
On Sat, Nov 14, 2015 at 3:39 AM, Robert Haas wrote:
>
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases. The attached patch
> shows what I have in mind for phase 1.
>

A couple of comments and questions regarding this patch:

1.
+/*
+ * add_partial_path
..
+ * produce the same number of rows.  Neither do we need to consider startup
+ * costs: parallelism is only used for plans that will be run to completion.

A. Don't we need the startup cost in case we need to build partial paths for joinpaths like mergepath? Also, I think there are other cases for single relation scan where startup cost can matter, like when there are pseudoconstants in the qualification (refer cost_qual_eval_walker()) or, let us say, if someone has disabled seq scan (disable_cost is considered as startup cost).

B. I think partial path is an important concept and deserves some explanation in src/backend/optimizer/README. There is already a good explanation about Paths, so it seems better to add an explanation about partial paths.

2.
+ * costs: parallelism is only used for plans that will be run to completion.
+ *        Therefore, this routine is much simpler than add_path: it needs to
+ *        consider only pathkeys and total cost.

There seems to be some spacing issue in the last two lines.

3.
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int		parallel_threshold = 1000;
+	int		parallel_degree = 1;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child. In that
+	 * case, we want to generate a parallel path here anyway.
It might not be worthwhile
+	 * just for this relation, but when combined with all of its inheritance
+	 * siblings it may well pay off.
+	 */
+	if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+		return;

A. This means that for inheritance child relations for which rel pages are less than parallel_threshold, it will always consider the cost shared between 1 worker and the leader, as per the below calc in cost_seqscan:

if (path->parallel_degree > 0)
    run_cost = run_cost / (path->parallel_degree + 0.5);

I think this might not be the appropriate cost model even for non-inheritance relations which have more pages than parallel_threshold, but it seems to be even worse for inheritance children which have fewer pages than parallel_threshold.

B. If none of the inheritance child rels (or very few of them) are big enough for a parallel scan, is considering an Append node for parallelism of any use? In other words, won't it be better to generate the plan as it is done now, without this patch, for such cases, considering the current execution model of the Gather node?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown wrote:
> Okay, I've tried this patch.

Thanks!

> Yes, it's working!

Woohoo.

> However, the first parallel seq scan shows it getting 170314 rows.
> Another run shows it getting 194165 rows. The final result is
> correct, but as you can see from the rows on the Append node (59094295
> rows), it doesn't match the number of rows on the Gather node
> (3000).

Is this the same issue reported in http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com and not yet fixed? I am inclined to think it probably is.

> And also, for some reason, I can no longer get this using more than 2
> workers, even with max_worker_processes = 16 and max_parallel_degree =
> 12. I don't know if that's anything to do with this patch though.

The number of workers is limited based on the size of the largest table involved in the Append. That probably needs considerable improvement, of course, but this patch is still a step forward over not-this-patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
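For illustration, one plausible size-based heuristic of the kind described (this is an assumption made for the sketch, not the patch's exact logic): grant one worker once the relation crosses the page threshold, and add another each time its size triples, capped by max_parallel_degree.

```c
/*
 * Hypothetical sketch of a size-based worker heuristic (not the actual
 * patch logic): one worker at the page threshold, one more each time the
 * relation triples in size, never exceeding max_parallel_degree.
 */
static int
guess_parallel_degree(double pages, int max_parallel_degree)
{
    const double threshold = 1000.0;    /* same threshold as for plain rels */
    int         degree;

    if (pages < threshold)
        return 0;               /* too small to bother parallelizing */

    degree = 1;
    while (pages >= threshold * 3 && degree < max_parallel_degree)
    {
        pages /= 3;
        degree++;
    }
    return degree;
}
```

Under such a scheme, a many-child Append whose largest member is modest in size would be capped at a small worker count no matter how large the hierarchy is in total, which matches the behavior Thom observed.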
Re: [HACKERS] [DESIGN] ParallelAppend
On 13 November 2015 at 22:09, Robert Haas wrote:
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai wrote:
>> I'm now designing the parallel feature of Append...
>>
>> Here is one challenge. How do we determine whether each sub-plan
>> allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases. The attached patch
> shows what I have in mind for phase 1. Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
>
> Append
>   -> Gather
>        -> Parallel Seq Scan
>   -> Gather
>        -> Parallel Seq Scan
>   -> Seq Scan
>
> This is a crappy plan. Each Gather node will try to launch its own
> bunch of workers, which sucks. The attached patch lets you get this
> plan instead:
>
> Gather
>   -> Append
>        -> Partial Seq Scan
>        -> Partial Seq Scan
>        -> Partial Seq Scan
>
> That's a much better plan.
>
> To make this work, this patch introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan. Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path, or we can use those partial paths
> to build more partial paths. The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel. But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path to form a partial path for the joinrel.
> > This requires some way of figuring out how many workers to request for > the append-path, so this patch also adds a parallel_degree field to > the path object, which is 0 for non-parallel things and the number of > workers that the path believes to be ideal otherwise. Right now, it > just percolates the highest worker count of any child up to the > appendrel, which might not be right, especially once the append node > knows how to balance workers, but it seems like a reasonable place to > start. > >> Type-A) It can be performed on background worker context and >>picked up by multiple worker processes concurrently. >> (e.g: Parallel SeqScan) >> Type-B) It can be performed on background worker context but >> cannot be picked up by multiple worker processes. >> (e.g: non-parallel aware node) >> Type-C) It should not be performed on background worker context. >>(e.g: plan/path node with volatile functions) > > At the time that we're forming an AppendPath, we can identify what > you're calling type-A paths very easily: they're the ones that are > coming from the partial_pathlist. We can distinguish between type-B > and type-C paths coming from the childrel's pathlist based on the > childrel's consider_parallel flag: if it's false, it's type-C, else > type-B. At some point, we might need a per-path flag to distinguish > cases where a particular path is type-C even though some other plan > for that relation might be type-B. But right now that case doesn't > arise. > > Now, of course, it's not good enough to have this information > available when we're generating the AppendPath; it has to survive > until execution time. But that doesn't mean the paths need to be > self-identifying. We could, of course, decide to make them so, and > maybe that's the best design, but I'm thinking about another approach: > suppose the append node itself, instead of having one list of child > plans, has a list of type-A plans, a list of type-B plans, and a list > of type-C plans. 
This actually seems quite convenient, because at > execution time, you presumably want the leader to prioritize type-C > plans over any of the others, since it's the only one that can execute > them, and the workers to prioritize type-B plans, since they can only > take one worker each and are thus prone to be the limiting factor on > finishing the whole Append. Having the plans divided in advance > between these three lists (say, restricted_plans, safe_plans, > parallel_plans) makes that easy to implement. > > Incidentally, I think it's subtly wrong to think of the parallel_aware > flag as telling you whether the plan can absorb multiple workers. > That's not really what it's for. It's to tell you whether the plan is > doing *something* parallel aware - that is, whether its Estimate, > InitializeDSM, and InitializeWorker callbacks should do anything. For > SeqScan, flipping parallel_aware actually does split the input among > all the workers, but for Append it's probably just load balances and > for other nodes it might be something else again. The term I'm using > to indicate a path/plan that returns only a subset of the results in > each worker is "partial".
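[Editor's note: the three-way division Robert proposes can be sketched as below. This is an illustrative Python sketch, not the patch's code; the dict field names (`is_partial`, `consider_parallel`) stand in for "came from the rel's partial_pathlist" and "the childrel's consider_parallel flag" respectively.]

```python
def classify_child_paths(children):
    """Split an Append's child paths into Robert's three lists:
    parallel_plans (type-A), safe_plans (type-B), restricted_plans
    (type-C).  A path from the partial_pathlist is type-A; otherwise
    the childrel's consider_parallel flag separates type-B from type-C."""
    parallel_plans, safe_plans, restricted_plans = [], [], []
    for child in children:
        if child['is_partial']:
            parallel_plans.append(child['name'])      # type-A
        elif child['consider_parallel']:
            safe_plans.append(child['name'])          # type-B
        else:
            restricted_plans.append(child['name'])    # type-C
    return parallel_plans, safe_plans, restricted_plans

# Hypothetical children of one Append node, for illustration only.
children = [
    {'name': 'p1_partial_seqscan', 'is_partial': True,  'consider_parallel': True},
    {'name': 'p2_seqscan',         'is_partial': False, 'consider_parallel': True},
    {'name': 'p3_volatile_scan',   'is_partial': False, 'consider_parallel': False},
]
```

Dividing the plans once, at plan time, means the executor never has to re-derive safety at run time.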
Re: [HACKERS] [DESIGN] ParallelAppend
On 17 November 2015 at 20:08, Robert Haas wrote: > On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown wrote: > >> However, the first parallel seq scan shows it getting 170314 rows. >> Another run shows it getting 194165 rows. The final result is >> correct, but as you can see from the rows on the Append node (59094295 >> rows), it doesn't match the number of rows on the Gather node >> (3000). > > Is this the same issue reported in > http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com > and not yet fixed? I am inclined to think it probably is. Yes, that seems to be the same issue. >> And also, for some reason, I can no longer get this using more than 2 >> workers, even with max_worker_processes = 16 and max_parallel_degree = >> 12. I don't know if that's anything to do with this patch though. > > The number of workers is limited based on the size of the largest > table involved in the Append. That probably needs considerable > improvement, of course, but this patch is still a step forward over > not-this-patch. Ah, okay. I wasn't aware of this. I'll bear that in mind in future. Thom
Re: [HACKERS] [DESIGN] ParallelAppend
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai wrote: > > I'm now designing the parallel feature of Append... > > > > Here is one challenge. How do we determine whether each sub-plan > > allows execution in the background worker context? > > I've been thinking about these questions for a bit now, and I think we > can work on improving Append in multiple phases. The attached patch > shows what I have in mind for phase 1. Currently, if you set up an > inheritance hierarchy, you might get an Append some of whose children > are Gather nodes with Parallel Seq Scans under them, like this: > > Append > -> Gather > -> Parallel Seq Scan > -> Gather > -> Parallel Seq Scan > -> Seq Scan > > This is a crappy plan. Each Gather node will try to launch its own > bunch of workers, which sucks. The attached patch lets you get this > plan instead: > > Gather > -> Append > -> Partial Seq Scan > -> Partial Seq Scan > -> Partial Seq Scan > > That's a much better plan. > > To make this work, this plan introduces a couple of new concepts. > Each RelOptInfo gets a new partial_pathlist field, which stores paths > that need to be gathered to produce a workable plan. Once we've > populated the partial_pathlist with whatever partial paths we know how > to generate, we can either push a Gather node on top of one of those > partial paths to create a real path; or we can use those partial paths > to build more partial paths. The current patch handles only the > simplest case: we can build a partial path for an appendrel by > appending a partial path for each member rel. But eventually I hope > to handle joinrels this way as well: you can join a partial path with > an ordinary path to form a partial path for the joinrel. > This idea solves my concern gracefully. The new partial_pathlist keeps candidate path-nodes to be gathered at this level or above. Unlike the path-nodes already in the pathlist, we don't need to rip off a GatherPath later. 
Can we expect that path-nodes in the partial_pathlist never contain an underlying GatherPath, even when we apply this design to joinrels as well? If we can ensure that every path-node in the partial_pathlist is safe to run under a Gather node - that it never contains a Gather itself, or anything else that must not run in the background worker context - it will make path consideration much simpler than I expected. > This requires some way of figuring out how many workers to request for > the append-path, so this patch also adds a parallel_degree field to > the path object, which is 0 for non-parallel things and the number of > workers that the path believes to be ideal otherwise. Right now, it > just percolates the highest worker count of any child up to the > appendrel, which might not be right, especially once the append node > knows how to balance workers, but it seems like a reasonable place to > start. > I agree. The new parallel_degree will give useful information, and one worker per child relation is a good start. > > Type-A) It can be performed on background worker context and > >picked up by multiple worker processes concurrently. > > (e.g: Parallel SeqScan) > > Type-B) It can be performed on background worker context but > > cannot be picked up by multiple worker processes. > > (e.g: non-parallel aware node) > > Type-C) It should not be performed on background worker context. > >(e.g: plan/path node with volatile functions) > > At the time that we're forming an AppendPath, we can identify what > you're calling type-A paths very easily: they're the ones that are > coming from the partial_pathlist. We can distinguish between type-B > and type-C paths coming from the childrel's pathlist based on the > childrel's consider_parallel flag: if it's false, it's type-C, else > type-B. At some point, we might need a per-path flag to distinguish > cases where a particular path is type-C even though some other plan > for that relation might be type-B. But right now that case doesn't > arise. > > Now, of course, it's not good enough to have this information > available when we're generating the AppendPath; it has to survive > until execution time. But that doesn't mean the paths need to be > self-identifying. We could, of course, decide to make them so, and > maybe that's the best design, but I'm thinking about another approach: > suppose the append node itself, instead of having one list of child > plans, has a list of type-A plans, a list of type-B plans, and a list > of type-C plans. 
But right now that case doesn't > arise. > I also think we will eventually need a per-path flag once we support parallel capability on joinrels, although that is not an immediate action. > Now, of course, it's not good enough to have this information > available when we're generating the AppendPath; it has to survive > until execution time. But that doesn't mean the paths need to be > self-identifying. We could, of course, decide to make them so, and > maybe that's the best design, but I'm thinking about another approach: > suppose the append node itself, instead of having one list of child > plans, has a list of type-A plans, a list of type-B plans, and a list > of type-C plans. This actually seems quite convenient, because at > execution time, you presumably want the leader to prioritize type-C > plans over any of the others, since it's the only one that can execute > them, and the workers to prioritize type-B plans, since they can only > take one worker each and are thus prone to be the limiting factor on > finishing the whole Append.
Re: [HACKERS] [DESIGN] ParallelAppend
On Mon, Nov 16, 2015 at 10:10 AM, Kouhei Kaigai wrote: > This idea will solve my concern gracefully. > The new partial_pathlist keeps candidate of path-nodes to be gathered > in this level or upper. Unlike path-nodes in the pathlist already, we > don't need to rip off GatherPath later. Cool, yes. > Can we expect any path-nodes in the partial_pathlist don't contain > underlying GatherPath even if and when we would apply this design on > joinrel also? Yes. A path that's already been gathered is not partial any more. However, to create a partial path for a joinrel, we must join a partial path to a complete path. The complete path mustn't be one which internally contains a Gather. This is where we need another per-path flag, I think. > I'd like to agree with this idea. If Append can handle restricted_plans > concurrently with safe_plans and parallel_plans, we don't need to give > up parallelism even if any of child relation has neither safe- nor > parallel-plans. Right. > One thing we need to pay attention is, we have to inform Gather node > to kick local sub-plans if underlying Append node has any restricted > plans. It also needs to distinguish the case when Gather node cannot > launch any background workers, because the first case runs only type-C > but the second case has to run all the sub-plans in local context. I don't think that Gather needs to know anything about what's under the Append. What I think we want is that when we execute the Append: (1) If we're the leader or not in parallel mode, run restricted plans, then parallel plans, then safe plans. (2) If we're a worker, run safe plans, then parallel plans. (3) Either way, never run a safe plan if the leader or some other worker has already begun to execute it. The reason to have the leader prefer parallel plans to safe plans is that it is more likely to become a bottleneck than the workers. Thus it should prefer to do work which can be split up rather than claiming a whole plan for itself. 
But in the case of restricted plans it has no choice, since no one else can execute those, and it should do them first, since they may be the limiting factor in finishing the whole plan. >> Incidentally, I think it's subtly wrong to think of the parallel_aware >> flag as telling you whether the plan can absorb multiple workers. >> That's not really what it's for. It's to tell you whether the plan is >> doing *something* parallel aware - that is, whether its Estimate, >> InitializeDSM, and InitializeWorker callbacks should do anything. For >> SeqScan, flipping parallel_aware actually does split the input among >> all the workers, but for Append it's probably just load balances and >> for other nodes it might be something else again. The term I'm using >> to indicate a path/plan that returns only a subset of the results in >> each worker is "partial". >> > Therefore, a NestLoop that takes underlying ParallelSeqScan and IndexScan > may not be parallel aware by itself; however, it is exactly partial. Right. > This NestLoop will have parallel_degree likely larger than "1", won't it? Larger than 0. > It seems to me "partial" is a clearer concept to introduce how a > sub-plan will perform. Good. >> Whether or not a path is partial is, in the >> design embodied in this patch, indicated both by whether >> path->parallel_degree > 0 and whether the path is in rel->pathlist or >> rel->partial_pathlist. >> > We should have an Assert to detect paths with parallel_degree==0 but in > the rel->partial_pathlist, or parallel_degree > 1 but not appearing in > the rel->partial_pathlist? parallel_degree==0 in the partial_pathlist is bad, but parallel_degree>0 in the regular pathlist is OK, at least if it's a Gather node. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
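[Editor's note: Robert's execution rules (1)-(3) above can be sketched as a single selection function. This is an illustrative Python sketch of the scheduling policy, not the executor's code; in reality the "already started" bookkeeping would live in dynamic shared memory with locking.]

```python
def next_subplan(role, restricted, safe, parallel, started):
    """Pick the next Append subplan for one process.
    Leader priority: restricted, then parallel, then safe plans.
    Worker priority: safe, then parallel plans.
    'started' is the shared set of single-use (restricted/safe)
    subplans that some process has already claimed; a parallel plan
    may be shared by any number of processes."""
    if role == 'leader':
        order = [restricted, parallel, safe]
    else:  # worker
        order = [safe, parallel]
    for group in order:
        for plan in group:
            if group is parallel:
                return plan            # shareable: no claim needed
            if plan not in started:
                started.add(plan)      # single-use plan: claim it
                return plan
    return None
```

The leader preferring parallel over safe plans encodes Robert's point that the leader is the likeliest bottleneck, so it should favor work that can be split up.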
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai wrote: > I'm now designing the parallel feature of Append... > > Here is one challenge. How do we determine whether each sub-plan > allows execution in the background worker context? I've been thinking about these questions for a bit now, and I think we can work on improving Append in multiple phases. The attached patch shows what I have in mind for phase 1. Currently, if you set up an inheritance hierarchy, you might get an Append some of whose children are Gather nodes with Parallel Seq Scans under them, like this: Append -> Gather -> Parallel Seq Scan -> Gather -> Parallel Seq Scan -> Seq Scan This is a crappy plan. Each Gather node will try to launch its own bunch of workers, which sucks. The attached patch lets you get this plan instead: Gather -> Append -> Partial Seq Scan -> Partial Seq Scan -> Partial Seq Scan That's a much better plan. To make this work, this plan introduces a couple of new concepts. Each RelOptInfo gets a new partial_pathlist field, which stores paths that need to be gathered to produce a workable plan. Once we've populated the partial_pathlist with whatever partial paths we know how to generate, we can either push a Gather node on top of one of those partial paths to create a real path; or we can use those partial paths to build more partial paths. The current patch handles only the simplest case: we can build a partial path for an appendrel by appending a partial path for each member rel. But eventually I hope to handle joinrels this way as well: you can join a partial path with an ordinary path to form a partial path for the joinrel. This requires some way of figuring out how many workers to request for the append-path, so this patch also adds a parallel_degree field to the path object, which is 0 for non-parallel things and the number of workers that the path believes to be ideal otherwise. 
Right now, it just percolates the highest worker count of any child up to the appendrel, which might not be right, especially once the append node knows how to balance workers, but it seems like a reasonable place to start. > Type-A) It can be performed on background worker context and >picked up by multiple worker processes concurrently. > (e.g: Parallel SeqScan) > Type-B) It can be performed on background worker context but > cannot be picked up by multiple worker processes. > (e.g: non-parallel aware node) > Type-C) It should not be performed on background worker context. >(e.g: plan/path node with volatile functions) At the time that we're forming an AppendPath, we can identify what you're calling type-A paths very easily: they're the ones that are coming from the partial_pathlist. We can distinguish between type-B and type-C paths coming from the childrel's pathlist based on the childrel's consider_parallel flag: if it's false, it's type-C, else type-B. At some point, we might need a per-path flag to distinguish cases where a particular path is type-C even though some other plan for that relation might be type-B. But right now that case doesn't arise. Now, of course, it's not good enough to have this information available when we're generating the AppendPath; it has to survive until execution time. But that doesn't mean the paths need to be self-identifying. We could, of course, decide to make them so, and maybe that's the best design, but I'm thinking about another approach: suppose the append node itself, instead of having one list of child plans, has a list of type-A plans, a list of type-B plans, and a list of type-C plans. 
This actually seems quite convenient, because at execution time, you presumably want the leader to prioritize type-C plans over any of the others, since it's the only one that can execute them, and the workers to prioritize type-B plans, since they can only take one worker each and are thus prone to be the limiting factor on finishing the whole Append. Having the plans divided in advance between these three lists (say, restricted_plans, safe_plans, parallel_plans) makes that easy to implement. Incidentally, I think it's subtly wrong to think of the parallel_aware flag as telling you whether the plan can absorb multiple workers. That's not really what it's for. It's to tell you whether the plan is doing *something* parallel aware - that is, whether its Estimate, InitializeDSM, and InitializeWorker callbacks should do anything. For SeqScan, flipping parallel_aware actually does split the input among all the workers, but for Append it's probably just load balances and for other nodes it might be something else again. The term I'm using to indicate a path/plan that returns only a subset of the results in each worker is "partial". Whether or not a path is partial is, in the design embodied in this patch, indicated both by whether path->parallel_degree > 0 and whether the path is in rel->pathlist or rel->partial_pathlist. -- Robert Haas
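[Editor's note: two small invariants from the message above - how parallel_degree percolates up to the appendrel, and what "partial" means in terms of parallel_degree and list membership - can be sketched as below. Python sketch of the C logic; function names are illustrative.]

```python
def append_parallel_degree(child_partial_degrees):
    """Percolate the highest worker count of any child partial path up
    to the appendrel, as the phase-1 patch does; 0 means the appendrel
    ends up with no partial path at all."""
    return max(child_partial_degrees, default=0)

def check_partial_pathlist(partial_pathlist_degrees):
    """Invariant implied by the design: every path stored in a
    rel->partial_pathlist must advertise parallel_degree > 0."""
    return all(degree > 0 for degree in partial_pathlist_degrees)
```

As the message notes, taking the max may be wrong once the Append node learns to balance workers among children, but it is a reasonable starting point.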
Re: [HACKERS] [DESIGN] ParallelAppend
> On 2015/11/12 14:09, Kouhei Kaigai wrote: > > I'm now designing the parallel feature of Append... > > > > Here is one challenge. How do we determine whether each sub-plan > > allows execution in the background worker context? > > > > The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added > > 'parallel_aware' flag on Path and Plan structure. > > It tells us whether the plan/path-node can perform by multiple > > background worker processes concurrently, but also tells us > > nothing whether the plan/path-node are safe to run on background > > worker process context. > > When I was looking at the recent parallelism related commits, I noticed a > RelOptInfo.consider_parallel flag. That and the function > set_rel_consider_parallel() may be of interest in this regard. > set_append_rel_size() passes the parent's state of this flag down to child > relations but I guess that's not what you're after. > Thanks for this information. Indeed, it tells us which base relations are valid for parallel execution. In the case of parallel append, we can give up parallelism if any of the underlying base relations is not parallel aware. We can potentially use the same logic for join relation cases. A challenge is how to count up the optimal number of background worker processes. I assume the number of workers for a parallel append shall be the sum of the numbers of workers required by the sub-plans, as long as it does not exceed max_parallel_degree. Probably, we need pathnode_tree_walker() to count up the number of workers required by the sub-plans. BTW, is the idea of the consider_parallel flag in RelOptInfo workable for join relations as well? In the case of A LEFT JOIN B, for example, the join can be parallel aware if the join path has A as its outer input, but it cannot be if B were the outer input. I suspect this kind of information belongs to Path, not RelOptInfo. 
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei
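[Editor's note: KaiGai's worker-counting assumption above reduces to a one-liner, sketched here in Python for illustration. Sub-plans that cannot run in a worker at all would contribute 0 to the sum.]

```python
def parallel_append_workers(child_workers, max_parallel_degree):
    """KaiGai's proposal: request the sum of the workers wanted by the
    sub-plans, clamped to max_parallel_degree.  A non-parallel-aware
    sub-plan contributes 1; a leader-only sub-plan contributes 0."""
    return min(sum(child_workers), max_parallel_degree)
```

Contrast this with the phase-1 patch, which percolates the maximum child worker count instead of the clamped sum.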
Re: [HACKERS] [DESIGN] ParallelAppend
I'm now designing the parallel feature of Append... Here is one challenge. How do we determine whether each sub-plan allows execution in the background worker context? The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added a 'parallel_aware' flag on the Path and Plan structures. It tells us whether the plan/path-node can be performed by multiple background worker processes concurrently, but it tells us nothing about whether the plan/path-node is safe to run in the background worker process context. From the standpoint of parallel execution, I understand there are three types of plan/path nodes. Type-A) It can be performed in the background worker context and picked up by multiple worker processes concurrently. (e.g: Parallel SeqScan) Type-B) It can be performed in the background worker context but cannot be picked up by multiple worker processes. (e.g: non-parallel aware node) Type-C) It should not be performed in the background worker context. (e.g: plan/path node with volatile functions) The 'parallel_aware' flag allows us to distinguish type-A from the others; however, we cannot distinguish type-B from type-C. From the standpoint of parallel append, it makes sense to launch background workers even if all the sub-plans are type-B, with no type-A node. Sorry for the late reply. I'd like to suggest having 'int num_workers' in the Path and Plan nodes as a common field. We give this field the following three meanings. - If num_workers > 1, it is a type-A node; thus, a parallel aware Append node may assign more than one worker to this node. - If num_workers == 1, it is a type-B node; thus, more than one background worker process shall never be assigned. - If num_workers == 0, it is a type-C node; thus, the planner shall give up running this node in the background worker context. The num_workers state shall propagate to the upper node. 
For example, I expect a HashJoin node that takes Parallel SeqScan with num_workers == 4 as outer input will also have num_workers == 4, as long as join clauses are safe to run on background worker side. How about this idea? Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kai...@ak.jp.nec.com> > -Original Message- > From: pgsql-hackers-ow...@postgresql.org > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai > Sent: Saturday, October 31, 2015 1:35 AM > To: Robert Haas > Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI > Subject: Re: [HACKERS] [DESIGN] ParallelAppend > > > On Wed, Oct 28, 2015 at 3:55 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote: > > > At PGconf.EU, I could have a talk with Robert about this topic, > > > then it became clear we have same idea. > > > > > >> ++ > > >> |sub-plan | * Sub-Plan 1 ... Index Scan on p1 > > >> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2 > > >> |shared | * Sub-Plan 2 ... PartialSeqScan on p2 > > >> |memory | * Sub-Plan 2 ... PartialSeqScan on p2 > > >> +-+ * Sub-Plan 3 ... Index Scan on p3 > > >> > > > In the above example, I put non-parallel sub-plan to use only > > > 1 slot of the array, even though a PartialSeqScan takes 3 slots. > > > It is a strict rule; non-parallel aware sub-plan can be picked > > > up once. > > > The index of sub-plan array is initialized to 0, then increased > > > to 5 by each workers when it processes the parallel-aware Append. > > > So, once a worker takes non-parallel sub-plan, other worker can > > > never take the same slot again, thus, no duplicated rows will be > > > produced by non-parallel sub-plan in the parallel aware Append. > > > Also, this array structure will prevent too large number of > > > workers pick up a particular parallel aware sub-plan, because > > > PartialSeqScan occupies 3 slots; that means at most three workers > > > can pick up this sub-plan. 
If 1st worker took the IndexScan on > > > p1, and 2nd-4th worker took the PartialSeqScan on p2, then the > > > 5th worker (if any) will pick up the IndexScan on p3 even if > > > PartialSeqScan on p2 was not completed. > > > > Actually, this is not exactly what I had in mind. I was thinking that > > we'd have a single array whose length is equal to the number of Append > > subplans, and each element of the array would be a count of the number > > of workers executing that subplan. So there wouldn't be multiple > > entries for the same subplan, as you propose here. To distinguish > > between parallel-aware and non-parallel-aware plans, I plan to put a > > Boolean flag in the plan itself. > > > I don't have strong preference here. Both of design can implement the > requirement; none-parallel
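[Editor's note: KaiGai's proposed num_workers semantics, and the HashJoin propagation example he gives, can be sketched as below. Python sketch for illustration; the propagation rule (inherit the outer input's num_workers when everything involved is worker-safe, otherwise degrade to 0) is an assumption extrapolated from his example, not settled design.]

```python
def subplan_type(num_workers):
    """The three-way meaning KaiGai proposes for num_workers."""
    if num_workers > 1:
        return 'A'   # parallel aware: many workers may share it
    if num_workers == 1:
        return 'B'   # worker-safe, but single-use
    return 'C'       # must run in the leader only

def join_num_workers(outer_num_workers, inner_num_workers, clauses_worker_safe):
    """Propagation sketch: a join inherits its outer input's
    num_workers if the join clauses and the inner input are safe to
    run in a worker; otherwise it becomes leader-only (0)."""
    if not clauses_worker_safe or inner_num_workers == 0:
        return 0
    return outer_num_workers
```

With this rule, a HashJoin over a Parallel SeqScan with num_workers == 4 and a worker-safe inner IndexScan also reports num_workers == 4, matching the example in the message.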
Re: [HACKERS] [DESIGN] ParallelAppend
On 2015/11/12 14:09, Kouhei Kaigai wrote: > I'm now designing the parallel feature of Append... > > Here is one challenge. How do we determine whether each sub-plan > allows execution in the background worker context? > > The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added > 'parallel_aware' flag on Path and Plan structure. > It tells us whether the plan/path-node can perform by multiple > background worker processes concurrently, but also tells us > nothing whether the plan/path-node are safe to run on background > worker process context. When I was looking at the recent parallelism related commits, I noticed a RelOptInfo.consider_parallel flag. That and the function set_rel_consider_parallel() may be of interest in this regard. set_append_rel_size() passes the parent's state of this flag down to child relations but I guess that's not what you're after. Thanks, Amit
Re: [HACKERS] [DESIGN] ParallelAppend
On Wed, Oct 28, 2015 at 3:55 PM, Kouhei Kaigai wrote: > At PGconf.EU, I could have a talk with Robert about this topic, > then it became clear we have same idea. > >> ++ >> |sub-plan | * Sub-Plan 1 ... Index Scan on p1 >> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2 >> |shared | * Sub-Plan 2 ... PartialSeqScan on p2 >> |memory | * Sub-Plan 2 ... PartialSeqScan on p2 >> +-+ * Sub-Plan 3 ... Index Scan on p3 >> > In the above example, I put non-parallel sub-plan to use only > 1 slot of the array, even though a PartialSeqScan takes 3 slots. > It is a strict rule; non-parallel aware sub-plan can be picked > up once. > The index of sub-plan array is initialized to 0, then increased > to 5 by each workers when it processes the parallel-aware Append. > So, once a worker takes non-parallel sub-plan, other worker can > never take the same slot again, thus, no duplicated rows will be > produced by non-parallel sub-plan in the parallel aware Append. > Also, this array structure will prevent too large number of > workers pick up a particular parallel aware sub-plan, because > PartialSeqScan occupies 3 slots; that means at most three workers > can pick up this sub-plan. If 1st worker took the IndexScan on > p1, and 2nd-4th worker took the PartialSeqScan on p2, then the > 5th worker (if any) will pick up the IndexScan on p3 even if > PartialSeqScan on p2 was not completed. Actually, this is not exactly what I had in mind. I was thinking that we'd have a single array whose length is equal to the number of Append subplans, and each element of the array would be a count of the number of workers executing that subplan. So there wouldn't be multiple entries for the same subplan, as you propose here. To distinguish between parallel-aware and non-parallel-aware plans, I plan to put a Boolean flag in the plan itself. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
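[Editor's note: Robert's counter-array design - one element per Append subplan holding the number of workers currently executing it - can be sketched as below. Python sketch only; in the real design the array would live in dynamic shared memory and be updated under a lock or atomically.]

```python
def choose_subplan(worker_counts, parallel_aware, finished):
    """Pick a subplan for a process joining the Append.
    worker_counts[i] counts workers currently on subplan i;
    parallel_aware[i] says whether subplan i may be shared.
    Policy: take the unfinished subplan with the fewest workers,
    never joining a non-parallel-aware subplan someone already took."""
    best = None
    for i, count in enumerate(worker_counts):
        if finished[i]:
            continue
        if not parallel_aware[i] and count > 0:
            continue                       # single-use subplan is taken
        if best is None or count < worker_counts[best]:
            best = i
    if best is not None:
        worker_counts[best] += 1           # register ourselves
    return best
```

Because there is exactly one array element per subplan, a parallel-aware subplan naturally absorbs any number of workers, while the count > 0 check enforces the single-use rule for the rest.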
Re: [HACKERS] [DESIGN] ParallelAppend
> On Wed, Oct 28, 2015 at 3:55 PM, Kouhei Kaigai wrote: > > At PGconf.EU, I could have a talk with Robert about this topic, > > then it became clear we have same idea. > > > >> ++ > >> |sub-plan | * Sub-Plan 1 ... Index Scan on p1 > >> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2 > >> |shared | * Sub-Plan 2 ... PartialSeqScan on p2 > >> |memory | * Sub-Plan 2 ... PartialSeqScan on p2 > >> +-+ * Sub-Plan 3 ... Index Scan on p3 > >> > > In the above example, I put non-parallel sub-plan to use only > > 1 slot of the array, even though a PartialSeqScan takes 3 slots. > > It is a strict rule; non-parallel aware sub-plan can be picked > > up once. > > The index of sub-plan array is initialized to 0, then increased > > to 5 by each workers when it processes the parallel-aware Append. > > So, once a worker takes non-parallel sub-plan, other worker can > > never take the same slot again, thus, no duplicated rows will be > > produced by non-parallel sub-plan in the parallel aware Append. > > Also, this array structure will prevent too large number of > > workers pick up a particular parallel aware sub-plan, because > > PartialSeqScan occupies 3 slots; that means at most three workers > > can pick up this sub-plan. If 1st worker took the IndexScan on > > p1, and 2nd-4th worker took the PartialSeqScan on p2, then the > > 5th worker (if any) will pick up the IndexScan on p3 even if > > PartialSeqScan on p2 was not completed. > > Actually, this is not exactly what I had in mind. I was thinking that > we'd have a single array whose length is equal to the number of Append > subplans, and each element of the array would be a count of the number > of workers executing that subplan. So there wouldn't be multiple > entries for the same subplan, as you propose here. To distinguish > between parallel-aware and non-parallel-aware plans, I plan to put a > Boolean flag in the plan itself. > I don't have strong preference here. 
Both designs can implement the requirement: non-parallel sub-plans are never picked up twice, and parallel-aware sub-plans can be picked up multiple times. So, I'll start with your suggestion above. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei
Re: [HACKERS] [DESIGN] ParallelAppend
At PGconf.EU, I had a talk with Robert about this topic, and it became clear we have the same idea. > ++ > |sub-plan | * Sub-Plan 1 ... Index Scan on p1 > |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2 > |shared | * Sub-Plan 2 ... PartialSeqScan on p2 > |memory | * Sub-Plan 2 ... PartialSeqScan on p2 > +-+ * Sub-Plan 3 ... Index Scan on p3 > In the above example, I made the non-parallel sub-plans use only 1 slot of the array, even though a PartialSeqScan takes 3 slots. It is a strict rule: a non-parallel aware sub-plan can be picked up only once. The index of the sub-plan array is initialized to 0, then increased to 5 by the workers as they process the parallel-aware Append. So, once a worker takes a non-parallel sub-plan, no other worker can take the same slot again; thus, no duplicated rows will be produced by a non-parallel sub-plan in the parallel aware Append. Also, this array structure will prevent too large a number of workers picking up a particular parallel aware sub-plan, because the PartialSeqScan occupies 3 slots; that means at most three workers can pick up this sub-plan. If the 1st worker took the IndexScan on p1, and the 2nd-4th workers took the PartialSeqScan on p2, then the 5th worker (if any) will pick up the IndexScan on p3 even if the PartialSeqScan on p2 was not completed. One other thought experiment: what happens if a parallel-aware Append underlies another parallel-aware Append? As the name implies, a parallel-aware Append is parallel-aware; thus, it can occupy multiple slots in the array of sub-plans, like: subplans[0] ... SeqScan on p1 subplans[1] ... Parallel Append on p2+p3+p4 subplans[2] ... Parallel Append on p2+p3+p4 subplans[3] ... Parallel Append on p2+p3+p4 subplans[4] ... Parallel Append on p2+p3+p4 subplans[5] ... IndexScan on p5 Also, assume the child parallel-aware Append has the following array. subplans[0] ... SeqScan on p2 subplans[1] ... PartialSeqScan on p3 subplans[2] ... PartialSeqScan on p3 subplans[3] ... 
SeqScan on p4

The Gather node located on top of the upper Append node will launch (at most) 6 workers, according to the requirement of the upper Append node. Each worker picks up a particular sub-plan in the upper Append's array, so some of them (4 workers at most) will execute the child Append node. These 4 workers will then also pick up a particular sub-plan in the child Append's array. It works even if the number of workers is less than optimal, so I believe the overall design is reasonable. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei <kai...@ak.jp.nec.com>
Re: [HACKERS] [DESIGN] ParallelAppend
On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai wrote:
> I entirely agree with your suggestion.
>
> We may be able to use an analogy between PartialSeqScan and the
> parallel-aware Append node.
> PartialSeqScan fetches blocks pointed by the index on shared memory
> segment, thus multiple workers eventually co-operate to scan a table
> using round-robin manner even though individual worker fetches
> comb-shaped blocks.
> If we assume individual blocks are individual sub-plans on the parallel
> aware Append, it performs very similar. A certain number of workers
> (more than zero) is launched by Gather node, then the parallel aware
> Append node fetches one of the sub-plans if any.

Exactly, except for the part about the blocks being "comb-shaped", which doesn't seem to make sense.

> I think, this design also gives additional flexibility according to
> the required parallelism by the underlying sub-plans.
> Please assume the "PartialSeqScan on p2" in the above example wants
> 3 workers to process the scan, we can expand the virtual array of
> the sub-plans as follows. Then, if Gather node kicks 5 workers,
> individual workers are assigned on some of plans. If Gather node
> could kick less than 5 workers, the first exit worker picks the
> second sub-plan, then it eventually provides the best parallelism.
>
> +---------+
> |sub-plan |    * Sub-Plan 1 ... Index Scan on p1
> |index on *->  * Sub-Plan 2 ... PartialSeqScan on p2
> |shared   |    * Sub-Plan 2 ... PartialSeqScan on p2
> |memory   |    * Sub-Plan 2 ... PartialSeqScan on p2
> +---------+    * Sub-Plan 3 ... Index Scan on p3

I don't think the shared memory chunk should be indexed by worker, but by sub-plan. So with 3 subplans, we would initially have [0,0,0]. The first worker would grab the first subplan, and we get [1,0,0]. The second worker grabs the third subplan, so now we have [1,0,1]. The remaining workers can't join the execution of those plans because they are not truly parallel, so they all gang up on the second subplan.
At 5 workers we have [1,3,1]. Workers need not ever decrement the array entries because they only pick a new sub-plan when the one they picked previously is exhausted; thus, at the end of the plan, each element in the array shows the total number of workers that touched it at some point during its execution.

> For more generic plan construction, Plan node may have a field for
> number of "desirable" workers even though most of plan-nodes are not
> parallel aware, and it is not guaranteed.
> In above case, the parallel aware Append will want 5 workers in total
> (2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
> of Gather node how many workers are actually launched, however, it
> will give clear information how many workers are best.

Yeah, maybe. I haven't thought about this deeply just yet, but I agree it needs more consideration.

>> First, we can teach Append that, when running in parallel,
>> it should initialize a chunk of dynamic shared memory with an array
>> indicating how many workers are currently working on each subplan.
> Can the parallel-aware Append recognize the current mode using
> MyBgworkerEntry, whether it is valid or not?

No - that would be quite wrong. What it needs to do is define ExecAppendEstimate and ExecAppendInitializeDSM and call those functions from ExecParallelEstimate and ExecParallelInitializeDSM. It also needs to define a third callback ExecAppendInitializeWorker which will be called from the ExecParallelInitializeWorker function added by the latest partial seq scan patch. ExecAppendEstimate must estimate the required shared memory usage for the shared memory state; ExecAppendInitializeDSM must initialize that state, store a pointer to it in the planstate node, and add a TOC entry; ExecAppendInitializeWorker will run in the worker and should look up the TOC entry and store the result in the same planstate node that ExecAppendInitializeDSM populated in the leader.
Then ExecAppend can decide what to do based on whether that pointer is set, and based on the data to which it points. Are you going to look at implementing this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [DESIGN] ParallelAppend
> On Monday, October 26, 2015, Robert Haas wrote:
>
> On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I entirely agree with your suggestion.
> >
> > We may be able to use an analogy between PartialSeqScan and the
> > parallel-aware Append node.
> > PartialSeqScan fetches blocks pointed by the index on shared memory
> > segment, thus multiple workers eventually co-operate to scan a table
> > using round-robin manner even though individual worker fetches
> > comb-shaped blocks.
> > If we assume individual blocks are individual sub-plans on the parallel
> > aware Append, it performs very similar. A certain number of workers
> > (more than zero) is launched by Gather node, then the parallel aware
> > Append node fetches one of the sub-plans if any.
>
> Exactly, except for the part about the blocks being "comb-shaped",
> which doesn't seem to make sense.
>
> > I think, this design also gives additional flexibility according to
> > the required parallelism by the underlying sub-plans.
> > Please assume the "PartialSeqScan on p2" in the above example wants
> > 3 workers to process the scan, we can expand the virtual array of
> > the sub-plans as follows. Then, if Gather node kicks 5 workers,
> > individual workers are assigned on some of plans. If Gather node
> > could kick less than 5 workers, the first exit worker picks the
> > second sub-plan, then it eventually provides the best parallelism.
> >
> > +---------+
> > |sub-plan |    * Sub-Plan 1 ... Index Scan on p1
> > |index on *->  * Sub-Plan 2 ... PartialSeqScan on p2
> > |shared   |    * Sub-Plan 2 ... PartialSeqScan on p2
> > |memory   |    * Sub-Plan 2 ... PartialSeqScan on p2
> > +---------+    * Sub-Plan 3 ...
> > Index Scan on p3
>
> I don't think the shared memory chunk should be indexed by worker, but
> by sub-plan. So with 3 subplans, we would initially have [0,0,0].
> The first worker would grab the first subplan, and we get [1,0,0].
> The second worker grabs the third subplan, so now we have [1,0,1].
> The remaining workers can't join the execution of those plans because
> they are not truly parallel, so they all gang up on the second
> subplan. At 5 workers we have [1,3,1]. Workers need not ever
> decrement the array entries because they only pick a new sub-plan when
> the one they picked previously is exhausted; thus, at the end of the
> plan, each element in the array shows the total number of workers that
> touched it at some point during its execution.

Sorry, I could not get the point of the above explanation. The 1st worker grabs the first subplan, giving [1,0,0]; that is OK. The 2nd worker grabs the last subplan, giving [1,0,1]; that is understandable, though I am uncertain why it does not pick up the 2nd one. Why do the remaining workers have to gang up on the 2nd one (the PartialSeqScan, which is a parallel-aware execution node)? Even if only one worker is launched towards the PartialSeqScan, it scans the relation sequentially (because there are no other parallel workers yet). Then, once another worker is additionally launched to scan the same relation, both workers begin to co-operate using a common block index kept in shared memory. So, do we need to wait for completion of the non-parallel-aware nodes here? I assume it is better to launch the PartialSeqScan, even with one worker, than to synchronize, because other workers can join the execution later.

> > For more generic plan construction, Plan node may have a field for
> > number of "desirable" workers even though most of plan-nodes are not
> > parallel aware, and it is not guaranteed.
> > In above case, the parallel aware Append will want 5 workers in total
> > (2 by 2 index-scans, plus 3 by a partial-seq-scan).
> > It is a discretion of Gather node how many workers are actually
> > launched, however, it will give clear information how many workers
> > are best.
>
> Yeah, maybe. I haven't thought about this deeply just yet, but I
> agree it needs more consideration.

OK, I'll build a patch including the concept.

> >> First, we can teach Append that, when running in parallel,
> >> it should initialize a chunk of dynamic shared memory with an array
> >> indicating how many workers are currently working on each subplan.
> > Can the parallel-aware Append recognize the current mode using
> > MyBgworkerEntry, whether it is valid or not?
Re: [HACKERS] [DESIGN] ParallelAppend
> On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai wrote:
> > I'm recently working/investigating on ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > ----------
> > Its concept is quite simple anybody might consider more than once.
> > ParallelAppend node kicks background worker process to execute
> > child nodes in parallel / asynchronous.
> > It intends to improve the performance to scan a large partitioned
> > tables from standpoint of entire throughput, however, latency of
> > the first multi-hundred rows are not scope of this project.
> > From standpoint of technology trend, it primarily tries to utilize
> > multi-cores capability within a system, but also enables to expand
> > distributed database environment using foreign-tables inheritance
> > features.
> > Its behavior is very similar to Funnel node except for several
> > points, thus, we can reuse its infrastructure we have had
> > long-standing discussion through the v9.5 development cycle.
>
> Now that I've got more of the parallel infrastructure committed, I've
> been starting to have a little time to think about what we might want
> to do after we get PartialSeqScan committed. I'm positive on doing
> something with the Append node and parallelism, but I'm negative on
> doing what you've described here.
>
> I don't think the Append node has any business launching workers.
> That's the job of Gather. Append may need to do something
> parallel-aware, but starting workers is not that thing. Without
> making any changes at all to Append, we can use it like this:
>
> Gather
>  -> Append
>      -> Partial Seq Scan on p1
>      -> Partial Seq Scan on p2
>      -> Partial Seq Scan on p3
>
> The Gather node will spin up workers and in each worker we will ask
> the Append nodes for tuples. Append will ask the first
> not-yet-completed child for tuples, so the workers will cooperatively
> scan first p1, then p2, then p3.
> This is great: instead of merely
> doing a parallel seq scan of a single table, we can do a parallel seq
> scan of a partitioned table. However, there are two improvements we
> can make. First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
> Each new worker should join a subplan with the minimum number of
> workers, work on that one until it's complete, and then pick a new
> subplan. This minimizes contention. Second, we can teach the Append
> to serve as a parent not only for intrinsically parallel nodes like
> Partial Seq Scan, but also for other nodes, say, an Index Scan. When
> an Append is running in parallel but with non-parallel-aware children,
> each such child can be claimed by at most one worker process and will
> be run to completion by that worker process. For example:
>
> Gather
>  -> Append
>      -> Index Scan on p1
>      -> Partial Seq Scan on p2
>      -> Index Scan on p3
>
> The first worker which executes the Append should begin the index scan
> on p1 and the second should begin the index scan on p3. The remaining
> workers, and those two once they're finished, can work on p2.
>
> Proceeding in this way, I don't think we need a separate Parallel
> Append node. Rather, we can just give the existing Append node some
> extra smarts that are used only when it's running in parallel.

I entirely agree with your suggestion.

We may be able to use an analogy between PartialSeqScan and the parallel-aware Append node. PartialSeqScan fetches blocks pointed by the index on the shared memory segment, thus multiple workers eventually co-operate to scan a table in a round-robin manner even though each individual worker fetches comb-shaped blocks. If we assume individual blocks are individual sub-plans of the parallel-aware Append, it performs very similarly.
A certain number of workers (more than zero) is launched by the Gather node, then the parallel-aware Append node fetches one of the sub-plans, if any.

I think this design also gives additional flexibility according to the parallelism required by the underlying sub-plans. Assume the "PartialSeqScan on p2" in the above example wants 3 workers to process the scan; we can expand the virtual array of the sub-plans as follows. Then, if the Gather node kicks 5 workers, individual workers are assigned to some of the plans. If the Gather node could kick fewer than 5 workers, the first worker to exit picks the second sub-plan, so it eventually provides the best parallelism.

+---------+
|sub-plan |    * Sub-Plan 1 ... Index Scan on p1
|index on *->  * Sub-Plan 2 ... PartialSeqScan on p2
|shared   |    * Sub-Plan 2 ... PartialSeqScan on p2
|memory   |    * Sub-Plan 2 ... PartialSeqScan on p2
+---------+    * Sub-Plan 3 ... Index Scan on p3

It does not matter even if the Append node has multiple parallel-aware sub-plans. When "PartialSeqScan on p4" is added, all we need
Re: [HACKERS] [DESIGN] ParallelAppend
On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai wrote:
> I'm recently working/investigating on ParallelAppend feature
> towards the next commit fest. Below is my design proposal.
>
> 1. Concept
> ----------
> Its concept is quite simple anybody might consider more than once.
> ParallelAppend node kicks background worker process to execute
> child nodes in parallel / asynchronous.
> It intends to improve the performance to scan a large partitioned
> tables from standpoint of entire throughput, however, latency of
> the first multi-hundred rows are not scope of this project.
> From standpoint of technology trend, it primarily tries to utilize
> multi-cores capability within a system, but also enables to expand
> distributed database environment using foreign-tables inheritance
> features.
> Its behavior is very similar to Funnel node except for several
> points, thus, we can reuse its infrastructure we have had
> long-standing discussion through the v9.5 development cycle.

Now that I've got more of the parallel infrastructure committed, I've been starting to have a little time to think about what we might want to do after we get PartialSeqScan committed. I'm positive on doing something with the Append node and parallelism, but I'm negative on doing what you've described here.

I don't think the Append node has any business launching workers. That's the job of Gather. Append may need to do something parallel-aware, but starting workers is not that thing. Without making any changes at all to Append, we can use it like this:

Gather
 -> Append
     -> Partial Seq Scan on p1
     -> Partial Seq Scan on p2
     -> Partial Seq Scan on p3

The Gather node will spin up workers and in each worker we will ask the Append nodes for tuples. Append will ask the first not-yet-completed child for tuples, so the workers will cooperatively scan first p1, then p2, then p3. This is great: instead of merely doing a parallel seq scan of a single table, we can do a parallel seq scan of a partitioned table.
However, there are two improvements we can make. First, we can teach Append that, when running in parallel, it should initialize a chunk of dynamic shared memory with an array indicating how many workers are currently working on each subplan. Each new worker should join a subplan with the minimum number of workers, work on that one until it's complete, and then pick a new subplan. This minimizes contention. Second, we can teach the Append to serve as a parent not only for intrinsically parallel nodes like Partial Seq Scan, but also for other nodes, say, an Index Scan. When an Append is running in parallel but with non-parallel-aware children, each such child can be claimed by at most one worker process and will be run to completion by that worker process. For example:

Gather
 -> Append
     -> Index Scan on p1
     -> Partial Seq Scan on p2
     -> Index Scan on p3

The first worker which executes the Append should begin the index scan on p1 and the second should begin the index scan on p3. The remaining workers, and those two once they're finished, can work on p2.

Proceeding in this way, I don't think we need a separate Parallel Append node. Rather, we can just give the existing Append node some extra smarts that are used only when it's running in parallel. We can also push other things in between the Gather and the Append, which wouldn't be possible in your design. For example, consider a join between a partitioned table p and an unpartitioned table q. We could do this:

Gather
 -> Nested Loop
     -> Append
         -> Index Scan on p1
         -> Partial Seq Scan on p2
         -> Index Scan on p3
     -> Index Scan on q
          Index Cond: q.x = p.x

That's a pretty sweet plan. Assuming p1, p2, and p3 are all reasonably large, we could probably benefit from throwing at least 3 processes at this plan tree - maybe more, if p2 is really big.
Only one process can work on each of p1 and p3, but p2, since it has a truly parallel plan, can soak up as many as we want to throw at it (although performance may top out at some point if we're I/O-bound). Sorry for taking so long to give a really substantive reply on this, but it wasn't until this last week or so that I really had time to think about this in detail. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote: Here is one other issue I found. Existing code assumes a TOC segment has only one content per node type, so it uses a pre-defined key (like PARALLEL_KEY_SCAN) per node type; however, it is problematic if we put multiple PlannedStmt or PartialSeqScan nodes on a TOC segment. We have a few keys in the parallel-seq-scan patch (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for which multiple structures are shared between master and worker backends. Check if something similar can work for your use case. I think you are possibly missing the point. It could be possible, but let me summarize what I thought would be required for the above use case. For Parallel Append, we need to push multiple planned statements, in contrast to the one planned statement pushed by the current patch, and then one or more parallel workers need to work on each planned statement. So if we know in advance how many planned statements we are passing down (which we should), then using ParallelWorkerNumber (ParallelWorkerNumber % num_planned_statements or some other similar way), workers can find the planned statement on which they need to work, and similarly the information for PartialSeqScan (which currently is the parallel heap scan descriptor information). My problem is that we have no identifier to point to a particular element on the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can have multiple items. Please assume a situation where ExecPartialSeqScan() has to look up a particular item on the TOC but multiple PartialSeqScan nodes exist. Currently, it does:

    pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

However, ExecPartialSeqScan() cannot know which index is its own, and it is not reasonable to pay attention to other nodes at this level.
Even if PARALLEL_KEY_SCAN has multiple items, the PartialSeqScan node also needs to have an identifier. I think KaiGai's correct, and I pointed out the same problem to you before. The parallel key for the Partial Seq Scan needs to be allocated on the fly and carried in the node, or we'll never be able to push multiple things below the funnel. Okay, immediately I don't see what is the best way to achieve this, but let us discuss this separately on the Parallel Seq Scan thread and let me know if you have something specific in your mind. I will also give this more thought. I want to have a 'node_id' in the Plan node; a unique identifier is then assigned to the field prior to serialization. It is a property of the Plan node, so we can reproduce this identifier on the background worker side using stringToNode(), and ExecPartialSeqScan can then pull out the proper field from the TOC segment by this node_id. Probably, this structure can co-exist with the current one without big changes:

1. Define PARALLEL_KEY_DYNAMIC_LEAST as the least value that is larger than any static TOC key (like PARALLEL_KEY_TUPLE_QUEUE).
2. Run a plan-tree node walker in InitializeParallelWorkers, just before nodeToString(), to assign each node a node_id larger than the above label, increasing for each node.
3. Use node_id instead of the static PARALLEL_KEY_SCAN in ExecPartialSeqScan.

Even though some more trivial fixes are needed, it seems to me the above approach will work. Also, please note that I don't assume only PartialSeqScan wants to have its field on the TOC segment; some CustomScan nodes will also want their own shared field when co-working under a Funnel node. On the other hand, I think it is too aggressive to complete the initial work of this patch by the starting day of the next commit fest, so I think the target is the middle of October.
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Aug 25, 2015 at 6:19 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas robertmh...@gmail.com wrote: It could be possible, but let me summarize what I thought would be required for the above use case. For Parallel Append, we need to push multiple planned statements, in contrast to the one planned statement pushed by the current patch, and then one or more parallel workers need to work on each planned statement. So if we know in advance how many planned statements we are passing down (which we should), then using ParallelWorkerNumber (ParallelWorkerNumber % num_planned_statements or some other similar way), workers can find the planned statement on which they need to work, and similarly the information for PartialSeqScan (which currently is the parallel heap scan descriptor information). My problem is that we have no identifier to point to a particular element on the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can have multiple items. Please assume a situation where ExecPartialSeqScan() has to look up a particular item on the TOC but multiple PartialSeqScan nodes exist. Currently, it does:

    pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

However, ExecPartialSeqScan() cannot know which index is its own, and it is not reasonable to pay attention to other nodes at this level. Even if PARALLEL_KEY_SCAN has multiple items, the PartialSeqScan node also needs to have an identifier. Yes, that's right, and I think we can find out the same. Basically we need to know the planned statement number on which the current worker is working, and that anyway we have to do before the worker can start the work.
One way is, as I have explained above, to use ParallelWorkerNumber (ParallelWorkerNumber % num_planned_statements) to find it, or we might need some more sophisticated way to find that out, but definitely we need to know it before the worker starts execution, and once we know it we can use it to find the PARALLEL_KEY_SCAN (or whatever key) for this worker (as the position of PARALLEL_KEY_SCAN will be the same as that of the planned stmt for a worker). I think KaiGai's correct, and I pointed out the same problem to you before. The parallel key for the Partial Seq Scan needs to be allocated on the fly and carried in the node, or we'll never be able to push multiple things below the funnel. Okay, immediately I don't see what is the best way to achieve this, but let us discuss this separately on the Parallel Seq Scan thread and let me know if you have something specific in your mind. I will also give this more thought. I want to have a 'node_id' in the Plan node; a unique identifier is then assigned to the field prior to serialization. It is a property of the Plan node, so we can reproduce this identifier on the background worker side using stringToNode(), and ExecPartialSeqScan can then pull out the proper field from the TOC segment by this node_id. Okay, this can also work, but why introduce an identifier in the plan node if it can work without it? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote: Here is one other issue I found. Existing code assumes a TOC segment has only one content per node type, so it uses a pre-defined key (like PARALLEL_KEY_SCAN) per node type; however, it is problematic if we put multiple PlannedStmt or PartialSeqScan nodes on a TOC segment. We have a few keys in the parallel-seq-scan patch (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for which multiple structures are shared between master and worker backends. Check if something similar can work for your use case. I think you are possibly missing the point. It could be possible, but let me summarize what I thought would be required for the above use case. For Parallel Append, we need to push multiple planned statements, in contrast to the one planned statement as is done for the current patch, and then one or more parallel workers need to work on each planned statement. So if we know in advance how many planned statements we are passing down (which we should), then using ParallelWorkerNumber (ParallelWorkerNumber % num_planned_statements or some other similar way), workers can find the planned statement on which they need to work, and similarly the information for PartialSeqScan (which currently is the parallel heap scan descriptor information). I think KaiGai's correct, and I pointed out the same problem to you before. The parallel key for the Partial Seq Scan needs to be allocated on the fly and carried in the node, or we'll never be able to push multiple things below the funnel. Okay, immediately I don't see what is the best way to achieve this, but let us discuss this separately on the Parallel Seq Scan thread and let me know if you have something specific in your mind. I will also give this more thought. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote: Here is one other issue I found. Existing code assumes a TOC segment has only one contents per node type, so it uses pre-defined key (like PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put multiple PlannedStmt or PartialSeqScan node on a TOC segment. We have few keys in parallel-seq-scan patch (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for which multiple structures are shared between master and worker backends. Check if something similar can work for your use case. I think you are possibly missing the point. I think KaiGai's correct, and I pointed out the same problem to you before. The parallel key for the Partial Seq Scan needs to be allocated on the fly and carried in the node, or we'll never be able to push multiple things below the funnel. I'm not quite sure what you're trying to explain with this response. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Aug 13, 2015 at 5:26 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Sure, that is what we should do, however the tricky part would be when the path for doing local scan is extremely cheaper than path for parallel scan for one of the child nodes. For such cases, pulling up Funnel-node can incur more cost. I think some of the other possible ways to make this work could be to extend Funnel so that it is capable of executing both parallel and non-parallel nodes, have a new Funnel like node which has such a capability. I think it is job of (more intelligent) planner but not in the first version. If subplans of Append are mixture of nodes which has or does not have worth of parallel execution, we will be able to arrange the original form:

Append
 + Scan on rel1 (large)
 + Scan on rel2 (large)
 + Scan on rel3 (middle)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

to Funnel aware form, but partially:

Append
 + Funnel
 |  + Scan on rel1 (large)
 |  + Scan on rel2 (large)
 |  + Scan on rel3 (large)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

This is exactly what I have in mind. Here is one other issue I found. Existing code assumes a TOC segment has only one contents per node type, so it uses pre-defined key (like PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put multiple PlannedStmt or PartialSeqScan node on a TOC segment. We have few keys in parallel-seq-scan patch (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for which multiple structures are shared between master and worker backends. Check if something similar can work for your use case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: If we pull Funnel here, I think the plan shall be as follows: Funnel -- SeqScan on rel1 -- PartialSeqScan on rel2 -- IndexScan on rel3 So if we go this route, then Funnel should have capability to execute non-parallel part of plan as well, like in this case it should be able to execute non-parallel IndexScan on rel3 as well and then it might need to distinguish between parallel and non-parallel part of plans. I think this could make Funnel node complex. It is difference from what I plan now. In the above example, Funnel node has two non-parallel aware node (rel1 and rel3) and one parallel aware node, then three PlannedStmt for each shall be put on the TOC segment. Even though the background workers pick up a PlannedStmt from the three, only one worker can pick up the PlannedStmt for rel1 and rel3, however, rel2 can be executed by multiple workers simultaneously. Okay, now I got your point, but I think the cost of execution of non-parallel node by additional worker is not small considering the communication cost and setting up an addional worker for each sub-plan (assume the case where out of 100-child nodes only few (2 or 3) nodes actually need multiple workers). It is a competition between traditional Append that takes Funnel children and the new appendable Funnel that takes parallel and non-parallel children. Probably, key factors are cpu_tuple_comm_cost, parallel_setup_cost and degree of selectivity of sub-plans. Both cases has advantage and disadvantage depending on the query, so we can never determine which is better without path consideration. Sure, that is what we should do, however the tricky part would be when the path for doing local scan is extremely cheaper than path for parallel scan for one of the child nodes. For such cases, pulling up Funnel-node can incur more cost. 
I think some other possible ways to make this work could be to extend Funnel so that it is capable of executing both parallel and non-parallel nodes, or to have a new Funnel-like node which has such a capability. I think it is the job of a (more intelligent) planner, but not in the first version. If the subplans of Append are a mixture of nodes that are and are not worth parallel execution, we will be able to rearrange the original form:

Append
 + Scan on rel1 (large)
 + Scan on rel2 (large)
 + Scan on rel3 (middle)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

into a partially Funnel-aware form:

Append
 + Funnel
 |  + Scan on rel1 (large)
 |  + Scan on rel2 (large)
 |  + Scan on rel3 (middle)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

It does not require special functionality of Append/Funnel beyond what we have discussed, as long as the planner is intelligent enough. One downside of this approach is that the plan tree tends to become more complicated, which also complicates the logic to push down joins. Here is one other issue I found. Existing code assumes a TOC segment has only one entry per node type, so it uses a pre-defined key (like PARALLEL_KEY_SCAN) per node type; however, that is problematic if we put multiple PlannedStmt or PartialSeqScan nodes on a TOC segment. My idea is an enhancement of the Plan node to have a unique identifier within a particular plan tree. Once a unique identifier is assigned, we can put individual information on the TOC segment even if multiple PartialSeqScan nodes are packed. Did we discuss this topic in the past? Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
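KaiGai's per-node identifier idea can be sketched roughly as follows. If every Plan node carried an identifier unique within the plan tree, the TOC key could be derived from it instead of a single fixed PARALLEL_KEY_SCAN. The base constant and the plan_node_id name below are assumptions made up for this illustration, not the committed design:

```c
#include <stdint.h>

/*
 * Illustrative sketch only: derive a distinct TOC key from a per-node
 * identifier, so several PartialSeqScan nodes can each store their own
 * state in one DSM segment.  PARALLEL_KEY_NODE_BASE is a hypothetical
 * constant chosen not to collide with the fixed keys.
 */
#define PARALLEL_KEY_NODE_BASE  UINT64_C(0xE000000000000000)

static inline uint64_t
toc_key_for_node(int plan_node_id)
{
    return PARALLEL_KEY_NODE_BASE | (uint64_t) plan_node_id;
}
```

With keys formed this way, two PartialSeqScan nodes in the same plan tree get distinct TOC entries automatically, which is exactly what the fixed-key scheme cannot provide.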
Re: [HACKERS] [DESIGN] ParallelAppend
On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: If we pull Funnel here, I think the plan shall be as follows: Funnel -- SeqScan on rel1 -- PartialSeqScan on rel2 -- IndexScan on rel3 So if we go this route, then Funnel should have capability to execute non-parallel part of plan as well, like in this case it should be able to execute non-parallel IndexScan on rel3 as well and then it might need to distinguish between parallel and non-parallel part of plans. I think this could make Funnel node complex. It is difference from what I plan now. In the above example, Funnel node has two non-parallel aware node (rel1 and rel3) and one parallel aware node, then three PlannedStmt for each shall be put on the TOC segment. Even though the background workers pick up a PlannedStmt from the three, only one worker can pick up the PlannedStmt for rel1 and rel3, however, rel2 can be executed by multiple workers simultaneously. Okay, now I got your point, but I think the cost of execution of non-parallel node by additional worker is not small considering the communication cost and setting up an addional worker for each sub-plan (assume the case where out of 100-child nodes only few (2 or 3) nodes actually need multiple workers). It is a competition between traditional Append that takes Funnel children and the new appendable Funnel that takes parallel and non-parallel children. Probably, key factors are cpu_tuple_comm_cost, parallel_setup_cost and degree of selectivity of sub-plans. Both cases has advantage and disadvantage depending on the query, so we can never determine which is better without path consideration. Sure, that is what we should do, however the tricky part would be when the path for doing local scan is extremely cheaper than path for parallel scan for one of the child nodes. For such cases, pulling up Funnel-node can incur more cost. 
I think some of the other possible ways to make this work could be to extend Funnel so that it is capable of executing both parallel and non-parallel nodes, have a new Funnel like node which has such a capability. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I am not sure, but what problem do you see in putting Funnel node for one of the relation scans and not for the others. At this moment, I'm not certain whether background worker can/ought to launch another background workers. If sub-Funnel node is executed by 10-processes then it also launch 10-processes for each, 100-processes will run for each? Yes, that could be more work than current, but what I had in mind is not that way, rather I was thinking that master backend will only kick of workers for Funnel nodes in plan. I agree with, it is fair enough approach, so I mention about pull-up of Funnel node. If we pull Funnel here, I think the plan shall be as follows: Funnel -- SeqScan on rel1 -- PartialSeqScan on rel2 -- IndexScan on rel3 So if we go this route, then Funnel should have capability to execute non-parallel part of plan as well, like in this case it should be able to execute non-parallel IndexScan on rel3 as well and then it might need to distinguish between parallel and non-parallel part of plans. I think this could make Funnel node complex. It is difference from what I plan now. In the above example, Funnel node has two non-parallel aware node (rel1 and rel3) and one parallel aware node, then three PlannedStmt for each shall be put on the TOC segment. Even though the background workers pick up a PlannedStmt from the three, only one worker can pick up the PlannedStmt for rel1 and rel3, however, rel2 can be executed by multiple workers simultaneously. Okay, now I got your point, but I think the cost of execution of non-parallel node by additional worker is not small considering the communication cost and setting up an addional worker for each sub-plan (assume the case where out of 100-child nodes only few (2 or 3) nodes actually need multiple workers). 
It is a competition between traditional Append that takes Funnel children and the new appendable Funnel that takes parallel and non-parallel children. Probably, key factors are cpu_tuple_comm_cost, parallel_setup_cost and degree of selectivity of sub-plans. Both cases has advantage and disadvantage depending on the query, so we can never determine which is better without path consideration. I think for a particular PlannedStmt, number of workers must have been decided before start of execution, so if those many workers are available to work on that particular PlannedStmt, then next/new worker should work on next PlannedStmt. My concern about pre-determined number of workers is, it depends on the run-time circumstances of concurrent sessions. Even if planner wants to assign 10-workers on a particular sub-plan, only 4-workers may be available on the run-time because of consumption by side sessions. So, I expect only maximum number of workers is meaningful configuration. In that case, there is possibility that many of the workers are just working on one or two of the nodes and other nodes execution might get starved. I understand this is tricky problem to allocate number of workers for different nodes, however we should try to develop any algorithm where there is some degree of fairness in allocation of workers for different nodes. I like to agree, however, I also want to keep the first version as simple as possible we can. We can develop alternative logic to assign suitable number of workers later. 2. Execution of work by workers and Funnel node and then pass the results back to upper node. I think this needs some more work in addition to ParallelSeqScan patch. I expect we can utilize existing infrastructure here. It just picks up the records come from the underlying workers, then raise it to the upper node. 
Sure, but you still need some work, at least in the area of making workers understand different node types; I am guessing you need to modify readfuncs.c to support any new plan nodes for this work. Yes, it was not creative work. :-) https://github.com/kaigai/sepgsql/blob/fappend/src/backend/nodes/readfuncs.c#L1479 Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I am not sure, but what problem do you see in putting Funnel node for one of the relation scans and not for the others. At this moment, I'm not certain whether background worker can/ought to launch another background workers. If sub-Funnel node is executed by 10-processes then it also launch 10-processes for each, 100-processes will run for each? Yes, that could be more work than current, but what I had in mind is not that way, rather I was thinking that master backend will only kick of workers for Funnel nodes in plan. If we pull Funnel here, I think the plan shall be as follows: Funnel -- SeqScan on rel1 -- PartialSeqScan on rel2 -- IndexScan on rel3 So if we go this route, then Funnel should have capability to execute non-parallel part of plan as well, like in this case it should be able to execute non-parallel IndexScan on rel3 as well and then it might need to distinguish between parallel and non-parallel part of plans. I think this could make Funnel node complex. It is difference from what I plan now. In the above example, Funnel node has two non-parallel aware node (rel1 and rel3) and one parallel aware node, then three PlannedStmt for each shall be put on the TOC segment. Even though the background workers pick up a PlannedStmt from the three, only one worker can pick up the PlannedStmt for rel1 and rel3, however, rel2 can be executed by multiple workers simultaneously. Okay, now I got your point, but I think the cost of execution of non-parallel node by additional worker is not small considering the communication cost and setting up an addional worker for each sub-plan (assume the case where out of 100-child nodes only few (2 or 3) nodes actually need multiple workers). 
I think for a particular PlannedStmt, number of workers must have been decided before start of execution, so if those many workers are available to work on that particular PlannedStmt, then next/new worker should work on next PlannedStmt. My concern about pre-determined number of workers is, it depends on the run-time circumstances of concurrent sessions. Even if planner wants to assign 10-workers on a particular sub-plan, only 4-workers may be available on the run-time because of consumption by side sessions. So, I expect only maximum number of workers is meaningful configuration. In that case, there is possibility that many of the workers are just working on one or two of the nodes and other nodes execution might get starved. I understand this is tricky problem to allocate number of workers for different nodes, however we should try to develop any algorithm where there is some degree of fairness in allocation of workers for different nodes. 2. Execution of work by workers and Funnel node and then pass the results back to upper node. I think this needs some more work in addition to ParallelSeqScan patch. I expect we can utilize existing infrastructure here. It just picks up the records come from the underlying workers, then raise it to the upper node. Sure, but still you need some work atleast in the area of making workers understand different node types, I am guessing you need to modify readfuncs.c to support new plan node if any for this work. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai Sent: Monday, July 27, 2015 11:07 PM To: Amit Kapila Is there a real need to have new node like ParallelAppendPath? Can't we have Funnel node beneath AppendNode and then each worker will be responsible to have SeqScan on each inherited child relation. Something like Append --- Funnel -- SeqScan rel1 -- SeqScan rel2 If Funnel can handle both of horizontal and vertical parallelism, it is a great simplification. I never stick a new node. Once Funnel get a capability to have multiple child nodes, probably, Append node above will have gone. I expect set_append_rel_pathlist() add two paths based on Append and Funnel, then planner will choose the cheaper one according to its cost. In the latest v16 patch, Funnel is declared as follows: typedef struct Funnel { Scanscan; int num_workers; } Funnel; If we try to add Append capability here, I expects the structure will be adjusted as follows, for example: typedef struct Funnel { Scanscan; List *funnel_plans; List *funnel_num_workers; } Funnel; As literal, funnel_plans saves underlying Plan nodes instead of the lefttree. Also, funnel_num_workers saves number of expected workers to be assigned on individual child plans. or shall we have a node like above and name it as FunnelAppend or AppenFunnel? It is better to have smaller number of node types which are capable to kick background workers because of simplification of path construction. Let's assume the case below. When planner considers a path to append child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is Funnel+PartialSeqScan, we cannot put Funnel here unless we don't pull up Funnel of rel2, can we? (Append? 
or Funnel) -- SeqScan on rel1 -- Funnel -- PartialSeqScan on rel2 -- IndexScan on rel3 I am not sure, but what problem do you see in putting Funnel node for one of the relation scans and not for the others. At this moment, I'm not certain whether background worker can/ought to launch another background workers. If sub-Funnel node is executed by 10-processes then it also launch 10-processes for each, 100-processes will run for each? If we pull Funnel here, I think the plan shall be as follows: Funnel -- SeqScan on rel1 -- PartialSeqScan on rel2 -- IndexScan on rel3 So if we go this route, then Funnel should have capability to execute non-parallel part of plan as well, like in this case it should be able to execute non-parallel IndexScan on rel3 as well and then it might need to distinguish between parallel and non-parallel part of plans. I think this could make Funnel node complex. It is difference from what I plan now. In the above example, Funnel node has two non-parallel aware node (rel1 and rel3) and one parallel aware node, then three PlannedStmt for each shall be put on the TOC segment. Even though the background workers pick up a PlannedStmt from the three, only one worker can pick up the PlannedStmt for rel1 and rel3, however, rel2 can be executed by multiple workers simultaneously. (Note: if number of workers are less than three in this case, PlannedStmt for rel3 shall not be picked up unless any other worker don't complete to run other plan on rel1 or rel2). From the standpoint of the Funnel, it just kicks background workers with: - multiple PlannedStmt nodes - maximum number of workers for each plan in addition to the current form. Then, it continues to fetch records from the shm_mq. Probably, it does not change the current form so much. If all we have to pay attention is Funnel node, it makes the code around path construction and pull-up logic much simpler, rather than multiple node types can kick background workers. 
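The scheduling KaiGai describes, where each background worker picks up an uncompleted PlannedStmt subject to a per-plan maximum number of workers, could look roughly like the toy model below. This is only a sketch: in PostgreSQL the bookkeeping would live in the DSM segment under an LWLock or spinlock rather than a pthread mutex, and all field names here are invented:

```c
#include <pthread.h>

/* One entry per sub-plan handed to the Funnel. */
typedef struct SubPlanSlot
{
    int max_workers;            /* planner's cap for this sub-plan */
    int cur_workers;            /* workers currently attached */
    int done;                   /* sub-plan fully executed? */
} SubPlanSlot;

typedef struct AppendShared
{
    pthread_mutex_t lock;
    int             nplans;
    int             next;       /* round-robin cursor */
    SubPlanSlot     plans[16];
} AppendShared;

/*
 * Return the index of the sub-plan a newly arrived worker should run,
 * round-robin over plans that are unfinished and below their cap, or -1
 * if every sub-plan is saturated or done.
 */
static int
pick_next_subplan(AppendShared *as)
{
    int result = -1;

    pthread_mutex_lock(&as->lock);
    for (int i = 0; i < as->nplans; i++)
    {
        int          j = (as->next + i) % as->nplans;
        SubPlanSlot *s = &as->plans[j];

        if (!s->done && s->cur_workers < s->max_workers)
        {
            s->cur_workers++;
            as->next = (j + 1) % as->nplans;
            result = j;
            break;
        }
    }
    pthread_mutex_unlock(&as->lock);
    return result;
}
```

Note how a non-parallel sub-plan is simply one with max_workers = 1: only one worker can ever pick it up, while a PartialSeqScan with a larger cap can be picked up repeatedly.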
Okay, but I think pulling up the Funnel node makes sense only when all nodes beneath it need to be executed in parallel. I think its decision should be based on cost, including the additional startup_cost to launch background workers, as long as a non-parallel node is also capable of running on the worker side. Even though create_parallelscan_paths() in v16 sets num_workers no larger than parallel_seqscan_degree, the total number of concurrent background workers may exceed this configuration if more than two PartialSeqScan nodes are underlying. It is a different configuration from max_worker_processes, so it is not a problem as long as we have another restriction.
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai Sent: Monday, July 27, 2015 11:07 PM To: Amit Kapila Is there a real need to have new node like ParallelAppendPath? Can't we have Funnel node beneath AppendNode and then each worker will be responsible to have SeqScan on each inherited child relation. Something like Append --- Funnel -- SeqScan rel1 -- SeqScan rel2 If Funnel can handle both of horizontal and vertical parallelism, it is a great simplification. I never stick a new node. Once Funnel get a capability to have multiple child nodes, probably, Append node above will have gone. I expect set_append_rel_pathlist() add two paths based on Append and Funnel, then planner will choose the cheaper one according to its cost. In the latest v16 patch, Funnel is declared as follows: typedef struct Funnel { Scanscan; int num_workers; } Funnel; If we try to add Append capability here, I expects the structure will be adjusted as follows, for example: typedef struct Funnel { Scanscan; List *funnel_plans; List *funnel_num_workers; } Funnel; As literal, funnel_plans saves underlying Plan nodes instead of the lefttree. Also, funnel_num_workers saves number of expected workers to be assigned on individual child plans. or shall we have a node like above and name it as FunnelAppend or AppenFunnel? It is better to have smaller number of node types which are capable to kick background workers because of simplification of path construction. Let's assume the case below. When planner considers a path to append child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is Funnel+PartialSeqScan, we cannot put Funnel here unless we don't pull up Funnel of rel2, can we? (Append? 
or Funnel) -- SeqScan on rel1 -- Funnel -- PartialSeqScan on rel2 -- IndexScan on rel3 I am not sure, but what problem do you see in putting Funnel node for one of the relation scans and not for the others. If we pull Funnel here, I think the plan shall be as follows: Funnel -- SeqScan on rel1 -- PartialSeqScan on rel2 -- IndexScan on rel3 So if we go this route, then Funnel should have capability to execute non-parallel part of plan as well, like in this case it should be able to execute non-parallel IndexScan on rel3 as well and then it might need to distinguish between parallel and non-parallel part of plans. I think this could make Funnel node complex. If all we have to pay attention is Funnel node, it makes the code around path construction and pull-up logic much simpler, rather than multiple node types can kick background workers. Okay, but I think pulling-up Funnel node makes sense only when all nodes beneath it needs to be executed parallely. Even though create_parallelscan_paths() in v16 set num_workers not larger than parallel_seqscan_degree, total number of the concurrent background workers may exceed this configuration if more than two PartialSeqScan nodes are underlying. It is a different configuration from max_worker_processes, so it is not a matter as long as we have another restriction. However, how do we control the cap of number of worker processes per appendable Funnel node? For example, if a parent table has 200 child tables but max_worker_processes are configured to 50. It is obviously impossible to launch all the background workers simultaneously. One idea I have is to suspend launch of some plans until earlier ones are completed. Okay, but I think in that idea you need to re-launch the workers again for new set of relation scan's which could turn out to be costly, how about designing some way where workers after completing their assigned work check for new set of task/'s (which in this case would be to scan a new) and then execute the same. 
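One simple answer to the question above of capping the workers of an appendable Funnel is to clamp the sum of the per-child worker requests to a node-level limit. The policy and names here are a hypothetical sketch, since the actual rule was still under discussion in the thread:

```c
/*
 * Illustrative sketch: the Funnel requests the sum of its children's
 * worker estimates, clamped to a node-level cap (e.g. derived from
 * max_worker_processes).  Not the actual planner logic.
 */
static int
funnel_num_workers(const int *child_workers, int nchildren, int cap)
{
    int total = 0;

    for (int i = 0; i < nchildren; i++)
        total += child_workers[i];
    return total < cap ? total : cap;
}
```

Under such a clamp, a parent with 200 children never tries to launch 200 workers at once; the launched workers instead migrate to remaining sub-plans as earlier ones complete.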
I think in this way we can achieve dynamic allocation of work and achieve maximum parallelism with available set of workers. We have achieved this in ParallelSeqScan by scanning at block level, once a worker finishes a block, it checks for new block to scan. Is it possible to put multiple PlannedStmt on TOC, isn't it? Yes, I don't see any problem in doing that way. So here for each different (child) relation, you want to create a separate PlannedStmt or do you have something else in mind? If background worker picks up an uncompleted PlannedStmt first (based on round-robin likely?), it may achieve the maximum parallelism. I think this can work well for
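The block-level distribution Amit mentions, where a worker that finishes one block simply asks for the next, boils down to a shared counter. A minimal sketch follows; the real ParallelHeapScanDescData also handles synchronized-scan start points and lives in shared memory, which this toy version omits:

```c
#include <stdatomic.h>

/* Illustrative stand-in for a shared parallel heap scan descriptor. */
typedef struct ParallelScanState
{
    atomic_uint next_block;     /* next block number to hand out */
    unsigned    nblocks;        /* total blocks in the relation */
} ParallelScanState;

#define InvalidBlock ((unsigned) -1)

/* Atomically claim the next block, or InvalidBlock when the scan is done. */
static unsigned
next_block_to_scan(ParallelScanState *ps)
{
    unsigned blk = atomic_fetch_add(&ps->next_block, 1);

    return (blk < ps->nblocks) ? blk : InvalidBlock;
}
```

Because the claim is a single fetch-and-add, work is distributed dynamically with no fixed per-worker assignment, which is the property Amit wants to carry over to scanning a set of relations.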
Re: [HACKERS] [DESIGN] ParallelAppend
On 27 July 2015 at 21:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, can I ask some questions? I suppose we can take this as the analog of ParallelSeqScan. I can see little distinction between Append(ParallelSeqScan) and ParallelAppend(SeqScan). What difference is there between them? If other nodes will have the same functionality as you mention at the end of this proposal, it might be better for some part of this feature to be implemented as part of the existing executor itself, rather than as a dedicated additional node, just as my asynchronous FDW execution patch partially does. (Although it lacks the planner part and bgworker launching..) If that is the case, it might be better for ExecProcNode to be modified so that it supports both in-process and inter-bgworker cases through a single API. What do you think about this? I have to say that I really like the thought of us having parallel-enabled stuff in Postgres, but I also have to say that I don't think inventing all these special parallel node types is a good idea. If we think about everything that we can parallelise... perhaps sort, hash join, seq scan, hash, bitmap heap scan, nested loop. I don't want to debate that, but perhaps there's more, perhaps less. Are we really going to duplicate all of that code and add in the parallel stuff as new node types? My other concern here is that I seldom hear people talk about the planner's architectural lack of ability to make a good choice about how many parallel workers to choose. Surely to properly calculate costs you need to know the exact number of parallel workers that will be available at execution time, but you need to know this at planning time!? I can't see how this works, apart from just being very conservative about parallel workers, which I think is really bad, as many databases have busy times in the day and also quiet times; generally quiet time is when large batch stuff gets done, and that's the time that parallel stuff is likely most useful. 
Remember queries are not always planned just before they're executed. We could have a PREPAREd query, or we could have better plan caching in the future, or if we build some intelligence into the planner to choose a good number of workers based on the current server load, then what's to say that the server will be under this load at exec time? If we plan during a quiet time, and exec in a busy time all hell may break loose. I really do think that existing nodes should just be initialized in a parallel mode, and each node type can have a function to state if it supports parallelism or not. I'd really like to hear more opinions in the ideas I discussed here: http://www.postgresql.org/message-id/CAApHDvp2STf0=pqfpq+e7wa4qdympfm5qu_ytupe7r0jlnh...@mail.gmail.com This design makes use of the Funnel node that Amit has already made and allows more than 1 node to be executed in parallel at once. It appears that parallel enabling the executor node by node is fundamentally locked into just 1 node being executed in parallel, then perhaps a Funnel node gathering up the parallel worker buffers and streaming those back in serial mode. I believe by design, this does not permit a whole plan branch from executing in parallel and I really feel like doing things this way is going to be very hard to undo and improve later. I might be too stupid to figure it out, but how would parallel hash join work if it can't gather tuples from the inner and outer nodes in parallel? Sorry for the rant, but I just feel like we're painting ourselves into a corner by parallel enabling the executor node by node. Apologies if I've completely misunderstood things. Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Jul 28, 2015 at 12:59 PM, David Rowley david.row...@2ndquadrant.com wrote: On 27 July 2015 at 21:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hello, can I ask some questions? I suppose we can take this as the analog of ParalleSeqScan. I can see not so distinction between Append(ParalleSeqScan) and ParallelAppend(SeqScan). What difference is there between them? If other nodes will have the same functionality as you mention at the last of this proposal, it might be better that some part of this feature is implemented as a part of existing executor itself, but not as a deidicated additional node, just as my asynchronous fdw execution patch patially does. (Although it lacks planner part and bg worker launching..) If that is the case, it might be better that ExecProcNode is modified so that it supports both in-process and inter-bgworker cases by the single API. What do you think about this? I have to say that I really like the thought of us having parallel enabled stuff in Postgres, but I also have to say that I don't think inventing all these special parallel node types is a good idea. If we think about everything that we can parallelise... Perhaps sort, hash join, seqscan, hash, bitmap heap scan, nested loop. I don't want to debate that, but perhaps there's more, perhaps less. Are we really going to duplicate all of the code and add in the parallel stuff as new node types? My other concern here is that I seldom hear people talk about the planner's architectural lack of ability to make a good choice about how many parallel workers to choose. Surely to properly calculate costs you need to know the exact number of parallel workers that will be available at execution time, but you need to know this at planning time!? 
I can't see how this works, apart from just being very conservative about parallel workers, which I think is really bad, as many databases have busy times in the day, and also quiet times, generally quiet time is when large batch stuff gets done, and that's the time that parallel stuff is likely most useful. Remember queries are not always planned just before they're executed. We could have a PREPAREd query, or we could have better plan caching in the future, or if we build some intelligence into the planner to choose a good number of workers based on the current server load, then what's to say that the server will be under this load at exec time? If we plan during a quiet time, and exec in a busy time all hell may break loose. I really do think that existing nodes should just be initialized in a parallel mode, and each node type can have a function to state if it supports parallelism or not. I'd really like to hear more opinions in the ideas I discussed here: http://www.postgresql.org/message-id/CAApHDvp2STf0=pqfpq+e7wa4qdympfm5qu_ytupe7r0jlnh...@mail.gmail.com This design makes use of the Funnel node that Amit has already made and allows more than 1 node to be executed in parallel at once. It appears that parallel enabling the executor node by node is fundamentally locked into just 1 node being executed in parallel, then perhaps a Funnel node gathering up the parallel worker buffers and streaming those back in serial mode. I believe by design, this does not permit a whole plan branch from executing in parallel and I really feel like doing things this way is going to be very hard to undo and improve later. I might be too stupid to figure it out, but how would parallel hash join work if it can't gather tuples from the inner and outer nodes in parallel? Sorry for the rant, but I just feel like we're painting ourselves into a corner by parallel enabling the executor node by node. Apologies if I've completely misunderstood things. +1, well articulated. 
-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] [DESIGN] ParallelAppend
On 2015-07-29 AM 11:02, Kouhei Kaigai wrote: ... synchronous Append path vs. parallel asynchronous Append with Funnel (below/above?) it. I guess the asynchronous version would always be cheaper. So, even if we end up with non-parallel sub-plans, do we still add a Funnel to make Append asynchronous? Am I missing something? I expect Funnel itself will get Append capability but run sub-plans in background workers, to simplify path construction. So, if a Funnel with multiple sub-plans has a cheaper cost than Append, it will replace the AppendPath with a FunnelPath. Regarding the cost estimation, I don't think the parallel version is always cheaper than the traditional Append, because of the cost to launch background workers. It increases the startup cost to process the relation; thus, if the upper node prefers a small startup cost (like Limit), the traditional Append still has advantages. Right, I almost forgot about the start-up cost. Thanks, Amit
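The point about Limit can be made concrete with a back-of-the-envelope comparison. The cost model and all constants below are made up for illustration and are not the real costing code: fetching the first k tuples from a path costs roughly startup_cost + k * per_tuple_cost, so a large parallel setup cost dominates when k is small and amortizes away when k is large.

```c
/*
 * Toy cost model, illustrative only: estimated cost of returning the
 * first k tuples from a path with the given startup and per-tuple costs.
 */
static double
cost_first_k(double startup_cost, double per_tuple_cost, double k)
{
    return startup_cost + k * per_tuple_cost;
}
```

With an assumed worker-launch startup of 1000 cost units, the plain Append wins for a Limit of 10 tuples even if its per-tuple cost is five times higher, while the parallel path wins once millions of tuples are fetched.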
Re: [HACKERS] [DESIGN] ParallelAppend
On 2015-07-28 PM 09:58, Kouhei Kaigai wrote:

From my understanding of the parallel seqscan patch, each worker's PartialSeqScan asks for a block to scan using a shared parallel heap scan descriptor that effectively keeps track of the division of work among PartialSeqScans in terms of blocks.

What if we invent a PartialAppend which each worker would run in case of a parallelized Append? It would use some kind of shared descriptor to pick a relation (Append member) to scan. The shared structure could be the list of subplans, including a mutex for concurrency. It doesn't sound as effective as the proposed ParallelHeapScanDescData is for PartialSeqScan, but anything more granular might be complicated. For example, consider a (current_relation, current_block) pair. If there are more workers than subplans/partitions, then multiple workers might start working on the same relation after a round-robin assignment of relations (but of course, a later worker would start scanning from a later block in the same relation). I imagine that might help with parallelism across volumes, if that's the case.

I initially thought ParallelAppend would kick a fixed number of background workers towards the sub-plans, according to the costs estimated at planning time. However, I'm now inclined to have a background worker pick up an uncompleted PlannedStmt first. (For more details, please see the reply to Amit Kapila.) It is a less fine-grained distribution of the workers' jobs. Once the number of workers gets larger than the number of volumes / partitions, more than two workers begin to be assigned to the same PartialSeqScan, at which point the fine-grained job distribution using the shared parallel heap scan takes over.

I like your idea of round-robin assignment of partial/non-partial sub-plans to workers. Do you think there are two considerations of cost here? Sub-plans themselves could have parallel paths to consider, and (I think) your proposal introduces a new consideration - a plain old synchronous Append path vs. a parallel asynchronous Append with a Funnel (below/above?) it. I guess the asynchronous version would always be cheaper. So, even if we end up with non-parallel sub-plans, do we still add a Funnel to make the Append asynchronous? Am I missing something?

I expect Funnel itself will get the Append capability and run its sub-plans in background workers, to simplify path construction. So, if a Funnel with multiple sub-plans has a cheaper cost than an Append, it will replace the AppendPath with a FunnelPath.

Regarding the cost estimation, I don't think the parallel version is always cheaper than the traditional Append, because of the cost to launch background workers. It increases the startup cost to process the relation; thus, if the upper node prefers a small startup cost (like Limit), the traditional Append still has an advantage.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [DESIGN] ParallelAppend
KaiGai-san,

On 2015-07-28 PM 09:58, Kouhei Kaigai wrote:

From my understanding of the parallel seqscan patch, each worker's PartialSeqScan asks for a block to scan using a shared parallel heap scan descriptor that effectively keeps track of the division of work among PartialSeqScans in terms of blocks.

What if we invent a PartialAppend which each worker would run in case of a parallelized Append? It would use some kind of shared descriptor to pick a relation (Append member) to scan. The shared structure could be the list of subplans, including a mutex for concurrency. It doesn't sound as effective as the proposed ParallelHeapScanDescData is for PartialSeqScan, but anything more granular might be complicated. For example, consider a (current_relation, current_block) pair. If there are more workers than subplans/partitions, then multiple workers might start working on the same relation after a round-robin assignment of relations (but of course, a later worker would start scanning from a later block in the same relation). I imagine that might help with parallelism across volumes, if that's the case.

I initially thought ParallelAppend would kick a fixed number of background workers towards the sub-plans, according to the costs estimated at planning time. However, I'm now inclined to have a background worker pick up an uncompleted PlannedStmt first. (For more details, please see the reply to Amit Kapila.) It is a less fine-grained distribution of the workers' jobs. Once the number of workers gets larger than the number of volumes / partitions, more than two workers begin to be assigned to the same PartialSeqScan, at which point the fine-grained job distribution using the shared parallel heap scan takes over.

I like your idea of round-robin assignment of partial/non-partial sub-plans to workers. Do you think there are two considerations of cost here? Sub-plans themselves could have parallel paths to consider, and (I think) your proposal introduces a new consideration - a plain old synchronous Append path vs. a parallel asynchronous Append with a Funnel (below/above?) it. I guess the asynchronous version would always be cheaper. So, even if we end up with non-parallel sub-plans, do we still add a Funnel to make the Append asynchronous? Am I missing something?

MergeAppend parallelization might involve a bit more complication, but it may be feasible with a PartialMergeAppend using a slightly different kind of coordination among workers. What do you think of such an approach?

Do we need to have something special in ParallelMergeAppend? If the individual child nodes are designed to return sorted results, what we have to do seems to me the same.

Sorry, I was wrongly worried, because I did not really know that MergeAppend uses a binaryheap to store tuples before returning.

Thanks,
Amit
Re: [HACKERS] [DESIGN] ParallelAppend
On 27 July 2015 at 21:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:

Hello, can I ask some questions?

I suppose we can take this as the analog of ParallelSeqScan. I can see not so much distinction between Append(ParallelSeqScan) and ParallelAppend(SeqScan). What difference is there between them?

If other nodes will have the same functionality as you mention at the end of this proposal, it might be better for some part of this feature to be implemented as a part of the existing executor itself, and not as a dedicated additional node, just as my asynchronous FDW execution patch partially does. (Although it lacks the planner part and bgworker launching.) If that is the case, it might be better to modify ExecProcNode so that it supports both the in-process and inter-bgworker cases with a single API. What do you think about this?

I have to say that I really like the thought of us having parallel-enabled stuff in Postgres, but I also have to say that I don't think inventing all these special parallel node types is a good idea. If we think about everything that we can parallelise... perhaps sort, hash join, seqscan, hash, bitmap heap scan, nested loop. I don't want to debate that, but perhaps there's more, perhaps less. Are we really going to duplicate all of that code and add in the parallel stuff as new node types?

My other concern here is that I seldom hear people talk about the planner's architectural lack of ability to make a good choice about how many parallel workers to choose. Surely to properly calculate costs you need to know the exact number of parallel workers that will be available at execution time, but you need to know this at planning time!? I can't see how this works, apart from just being very conservative about parallel workers, which I think is really bad, as many databases have busy times in the day and also quiet times; generally the quiet time is when the large batch stuff gets done, and that's the time that parallel stuff is likely most useful.

Remember, queries are not always planned just before they're executed. We could have a PREPAREd query, or we could have better plan caching in the future; or if we build some intelligence into the planner to choose a good number of workers based on the current server load, then what's to say that the server will be under the same load at exec time? If we plan during a quiet time and exec in a busy time, all hell may break loose.

Even though it is not easy to estimate the available workers at planning time, it might be possible to define a target number of workers to run. If Funnel cannot get enough workers to reach the target, my preference is to tell the other workers (via a scoreboard?) not to pick up the next PlannedStmt and to exit when another Funnel cannot launch enough workers.

I really do think that existing nodes should just be initialized in a parallel mode, and each node type can have a function to state if it supports parallelism or not. I'd really like to hear more opinions on the ideas I discussed here: http://www.postgresql.org/message-id/CAApHDvp2STf0=pQfpq+e7WA4QdYmpFM5qu_YtU pe7r0jlnh...@mail.gmail.com

This design makes use of the Funnel node that Amit has already made and allows more than one node to be executed in parallel at once. It appears that parallel-enabling the executor node by node is fundamentally locked into just one node being executed in parallel, then perhaps a Funnel node gathering up the parallel worker buffers and streaming those back in serial mode. I believe by design this does not permit a whole plan branch to execute in parallel, and I really feel like doing things this way is going to be very hard to undo and improve later. I might be too stupid to figure it out, but how would a parallel hash join work if it can't gather tuples from the inner and outer nodes in parallel?

Hash-Join and Nest-Loop should not have a PartialSeqScan on the inner side, but the outer side can be a PartialSeqScan under the Funnel node. In the case of Hash-Join, the SeqScan on the inner side loads all the tuples (*1) into the hash table once; then the records coming from the outer side are combined with the hash table. Even though the inner side is read redundantly, the advantage of the parallel join will win as long as the inner side is small enough; this assumption holds for the usual pairing of (small) master tables with a (big) fact table.

(*1) Our colleague is now working on this feature. It enables dropping unnecessary rows under partitioned tables, so we may not need to have the entire hash table in each background worker. http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f6...@bpxm15gp.gisp.nec.co.jp

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] [DESIGN] ParallelAppend
KaiGai-san,

On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:

Append
 --> Funnel
      --> PartialSeqScan on rel1 (num_workers = 4)
 --> Funnel
      --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3

shall be rewritten to

Funnel
 --> PartialSeqScan on rel1 (num_workers = 4)
 --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3 (num_workers = 1)

In the rewritten plan, are the respective scans (PartialSeq or Seq) on rel1, rel2 and rel3 asynchronous w.r.t. each other? Or does each one wait for the earlier one to finish? I would think the answer is no, because then it would not be different from the former case, right? The original premise seems to be that the partitions rel1, rel2 and rel3 may be on different volumes, so parallelism across volumes seems like a goal of parallelizing Append.

From my understanding of the parallel seqscan patch, each worker's PartialSeqScan asks for a block to scan using a shared parallel heap scan descriptor that effectively keeps track of the division of work among PartialSeqScans in terms of blocks.

What if we invent a PartialAppend which each worker would run in case of a parallelized Append? It would use some kind of shared descriptor to pick a relation (Append member) to scan. The shared structure could be the list of subplans, including a mutex for concurrency. It doesn't sound as effective as the proposed ParallelHeapScanDescData is for PartialSeqScan, but anything more granular might be complicated. For example, consider a (current_relation, current_block) pair. If there are more workers than subplans/partitions, then multiple workers might start working on the same relation after a round-robin assignment of relations (but of course, a later worker would start scanning from a later block in the same relation). I imagine that might help with parallelism across volumes, if that's the case.

I initially thought ParallelAppend would kick a fixed number of background workers towards the sub-plans, according to the costs estimated at planning time. However, I'm now inclined to have a background worker pick up an uncompleted PlannedStmt first. (For more details, please see the reply to Amit Kapila.) It is a less fine-grained distribution of the workers' jobs. Once the number of workers gets larger than the number of volumes / partitions, more than two workers begin to be assigned to the same PartialSeqScan, at which point the fine-grained job distribution using the shared parallel heap scan takes over.

MergeAppend parallelization might involve a bit more complication, but it may be feasible with a PartialMergeAppend using a slightly different kind of coordination among workers. What do you think of such an approach?

Do we need to have something special in ParallelMergeAppend? If the individual child nodes are designed to return sorted results, what we have to do seems to me the same.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

Is there a real need to have a new node like ParallelAppendPath? Can't we have a Funnel node beneath the Append node, and then each worker will be responsible for a SeqScan on each inherited child relation? Something like:

Append
 --> Funnel
      --> SeqScan rel1
      --> SeqScan rel2

If Funnel can handle both horizontal and vertical parallelism, it is a great simplification. I never stick to a new node. Once Funnel gets the capability to have multiple child nodes, the Append node above will probably be gone. I expect set_append_rel_pathlist() to add two paths based on Append and Funnel; then the planner will choose the cheaper one according to its cost.

In the latest v16 patch, Funnel is declared as follows:

typedef struct Funnel
{
    Scan        scan;
    int         num_workers;
} Funnel;

If we try to add the Append capability here, I expect the structure will be adjusted as follows, for example:

typedef struct Funnel
{
    Scan        scan;
    List       *funnel_plans;
    List       *funnel_num_workers;
} Funnel;

As the names suggest, funnel_plans saves the underlying Plan nodes instead of the lefttree, and funnel_num_workers saves the number of workers expected to be assigned to the individual child plans.

Or shall we have a node like the above and name it FunnelAppend or AppendFunnel?

It is better to have a smaller number of node types capable of kicking background workers, for simplification of path construction. Let's assume the case below. When the planner considers a path to append child scans on rel1, rel2 and rel3, but the cheapest path of rel2 is Funnel+PartialSeqScan, we cannot put a Funnel here unless we pull up the Funnel of rel2, can we?

(Append? or Funnel)
 --> SeqScan on rel1
 --> Funnel
      --> PartialSeqScan on rel2
 --> IndexScan on rel3

If we pull up the Funnel here, I think the plan shall be as follows:

Funnel
 --> SeqScan on rel1
 --> PartialSeqScan on rel2
 --> IndexScan on rel3

If all we have to pay attention to is the Funnel node, it makes the code around path construction and the pull-up logic much simpler than having multiple node types that can kick background workers.

Even though create_parallelscan_paths() in v16 sets num_workers no larger than parallel_seqscan_degree, the total number of concurrent background workers may exceed this configuration if more than two PartialSeqScan nodes are underlying. It is a different configuration from max_worker_processes, so it is not a problem as long as we have another restriction. However, how do we control the cap on the number of worker processes per appendable Funnel node? For example, if a parent table has 200 child tables but max_worker_processes is configured to 50, it is obviously impossible to launch all the background workers simultaneously. One idea I have is to suspend the launch of some plans until earlier ones are completed.

Okay, but I think in that idea you need to re-launch the workers again for the new set of relation scans, which could turn out to be costly. How about designing some way where workers, after completing their assigned work, check for a new set of tasks (which in this case would be to scan a new relation) and then execute the same? I think in this way we can achieve dynamic allocation of work and achieve maximum parallelism with the available set of workers. We have achieved this in ParallelSeqScan by scanning at the block level: once a worker finishes a block, it checks for a new block to scan.

It is possible to put multiple PlannedStmts on the TOC, isn't it? If a background worker picks up an uncompleted PlannedStmt first (based on round-robin, likely?), it may achieve the maximum parallelism. Yep, it seems to me a good idea which I want to try.

If (num of workers) > (num of sub-plans), some of the sub-plans can have multiple workers from the beginning; then other workers also help to execute the heavy plans later. It may be better to put the PlannedStmts in order of total_cost, to bias multi-worker execution from the beginning.

TODO: Even if a heavy query occupies most of the available worker slots, another session may want to use parallel execution later, during the execution of the primary query. We may need to have a 'scoreboard' on shared memory to know how many workers are potentially needed and how many are overused by somebody. If someone overconsumed background workers, it should exit first, rather than picking up the next PlannedStmt.

We will need to pay attention to other issues when Funnel kicks background workers towards asymmetric relations. If the numbers of rows of the individual child nodes vary,
Re: [HACKERS] [DESIGN] ParallelAppend
KaiGai-san,

On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:

Append
 --> Funnel
      --> PartialSeqScan on rel1 (num_workers = 4)
 --> Funnel
      --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3

shall be rewritten to

Funnel
 --> PartialSeqScan on rel1 (num_workers = 4)
 --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3 (num_workers = 1)

In the rewritten plan, are the respective scans (PartialSeq or Seq) on rel1, rel2 and rel3 asynchronous w.r.t. each other? Or does each one wait for the earlier one to finish? I would think the answer is no, because then it would not be different from the former case, right? The original premise seems to be that the partitions rel1, rel2 and rel3 may be on different volumes, so parallelism across volumes seems like a goal of parallelizing Append.

From my understanding of the parallel seqscan patch, each worker's PartialSeqScan asks for a block to scan using a shared parallel heap scan descriptor that effectively keeps track of the division of work among PartialSeqScans in terms of blocks.

What if we invent a PartialAppend which each worker would run in case of a parallelized Append? It would use some kind of shared descriptor to pick a relation (Append member) to scan. The shared structure could be the list of subplans, including a mutex for concurrency. It doesn't sound as effective as the proposed ParallelHeapScanDescData is for PartialSeqScan, but anything more granular might be complicated. For example, consider a (current_relation, current_block) pair. If there are more workers than subplans/partitions, then multiple workers might start working on the same relation after a round-robin assignment of relations (but of course, a later worker would start scanning from a later block in the same relation). I imagine that might help with parallelism across volumes, if that's the case.

MergeAppend parallelization might involve a bit more complication, but it may be feasible with a PartialMergeAppend using a slightly different kind of coordination among workers.

What do you think of such an approach?

Thanks,
Amit
Re: [HACKERS] [DESIGN] ParallelAppend
On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

Hello,

I've recently been working on / investigating a ParallelAppend feature towards the next commit fest. Below is my design proposal.

1. Concept
----------
Its concept is quite simple; anybody might have considered it more than once. The ParallelAppend node kicks background worker processes to execute its child nodes in parallel / asynchronously. It intends to improve the performance of scanning large partitioned tables from the standpoint of overall throughput; latency of the first multi-hundred rows is not in the scope of this project. From the standpoint of technology trends, it primarily tries to utilize the multi-core capability within a system, but it also enables expansion to a distributed database environment using the foreign-table inheritance feature.

Its behavior is very similar to the Funnel node except for several points; thus, we can reuse the infrastructure about which we have had long-standing discussion through the v9.5 development cycle.

2. Problems to be solved
------------------------
Typical OLAP workloads involve joins over tons of tables and scans on large tables which are often partitioned, and their KPI is query response time, while only a very small number of sessions are active simultaneously. So, we are required to run a single query as rapidly as possible, even if it consumes more computing resources than typical OLTP workloads.

The current implementation of heap scanning is painful when we look at its behavior from that standpoint - how many rows we can read within a certain time - because of its synchronous manner. In the worst case, when a SeqScan node tries to fetch the next tuple, heap_getnext() looks up a block on the shared buffer, then ReadBuffer() calls the storage manager to read the target block from the filesystem if it is not in the buffer. Next, the operating system puts the calling process to sleep until the required I/O completes.

Most cases are helped at an earlier stage than the above worst case; however, the best scenario we can expect is that the next tuple already appears on top of the message queue (with visibility checks already done, of course) with no fall-down to the buffer manager or deeper. If we can run multiple scans in parallel / asynchronously, the CPU core shall be assigned to another process by the operating system; thus, it eventually improves the I/O density and enables higher processing throughput.

The Append node is an ideal point to be parallelized because:
- child nodes can have physically different locations by tablespace, so further tuning is possible according to the system landscape.
- it can control whether a subplan is actually executed on a background worker, on a per-subplan basis. If the subplans contain large tables and small tables, ParallelAppend may kick background workers to scan the large tables only, and scan the small tables by itself.
- like the Funnel node, we don't need to care about enhancement of the individual node types. SeqScan, IndexScan, ForeignScan and others can perform as usual, but actually in parallel.

3. Implementation
-----------------
* Plan Cost

ParallelAppend shall appear where Append can appear, except for the usage for dummy paths. So, I'll enhance set_append_rel_pathlist() to add both an AppendPath and a ParallelAppendPath with a cost for each.

Is there a real need to have a new node like ParallelAppendPath? Can't we have a Funnel node beneath the Append node, and then each worker will be responsible for a SeqScan on each inherited child relation? Something like:

Append
 --> Funnel
      --> SeqScan rel1
      --> SeqScan rel2

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [DESIGN] ParallelAppend
On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

Hello,

I've recently been working on / investigating a ParallelAppend feature towards the next commit fest. Below is my design proposal.

1. Concept
----------
Its concept is quite simple; anybody might have considered it more than once. The ParallelAppend node kicks background worker processes to execute its child nodes in parallel / asynchronously. It intends to improve the performance of scanning large partitioned tables from the standpoint of overall throughput; latency of the first multi-hundred rows is not in the scope of this project. From the standpoint of technology trends, it primarily tries to utilize the multi-core capability within a system, but it also enables expansion to a distributed database environment using the foreign-table inheritance feature.

Its behavior is very similar to the Funnel node except for several points; thus, we can reuse the infrastructure about which we have had long-standing discussion through the v9.5 development cycle.

2. Problems to be solved
------------------------
Typical OLAP workloads involve joins over tons of tables and scans on large tables which are often partitioned, and their KPI is query response time, while only a very small number of sessions are active simultaneously. So, we are required to run a single query as rapidly as possible, even if it consumes more computing resources than typical OLTP workloads.

The current implementation of heap scanning is painful when we look at its behavior from that standpoint - how many rows we can read within a certain time - because of its synchronous manner. In the worst case, when a SeqScan node tries to fetch the next tuple, heap_getnext() looks up a block on the shared buffer, then ReadBuffer() calls the storage manager to read the target block from the filesystem if it is not in the buffer. Next, the operating system puts the calling process to sleep until the required I/O completes.

Most cases are helped at an earlier stage than the above worst case; however, the best scenario we can expect is that the next tuple already appears on top of the message queue (with visibility checks already done, of course) with no fall-down to the buffer manager or deeper. If we can run multiple scans in parallel / asynchronously, the CPU core shall be assigned to another process by the operating system; thus, it eventually improves the I/O density and enables higher processing throughput.

The Append node is an ideal point to be parallelized because:
- child nodes can have physically different locations by tablespace, so further tuning is possible according to the system landscape.
- it can control whether a subplan is actually executed on a background worker, on a per-subplan basis. If the subplans contain large tables and small tables, ParallelAppend may kick background workers to scan the large tables only, and scan the small tables by itself.
- like the Funnel node, we don't need to care about enhancement of the individual node types. SeqScan, IndexScan, ForeignScan and others can perform as usual, but actually in parallel.

3. Implementation
-----------------
* Plan Cost

ParallelAppend shall appear where Append can appear, except for the usage for dummy paths. So, I'll enhance set_append_rel_pathlist() to add both an AppendPath and a ParallelAppendPath with a cost for each.

Is there a real need to have a new node like ParallelAppendPath? Can't we have a Funnel node beneath the Append node, and then each worker will be responsible for a SeqScan on each inherited child relation? Something like:

Append
 --> Funnel
      --> SeqScan rel1
      --> SeqScan rel2

If Funnel can handle both horizontal and vertical parallelism, it is a great simplification. I never stick to a new node. Once Funnel gets the capability to have multiple child nodes, the Append node above will probably be gone. I expect set_append_rel_pathlist() to add two paths based on Append and Funnel; then the planner will choose the cheaper one according to its cost.

We will need to pay attention to other issues when Funnel kicks background workers towards asymmetric relations. If the numbers of rows of the individual child nodes vary, we may want to assign 10 background workers to scan rel1 with PartialSeqScan. On the other hand, rel2 may have a very small number of rows, so its total_cost may be smaller than the cost to launch a worker. In this case, Funnel has child nodes to be executed asynchronously and synchronously.

If the cheapest path of a child relation is a pair of Funnel and PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a Funnel node that performs like Append needs to pull up the underlying Funnel and assign an equivalent number of workers, as follows:

Append
 --> Funnel
      --> PartialSeqScan on rel1 (num_workers = 4)
 --> Funnel
      --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3

shall be rewritten to

Funnel
 --> PartialSeqScan on rel1 (num_workers = 4)
 --
Re: [HACKERS] [DESIGN] ParallelAppend
Hello, can I ask some questions?

I suppose we can take this as the analog of ParallelSeqScan. I can see not so much distinction between Append(ParallelSeqScan) and ParallelAppend(SeqScan). What difference is there between them?

Append does not start to execute the second or later nodes until the first node reaches the end of its scan. On the other hand, ParallelAppend will kick all the child nodes (almost) simultaneously.

If other nodes will have the same functionality as you mention at the end of this proposal, it might be better for some part of this feature to be implemented as a part of the existing executor itself, and not as a dedicated additional node, just as my asynchronous FDW execution patch partially does. (Although it lacks the planner part and bgworker launching.) If that is the case, it might be better to modify ExecProcNode so that it supports both the in-process and inter-bgworker cases with a single API. What do you think about this?

Its downside is that we need to adjust all the existing nodes to follow the new executor capability. At this moment, we have 38 node types derived from Plan. I think it is not an easy job to review a patch that changes multiple dozens of files.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

regards,

Hello,

I've recently been working on / investigating a ParallelAppend feature towards the next commit fest. Below is my design proposal.

1. Concept
----------
Its concept is quite simple; anybody might have considered it more than once. The ParallelAppend node kicks background worker processes to execute its child nodes in parallel / asynchronously. It intends to improve the performance of scanning large partitioned tables from the standpoint of overall throughput; latency of the first multi-hundred rows is not in the scope of this project. From the standpoint of technology trends, it primarily tries to utilize the multi-core capability within a system, but it also enables expansion to a distributed database environment using the foreign-table inheritance feature.

Its behavior is very similar to the Funnel node except for several points; thus, we can reuse the infrastructure about which we have had long-standing discussion through the v9.5 development cycle.

2. Problems to be solved
------------------------
Typical OLAP workloads involve joins over tons of tables and scans on large tables which are often partitioned, and their KPI is query response time, while only a very small number of sessions are active simultaneously. So, we are required to run a single query as rapidly as possible, even if it consumes more computing resources than typical OLTP workloads.

The current implementation of heap scanning is painful when we look at its behavior from that standpoint - how many rows we can read within a certain time - because of its synchronous manner. In the worst case, when a SeqScan node tries to fetch the next tuple, heap_getnext() looks up a block on the shared buffer, then ReadBuffer() calls the storage manager to read the target block from the filesystem if it is not in the buffer. Next, the operating system puts the calling process to sleep until the required I/O completes.

Most cases are helped at an earlier stage than the above worst case; however, the best scenario we can expect is that the next tuple already appears on top of the message queue (with visibility checks already done, of course) with no fall-down to the buffer manager or deeper. If we can run multiple scans in parallel / asynchronously, the CPU core shall be assigned to another process by the operating system; thus, it eventually improves the I/O density and enables higher processing throughput.

The Append node is an ideal point to be parallelized because:
- child nodes can have physically different locations by tablespace, so further tuning is possible according to the system landscape.
- it can control whether a subplan is actually executed on a background worker, on a per-subplan basis. If the subplans contain large tables and small tables, ParallelAppend may kick background workers to scan the large tables only, and scan the small tables by itself.
- like the Funnel node, we don't need to care about enhancement of the individual node types. SeqScan, IndexScan, ForeignScan and others can perform as usual, but actually in parallel.

3. Implementation
-----------------
* Plan Cost

ParallelAppend shall appear where Append can appear, except for the usage for dummy paths. So, I'll enhance set_append_rel_pathlist() to add both an AppendPath and a ParallelAppendPath with a cost for each. The cost estimation logic shall take further discussion; however, I expect the logic below to estimate the cost of ParallelAppend:
1. Sum startup_cost and run_cost for each child pathnode, but distinguish according to synchronous or asynchronous. Probably, the total cost of a pathnode is less than:
(parallel_setup_cost + its total cost / parallel_append_degree
Re: [HACKERS] [DESIGN] ParallelAppend
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai Sent: Monday, July 27, 2015 11:07 PM To: Amit Kapila Cc: pgsql-hackers@postgresql.org; Robert Haas; Kyotaro HORIGUCHI Subject: Re: [HACKERS] [DESIGN] ParallelAppend On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Hello, I'm recently working/investigating on ParallelAppend feature towards the next commit fest. Below is my design proposal. 1. Concept -- Its concept is quite simple anybody might consider more than once. ParallelAppend node kicks background worker process to execute child nodes in parallel / asynchronous. It intends to improve the performance to scan a large partitioned tables from standpoint of entire throughput, however, latency of the first multi-hundred rows are not scope of this project. From standpoint of technology trend, it primarily tries to utilize multi-cores capability within a system, but also enables to expand distributed database environment using foreign-tables inheritance features. Its behavior is very similar to Funnel node except for several points, thus, we can reuse its infrastructure we have had long- standing discussion through the v9.5 development cycle. 2. Problems to be solved - Typical OLAP workloads takes tons of tables join and scan on large tables which are often partitioned, and its KPI is query response time but very small number of sessions are active simultaneously. So, we are required to run a single query as rapid as possible even if it consumes larger computing resources than typical OLTP workloads. Current implementation to scan heap is painful when we look at its behavior from the standpoint - how many rows we can read within a certain time, because of synchronous manner. 
> > In the worst case, when a SeqScan node tries to fetch the next tuple, heap_getnext() looks up a block in the shared buffers, then ReadBuffer() calls the storage manager to read the target block from the filesystem if it is not in the buffers. Next, the operating system puts the calling process to sleep until the required I/O is completed.
> > Most cases are helped at an earlier stage than the above worst case; however, the best scenario we can expect is that the next tuple already appears on top of the message queue (with visibility checks already done, of course) with no fall-down to the buffer manager or deeper.
> > If we can run multiple scans in parallel / asynchronously, the CPU core shall be assigned to another process by the operating system; thus, it eventually improves the I/O density and enables higher processing throughput.
> > The Append node is an ideal point to be parallelized because:
> > - child nodes can have physically different locations by tablespace, so further tuning is possible according to the system landscape.
> > - it can control whether a subplan is actually executed on a background worker, on a per-subplan basis. If the plan contains both large and small tables, ParallelAppend may kick background workers to scan only the large tables, while scanning the small tables itself.
> > - Like the Funnel node, we don't need to care about enhancement of individual node types. SeqScan, IndexScan, ForeignScan and others can perform as usual, but actually in parallel.
> >
> > 3. Implementation
> > --
> > * Plan Cost
> > ParallelAppend shall appear wherever Append can appear, except for the dummy usage. So, I'll enhance set_append_rel_pathlist() to add both an AppendPath and a ParallelAppendPath, with a cost for each.

> Is there a real need to have a new node like ParallelAppendPath? Can't we have a Funnel node beneath the Append node, and then each worker will be responsible for a SeqScan on each inherited child relation?
> Something like
>   Append
>    --- Funnel
>         -- SeqScan rel1
>         -- SeqScan rel2

If Funnel can handle both horizontal and vertical parallelism, it is a great simplification. I never stick to a new node.
Once Funnel gets a capability to have multiple child nodes, the Append node above will probably be gone. I expect set_append_rel_pathlist() to add two paths, based on Append and Funnel, and then the planner will choose the cheaper one according to its cost.
In the latest v16 patch, Funnel is declared as follows:

typedef struct Funnel
{
    Scan    scan;
    int     num_workers;
} Funnel;

If we try to add Append capability here, I expect the structure will be adjusted as follows, for example:

typedef struct Funnel
{
    Scan    scan;
    List   *funnel_plans;
    List   *funnel_num_workers;
} Funnel;

As the names suggest, funnel_plans saves the underlying Plan nodes instead of the lefttree. Also, funnel_num_workers saves the number of workers expected to be assigned to each individual child plan.
Even though create_parallelscan_paths() in v16 sets num_workers not larger than parallel_seqscan_degree, the total number of concurrent background workers may exceed this configuration if more than two PartialSeqScan nodes are underlying.
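For illustration only, the multi-subplan Funnel layout sketched above could be modeled as below. This is not the actual patch: it uses plain C arrays in place of PostgreSQL's List type, and the field and function names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the planner structures; the real
 * Funnel node uses List * for its children rather than plain arrays. */
typedef struct Plan
{
    double  total_cost;
} Plan;

typedef struct Funnel
{
    Plan    plan;               /* base plan node */
    int     nchildren;          /* number of underlying child plans */
    Plan  **funnel_plans;       /* child plans, replacing the lefttree */
    int    *funnel_num_workers; /* workers requested per child plan */
} Funnel;

/* Total workers this node would request if every child ran concurrently;
 * this is the quantity that may exceed parallel_seqscan_degree when more
 * than two parallel-aware children are underlying. */
static int
funnel_total_workers(const Funnel *node)
{
    int total = 0;

    for (int i = 0; i < node->nchildren; i++)
        total += node->funnel_num_workers[i];
    return total;
}
```

The point of the sketch is only that per-child worker counts add up across the append hierarchy, which is why some cap per appendable Funnel node is being discussed.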
Re: [HACKERS] [DESIGN] ParallelAppend
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> If Funnel can handle both horizontal and vertical parallelism, it is a great simplification. I never stick to a new node.
> Once Funnel gets a capability to have multiple child nodes, the Append node above will probably be gone. I expect set_append_rel_pathlist() to add two paths, based on Append and Funnel, and then the planner will choose the cheaper one according to its cost.
> In the latest v16 patch, Funnel is declared as follows:
>
> typedef struct Funnel
> {
>     Scan    scan;
>     int     num_workers;
> } Funnel;
>
> If we try to add Append capability here, I expect the structure will be adjusted as follows, for example:
>
> typedef struct Funnel
> {
>     Scan    scan;
>     List   *funnel_plans;
>     List   *funnel_num_workers;
> } Funnel;
>
> As the names suggest, funnel_plans saves the underlying Plan nodes instead of the lefttree. Also, funnel_num_workers saves the number of workers expected to be assigned to each individual child plan.

or shall we have a node like the above and name it FunnelAppend or AppendFunnel?

> Even though create_parallelscan_paths() in v16 sets num_workers not larger than parallel_seqscan_degree, the total number of concurrent background workers may exceed this configuration if more than two PartialSeqScan nodes are underlying. It is a different configuration from max_worker_processes, so it is not a matter as long as we have another restriction. However, how do we control the cap on the number of worker processes per appendable Funnel node?
> For example, if a parent table has 200 child tables but max_worker_processes is configured to 50, it is obviously impossible to launch all the background workers simultaneously. One idea I have is to suspend the launch of some plans until earlier ones are completed.

Okay, but I think with that idea you need to re-launch the workers again for each new set of relation scans, which could turn out to be costly. How about designing some way where workers, after completing their assigned work, check for a new set of tasks (which in this case would be to scan a new relation) and then execute them? I think in this way we can achieve dynamic allocation of work and achieve maximum parallelism with the available set of workers. We have achieved this in ParallelSeqScan by scanning at the block level: once a worker finishes a block, it checks for a new block to scan.

> We will need to pay attention to other issues as well when Funnel kicks background workers towards asymmetric relations. If the numbers of rows of the individual child nodes vary, we may want to assign 10 background workers to scan rel1 with PartialSeqScan. On the other hand, rel2 may have a very small number of rows, so its total_cost may be smaller than the cost to launch a worker. In this case, Funnel has child nodes to be executed both asynchronously and synchronously.

I think this might turn out to be slightly tricky; for example, how do we know, for a relation of a given size, how many workers are sufficient? Another way to look at dividing the work in this case could be in terms of chunks of blocks: once a worker finishes its current set of blocks, it should be able to get a new set of blocks to scan. So let us assume we decide on a chunk size of 32, and the total number of blocks in the whole inheritance hierarchy is 3200; then the maximum number of workers we should allocate to this scan is 100, and if we have parallel_seqscan_degree less than that, we can use that many workers and let them scan 32 blocks at a time.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
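The chunk-of-blocks allocation described in the mail above could be sketched, very roughly, as a shared atomic cursor from which each worker claims the next run of blocks. All names, the dispatch scheme, and the use of <stdatomic.h> here are illustrative assumptions, not the actual ParallelSeqScan code (which coordinates through dynamic shared memory).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CHUNK_SIZE 32           /* blocks handed out per claim */

/* Hypothetical shared state: a cursor over all blocks of the hierarchy. */
typedef struct ScanState
{
    atomic_uint next_block;     /* next unclaimed block number */
    unsigned    nblocks;        /* total blocks across all child relations */
} ScanState;

/* A worker atomically claims the next chunk; returns false once the scan
 * is exhausted, at which point the worker can simply exit. */
static bool
claim_chunk(ScanState *st, unsigned *start, unsigned *end)
{
    unsigned first = atomic_fetch_add(&st->next_block, CHUNK_SIZE);

    if (first >= st->nblocks)
        return false;
    *start = first;
    *end = (first + CHUNK_SIZE > st->nblocks) ? st->nblocks
                                              : first + CHUNK_SIZE;
    return true;
}

/* Workers beyond one-per-chunk can never hold work simultaneously, which
 * matches the 3200-blocks / 32-block-chunks -> 100-workers example. */
static unsigned
max_useful_workers(unsigned nblocks)
{
    return (nblocks + CHUNK_SIZE - 1) / CHUNK_SIZE;
}
```

Because workers pull work rather than being pre-assigned relations, no re-launch is needed when one child relation finishes: the same worker just claims the next chunk, wherever it falls in the hierarchy.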
Re: [HACKERS] [DESIGN] ParallelAppend
Hello, can I ask some questions?

I suppose we can take this as the analog of ParallelSeqScan. I can see not so much distinction between Append(ParallelSeqScan) and ParallelAppend(SeqScan). What difference is there between them?

If other nodes will have the same functionality as you mention at the end of this proposal, it might be better if some part of this feature were implemented as a part of the existing executor itself, and not as a dedicated additional node, just as my asynchronous FDW execution patch partially does. (Although it lacks the planner part and bgworker launching..)

If that is the case, it might be better if ExecProcNode were modified so that it supports both the in-process and inter-bgworker cases through a single API. What do you think about this?

regards,

> Hello,
> I'm recently working on / investigating the ParallelAppend feature towards the next commit fest. Below is my design proposal.
>
> 1. Concept
> --
> Its concept is quite simple; anybody might have considered it more than once. The ParallelAppend node kicks background worker processes to execute its child nodes in parallel / asynchronously. It intends to improve the performance of scanning large partitioned tables from the standpoint of overall throughput; latency of the first few hundred rows is not in the scope of this project.
> From the standpoint of technology trends, it primarily tries to utilize the multi-core capability within a system, but it also enables expanding to a distributed database environment using the foreign-table inheritance features.
> Its behavior is very similar to the Funnel node except for several points; thus, we can reuse the infrastructure we have had long-standing discussion about through the v9.5 development cycle.
>
> 2. Problems to be solved
> -
> Typical OLAP workloads take tons of table joins and scans on large tables which are often partitioned, and their KPI is query response time, while only a very small number of sessions are active simultaneously.
> So, we are required to run a single query as rapidly as possible even if it consumes more computing resources than typical OLTP workloads.
> The current implementation to scan a heap is painful when we look at its behavior from this standpoint - how many rows we can read within a certain time - because of its synchronous manner.
> In the worst case, when a SeqScan node tries to fetch the next tuple, heap_getnext() looks up a block in the shared buffers, then ReadBuffer() calls the storage manager to read the target block from the filesystem if it is not in the buffers. Next, the operating system puts the calling process to sleep until the required I/O is completed.
> Most cases are helped at an earlier stage than the above worst case; however, the best scenario we can expect is that the next tuple already appears on top of the message queue (with visibility checks already done, of course) with no fall-down to the buffer manager or deeper.
> If we can run multiple scans in parallel / asynchronously, the CPU core shall be assigned to another process by the operating system; thus, it eventually improves the I/O density and enables higher processing throughput.
> The Append node is an ideal point to be parallelized because:
> - child nodes can have physically different locations by tablespace, so further tuning is possible according to the system landscape.
> - it can control whether a subplan is actually executed on a background worker, on a per-subplan basis. If the plan contains both large and small tables, ParallelAppend may kick background workers to scan only the large tables, while scanning the small tables itself.
> - Like the Funnel node, we don't need to care about enhancement of individual node types. SeqScan, IndexScan, ForeignScan and others can perform as usual, but actually in parallel.
>
> 3. Implementation
> --
> * Plan Cost
> ParallelAppend shall appear wherever Append can appear, except for the dummy usage. So, I'll enhance set_append_rel_pathlist() to add both an AppendPath and a ParallelAppendPath, with a cost for each.
> Cost estimation logic shall take further discussion; however, I expect the logic below to estimate the cost for ParallelAppend.
> 1. Sum startup_cost and run_cost for each child pathnode, but distinguish according to synchronous or asynchronous execution. Probably, a pathnode whose total cost is less than (parallel_setup_cost + its total cost / parallel_append_degree + number of rows * cpu_tuple_comm_cost) is nonsense to run on a background worker.
> 2. parallel_setup_cost * (# of asynchronous nodes) is added to the sum of the startup_cost of the asynchronous nodes.
> 3. The sum of the run_cost of the asynchronous nodes is divided by parallel_append_degree, then cpu_tuple_comm_cost * (total # of rows from asynchronous nodes) is added.
> 4. Both the synchronous and asynchronous costs are added; that becomes the cost of ParallelAppend.
> Obviously, it stands on the viewpoint that says: cost reflects the response time of the underlying plan. So,
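For what it's worth, the four costing steps quoted above could be sketched as follows. The parameter names mirror the mail (parallel_setup_cost, cpu_tuple_comm_cost, parallel_append_degree), but this is only an illustration of the proposed arithmetic, not actual costsize.c code.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Simplified stand-in for a child path's cost fields. */
typedef struct ChildPath
{
    double  startup_cost;
    double  run_cost;
    double  rows;
} ChildPath;

/* Step 1: a child whose total cost is below the overhead of farming it
 * out (worker setup + divided run cost + per-row transfer) is executed
 * synchronously in the backend itself. */
static bool
worth_background_worker(const ChildPath *c, double setup_cost,
                        double comm_cost, int degree)
{
    double  total = c->startup_cost + c->run_cost;

    return total >= setup_cost + total / degree + c->rows * comm_cost;
}

static double
parallel_append_cost(const ChildPath *children, int nchildren,
                     double setup_cost, double comm_cost, int degree)
{
    double  sync_total = 0.0;
    double  async_startup = 0.0, async_run = 0.0, async_rows = 0.0;
    int     nasync = 0;

    for (int i = 0; i < nchildren; i++)
    {
        const ChildPath *c = &children[i];

        if (worth_background_worker(c, setup_cost, comm_cost, degree))
        {
            async_startup += c->startup_cost;
            async_run += c->run_cost;
            async_rows += c->rows;
            nasync++;
        }
        else
            sync_total += c->startup_cost + c->run_cost;
    }
    /* Step 2: each asynchronous child pays one worker setup. */
    async_startup += setup_cost * nasync;
    /* Step 3: asynchronous run cost is divided by the append degree,
     * plus the cost of shipping every row back through the queue. */
    async_run = async_run / degree + async_rows * comm_cost;
    /* Step 4: synchronous and asynchronous parts add up. */
    return sync_total + async_startup + async_run;
}
```

With, say, setup cost 1000, comm cost 0.1 and degree 2, a child costing 10000 with 1000 rows is handed to workers while a child costing 100 with 10 rows is kept synchronous, matching the rel1/rel2 asymmetry discussed earlier in the thread.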