Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-24 Thread Amit Kapila
On Mon, Nov 23, 2015 at 10:39 PM, Robert Haas  wrote:
>
> On Mon, Nov 23, 2015 at 7:45 AM, Amit Kapila wrote:
> > Without this patch, that 0.5 (or 50% of the leader's effort) is considered
> > for the Gather node irrespective of the number of workers or other factors,
> > but I think with the patch that is no longer true, and that's what I am
> > worrying about.
>
> Nope, that patch does not change that at all.  We probably should, but
> this patch does not.
>

I have taken some performance data with this patch.


- Select data from an inheritance hierarchy with very few tuples.


Create table parent_rel(c1 int, c2 text);
Create table child1_rel () Inherits (parent_rel);
Create table child2_rel () Inherits (parent_rel);

insert into parent_rel values(generate_series(1,15), '');
insert into child1_rel values(generate_series(10,20),'aaa');
insert into child2_rel values(generate_series(20,30),'aaa');

Analyze parent_rel;
Analyze child1_rel;
Analyze child2_rel;

set max_parallel_degree=4;
set parallel_setup_cost=0;
set parallel_tuple_cost=0.01;

postgres=# explain select count(*) from parent_rel;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=2.71..2.72 rows=1 width=0)
   ->  Gather  (cost=0.00..2.62 rows=37 width=0)
         Number of Workers: 1
         ->  Append  (cost=0.00..2.25 rows=37 width=0)
               ->  Parallel Seq Scan on parent_rel  (cost=0.00..0.77 rows=15 width=0)
               ->  Parallel Seq Scan on child1_rel  (cost=0.00..0.74 rows=11 width=0)
               ->  Parallel Seq Scan on child2_rel  (cost=0.00..0.74 rows=11 width=0)


I have changed parallel_setup_cost and parallel_tuple_cost, so it selects
the Gather path even for a small relation.  However, the same won't be true
for a non-inheritance relation: if the number of pages in the relation is
below the threshold (1000), it won't select a parallel path.  We might want
a similar restriction for an Append relation as well: if the child subpaths
combined don't have more than the threshold number of pages, then don't try
to build the parallel path, roughly as sketched below.
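
Something along these lines is what I have in mind (a rough sketch only;
the function name and its exact placement are illustrative, not taken from
the patch):

/*
 * Rough sketch, not part of the patch: skip generating a partial Append
 * path when the child relations taken together are still too small to
 * justify parallelism.  The 1000-page threshold mirrors the heuristic
 * already used for plain relations.
 */
static bool
append_rel_worth_parallel_scan(List *live_childrels)
{
	const BlockNumber parallel_threshold = 1000;
	BlockNumber total_pages = 0;
	ListCell   *lc;

	foreach(lc, live_childrels)
	{
		RelOptInfo *childrel = (RelOptInfo *) lfirst(lc);

		total_pages += childrel->pages;
	}

	return total_pages >= parallel_threshold;
}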

- Choose a data set that fits in shared_buffers, then run statements with
different selectivities and max_parallel_degree settings.

Test setup

1. Use pgbench -i -s 100 to create the initial data.
2. Use the attached pgbench_partitions.sql to create 10 partitions with
equal data.
3. Use parallel_append.sh to execute statements with different selectivities
and max_parallel_degree settings (parallel_tuple_cost changed to 0.001).

Selection criteria – 1% of rows are selected, with a costly function
evaluated for each row.



Head

max_parallel_degree    exec_time (ms)    workers_used
                  0             76202               0
                  2             28556               2
                  4             21620               3
                  8             21693               3
                 16             21654               3
                 32             21579               3
                 64             21474               3



Patch

max_parallel_degree    exec_time (ms)    workers_used
                  0             77027               0
                  2             27088               2
                  4             16648               4
                  8             13730               5
                 16             13787               5
                 32             13794               5
                 64             13872               5


So here we can see that performance is better with the patch, but I think
that is mainly due to the number of workers working on the plan.  It is not
clear whether allowing more workers at higher max_parallel_degree settings
would have given any substantial benefit; in any case, I think that is a
generic worker-allocation improvement which is not directly related to this
patch.  The data at different selectivities can be found in the attached
document; it shows more or less the same trend.  Apart from this, I have
tried a data set which doesn't fit in shared_buffers but fits in RAM, and
that also shows a similar trend.

The patch looks good, apart from the worker-allocation issue, but I think we
can deal with that separately.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pgbench_partitions.sql
Description: Binary data


parallel_append.sh
Description: Bourne shell script


parallel_append_data.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-23 Thread Amit Kapila
On Fri, Nov 20, 2015 at 7:06 PM, Robert Haas  wrote:
>
> On Fri, Nov 20, 2015 at 12:45 AM, Amit Kapila wrote:
> > Okay, but I think that's not what I am talking about.  I am talking about
> > the below code in cost_seqscan:
> >
> > - if (nworkers > 0)
> >
> > - run_cost = run_cost / (nworkers + 0.5);
> >
> > + if (path->parallel_degree > 0)
> >
> > + run_cost = run_cost / (path->parallel_degree + 0.5);
> >
> >
> > It will consider 50% of the master backend's effort for the scan of each
> > child relation; does that look correct to you?  Wouldn't 50% of the master
> > backend's effort be considered to scan all the child relations?
>
> In the code you originally wrote, you were adding 1 there rather than
> 0.5.  That meant you were expecting the leader to do as much work as
> each of its workers, which is clearly a bad estimate, because the
> leader also has to do the work of gathering tuples from the workers.
> 0.5 might not be the right value, but it's surely better than 1.
>

Without this patch, that 0.5 (or 50% of the leader's effort) is considered
for the Gather node irrespective of the number of workers or other factors,
but I think with the patch that is no longer true, and that's what I am
worrying about.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-23 Thread Robert Haas
On Mon, Nov 23, 2015 at 7:45 AM, Amit Kapila  wrote:
> Without this patch, that 0.5 (or 50% of the leader's effort) is considered
> for the Gather node irrespective of the number of workers or other factors,
> but I think with the patch that is no longer true, and that's what I am
> worrying about.

Nope, that patch does not change that at all.  We probably should, but
this patch does not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-20 Thread Robert Haas
On Fri, Nov 20, 2015 at 12:45 AM, Amit Kapila  wrote:
> Okay, but I think that's not what I am talking about.  I am talking about
> the below code in cost_seqscan:
>
> - if (nworkers > 0)
>
> - run_cost = run_cost / (nworkers + 0.5);
>
> + if (path->parallel_degree > 0)
>
> + run_cost = run_cost / (path->parallel_degree + 0.5);
>
>
> It will consider 50% of the master backend's effort for the scan of each
> child relation; does that look correct to you?  Wouldn't 50% of the master
> backend's effort be considered to scan all the child relations?

In the code you originally wrote, you were adding 1 there rather than
0.5.  That meant you were expecting the leader to do as much work as
each of its workers, which is clearly a bad estimate, because the
leader also has to do the work of gathering tuples from the workers.
0.5 might not be the right value, but it's surely better than 1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Amit Kapila
On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas  wrote:
>
> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila wrote:
> > Don't we need the startup cost in case we need to build partial paths for
> > joinpaths like mergepath?
> > Also, I think there are other cases for a single-relation scan where the
> > startup cost can matter, like when there are pseudoconstants in the
> > qualification (refer cost_qual_eval_walker()) or, let us say, if someone
> > has disabled seq scan (disable_cost is considered as startup cost).
>
> I'm not saying that we don't need to compute it.  I'm saying we don't
> need to take it into consideration when deciding which paths have
> merit.  Note that consider_startup is set this way:
>
> rel->consider_startup = (root->tuple_fraction > 0);
>

Even when consider_startup is false, startup_cost is still used for the cost
calculation.  Maybe ignoring it is okay for partial paths, but it still seems
worth thinking about why leaving it out is okay for partial paths even though
it is used in add_path().

+ *  We don't generate parameterized partial paths because they seem unlikely
+ *  ever to be worthwhile.  The only way we could ever use such a path is
+ *  by executing a nested loop with a complete path on the outer side - thus,
+ *  each worker would scan the entire outer relation - and the partial path
+ *  on the inner side - thus, each worker would scan only part of the inner
+ *  relation.  This is silly: a parameterized path is generally going to be
+ *  based on an index scan, and we can't generate a partial path for that.

Won't it be useful to consider parameterized paths for the below kind of
plan, where we can push the join tree to the workers and each worker can
scan the complete outer relation A, with the rest of the work divided among
the workers (of course there can be other ways to parallelize such joins,
but the way described also seems possible)?

NestLoop
  -> Seq Scan on A
  -> Hash Join
       Join Condition: B.Y = C.W
       -> Seq Scan on B
       -> Index Scan using C_Z_IDX on C
            Index Condition: C.Z = A.X

-
Is the main reason to have add_partial_path() that it has fewer checks, or
is it that the current add_path() would give wrong answers in some case?

If there is no case where add_path() can't work, then there is some
advantage in retaining add_path(), at least in terms of maintaining
the code.

+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
{
..
+ /* Unless pathkeys are incompable, keep just one of the two paths. */
..

typo - 'incompable'


> > A.
> > This means that for inheritance child relations for which rel pages are
> > less than parallel_threshold, it will always consider the cost shared
> > between 1 worker and leader as per below calc in cost_seqscan:
> > if (path->parallel_degree > 0)
> > run_cost = run_cost / (path->parallel_degree + 0.5);
> >
> > I think this might not be the appropriate cost model even for
> > non-inheritance relations which have more pages than parallel_threshold,
> > but it seems to be even worse for inheritance children which have
> > fewer pages than parallel_threshold.
>
> Why?

Because of the way the code is written, it assumes that for each
inheritance-child relation which has fewer pages than the threshold, half
the work will be done by the master backend, which doesn't seem to be the
right distribution.  Consider a case where there are three such children,
each costing 100 to scan; it will now cost them as
100/1.5 + 100/1.5 + 100/1.5, which means that for each child it is
crediting the master backend with 0.5 of a worker's effort, which seems wrong.

I think for the Append case we should consider this cost during Append path
creation in create_append_path().  Basically, we can make cost_seqscan
ignore the cost reduction due to parallel_degree for inheritance relations,
and then during Append path creation we can consider it and also count the
master backend's work unit as 0.5 with respect to the overall work, roughly
as sketched below.
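
A minimal sketch of what I mean (illustrative only, not a proposed change as
such; it assumes the child costs were computed without any parallel discount):

	/*
	 * Illustrative only: charge the per-child scan costs at the Append level
	 * and divide once by the worker count plus a 0.5 share for the master
	 * backend, instead of applying that 0.5 share to every child inside
	 * cost_seqscan.
	 */
	Cost		child_total = 0;
	ListCell   *l;

	foreach(l, subpaths)
		child_total += ((Path *) lfirst(l))->total_cost;

	if (parallel_degree > 0)
		child_total /= (parallel_degree + 0.5);

	pathnode->path.total_cost = child_total;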

-
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costly them accurately) is likely to keep us
+busy for a long time to come.

There seems to be a typo in the above text: s/costly/cost/.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Amit Kapila
On Thu, Nov 19, 2015 at 1:29 PM, Amit Kapila wrote:
>
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas wrote:
> -
> Is the main reason to have add_partial_path() that it has fewer checks, or
> is it that the current add_path() would give wrong answers in some case?
>
> If there is no case where add_path() can't work, then there is some
> advantage in retaining add_path(), at least in terms of maintaining
> the code.

To be specific, I am referring to the logic of add_path().


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila  wrote:
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas  wrote:
>>
>> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila wrote:
>> > Don't we need the startup cost in case we need to build partial paths for
>> > joinpaths like mergepath?
>> > Also, I think there are other cases for a single-relation scan where the
>> > startup cost can matter, like when there are pseudoconstants in the
>> > qualification (refer cost_qual_eval_walker()) or, let us say, if someone
>> > has disabled seq scan (disable_cost is considered as startup cost).
>>
>> I'm not saying that we don't need to compute it.  I'm saying we don't
>> need to take it into consideration when deciding which paths have
>> merit.  Note that consider_startup is set this way:
>>
>> rel->consider_startup = (root->tuple_fraction > 0);
>>
>
> Even when consider_startup is false, startup_cost is still used for the cost
> calculation.  Maybe ignoring it is okay for partial paths, but it still seems
> worth thinking about why leaving it out is okay for partial paths even though
> it is used in add_path().

That is explained in the comments, and I just explained it again in my
previous email.  I'm not sure how much clearer I can make it.  For a
regular path, it might sometimes be useful to pick a path with a
higher total cost if it has a lower startup cost.  The reason this
could be useful is because we might not run the resulting plan to
completion.  However, parallel queries are always run to completion,
so a lower startup cost isn't useful.  We just want the path with the
lowest total cost.  I don't know what else to say here unless you can
ask a more specific question.

> Won't it be useful to consider parameterized paths for the below kind of
> plan, where we can push the join tree to the workers and each worker can
> scan the complete outer relation A, with the rest of the work divided among
> the workers (of course there can be other ways to parallelize such joins,
> but the way described also seems possible)?
>
> NestLoop
>   -> Seq Scan on A
>   -> Hash Join
>        Join Condition: B.Y = C.W
>        -> Seq Scan on B
>        -> Index Scan using C_Z_IDX on C
>             Index Condition: C.Z = A.X

I had thought that this sort of plan wouldn't actually occur in real
life, but it seems that it does.  What you've written here is a little
muddled - the hash join has no hash underneath, for example, and
there'd have to be some sort of join order restriction in order to
consider a plan of this type.  However, somewhat to my surprise, I was
able to get a plan much like this by doing this:

rhaas=# create table a (x int);
CREATE TABLE
rhaas=# insert into a values (1);
INSERT 0 1
rhaas=# create table b (y int, filler text);
CREATE TABLE
rhaas=# insert into b select g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,100) g;
INSERT 0 100
rhaas=# create table c (z int, w int, filler text);
CREATE TABLE
rhaas=# insert into c select g, g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,100) g;
INSERT 0 100
rhaas=# create index c_z_idx on c (z);
CREATE INDEX
rhaas=# vacuum analyze;
VACUUM
rhaas=# explain analyze select * from A LEFT JOIN (B INNER JOIN C ON
B.Y = C.W) ON C.Z = A.x;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=8.46..26810.48 rows=1 width=152) (actual time=0.076..166.946 rows=1 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=4) (actual time=0.015..0.016 rows=1 loops=1)
   ->  Hash Join  (cost=8.46..26809.47 rows=1 width=148) (actual time=0.057..166.925 rows=1 loops=1)
         Hash Cond: (b.y = c.w)
         ->  Seq Scan on b  (cost=0.00..23051.00 rows=100 width=72) (actual time=0.012..89.013 rows=100 loops=1)
         ->  Hash  (cost=8.44..8.44 rows=1 width=76) (actual time=0.035..0.035 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Scan using c_z_idx on c  (cost=0.42..8.44 rows=1 width=76) (actual time=0.031..0.032 rows=1 loops=1)
                     Index Cond: (z = a.x)
 Planning time: 0.394 ms
 Execution time: 167.015 ms
(11 rows)

This is extremely narrow.  If you have more than one row in A, the
planner doesn't pick a nested loop.  And if you're actually going to
be running queries like this frequently, then you should create an
index on B (Y), which causes you to get an execution time of ~ 0.2 ms
instead of 167 ms, because we generate a parameterized nestloop over
two index scans.  But if you have no index on B (Y) and a does contain
precisely one row, then you can get this sort of plan, and hey, the
runtime is even long enough for parallelism to potentially be useful.

But after thinking about it for a while, I realized that even if you
think 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Amit Kapila
On Fri, Nov 20, 2015 at 1:25 AM, Robert Haas  wrote:
>
> On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila wrote:
> > Won't it be useful to consider parameterized paths for the below kind of
> > plan, where we can push the join tree to the workers and each worker can
> > scan the complete outer relation A, with the rest of the work divided among
> > the workers (of course there can be other ways to parallelize such joins,
> > but the way described also seems possible)?
> >
> > NestLoop
> >   -> Seq Scan on A
> >   -> Hash Join
> >        Join Condition: B.Y = C.W
> >        -> Seq Scan on B
> >        -> Index Scan using C_Z_IDX on C
> >             Index Condition: C.Z = A.X
>
> I had thought that this sort of plan wouldn't actually occur in real
> life, but it seems that it does.  What you've written here is a little
> muddled - the hash join has no hash underneath, for example, and
> there'd have to be some sort of join order restriction in order to
> consider a plan of this type.  However, somewhat to my surprise, I was
> able to get a plan much like this by doing this:
>
..
>
> So, all in all, I think this isn't a very promising type of plan -
> both because we haven't got the infrastructure to make it safe to
> execute today, and because even if we did have that infrastructure it
> wouldn't be the right choice except in narrow circumstances.
>

I think it would be helpful to parallelize not only the above type of plan
but also some other forms of joins (refer to the "Parameterized Paths"
section in optimizer/README) where the parameterized-path concept will be
required.  I am not sure we can say that such cases will be narrow, so let's
leave them; but surely we don't have enough infrastructure at the moment to
parallelize them.


>  We can
> of course revise that decision in the future if things look different
> then.
>

No issues.  The main reason I brought up this discussion is to see whether
the logic of add_partial_path() and add_path() could be kept the same, so
that it is easier to maintain.  There is no correctness issue here, so I
defer to you.

> >
> > Because of the way the code is written, it assumes that for each
> > inheritance-child relation which has fewer pages than the threshold, half
> > the work will be done by the master backend, which doesn't seem to be the
> > right distribution.  Consider a case where there are three such children,
> > each costing 100 to scan; it will now cost them as
> > 100/1.5 + 100/1.5 + 100/1.5, which means that for each child it is
> > crediting the master backend with 0.5 of a worker's effort, which seems wrong.
> >
> > I think for the Append case we should consider this cost during Append path
> > creation in create_append_path().  Basically, we can make cost_seqscan
> > ignore the cost reduction due to parallel_degree for inheritance relations,
> > and then during Append path creation we can consider it and also count the
> > master backend's work unit as 0.5 with respect to the overall work.
>
> No, I don't think that's right.  It's true that the way we're
> calculating parallel_degree for each relation is unprincipled right
> now, and we need to improve that.  But if it were correct, then what
> we're doing here would also be correct.  If the number of workers
> chosen for each child plan reflected the maximum number that could be
> used effectively by that child plan, then any extras wouldn't speed
> things up even if they were present,
>

Okay, but I think that's not what I am talking about.  I am talking about
the below code in cost_seqscan:

- if (nworkers > 0)

- run_cost = run_cost / (nworkers + 0.5);

+ if (path->parallel_degree > 0)

+ run_cost = run_cost / (path->parallel_degree + 0.5);


It will consider 50% of the master backend's effort for the scan of each
child relation; does that look correct to you?  Wouldn't 50% of the master
backend's effort be considered to scan all the child relations?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-18 Thread Robert Haas
On Tue, Nov 17, 2015 at 4:59 PM, Thom Brown  wrote:
> On 17 November 2015 at 20:08, Robert Haas  wrote:
>> On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
>>
>>> However, the first parallel seq scan shows it getting 170314 rows.
>>> Another run shows it getting 194165 rows.  The final result is
>>> correct, but as you can see from the rows on the Append node (59094295
>>> rows), it doesn't match the number of rows on the Gather node
>>> (3000).
>>
>> Is this the same issue reported in
>> http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
>> and not yet fixed?  I am inclined to think it probably is.
>
> Yes, that seems to be the same issue.

I've committed a fix for that issue now, so you shouldn't see it any
more if you retest this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-18 Thread Robert Haas
On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila  wrote:
> Don't we need the startup cost in case we need to build partial paths for
> joinpaths like mergepath?
> Also, I think there are other cases for a single-relation scan where the
> startup cost can matter, like when there are pseudoconstants in the
> qualification (refer cost_qual_eval_walker()) or, let us say, if someone
> has disabled seq scan (disable_cost is considered as startup cost).

I'm not saying that we don't need to compute it.  I'm saying we don't
need to take it into consideration when deciding which paths have
merit.  Note that consider_startup is set this way:

rel->consider_startup = (root->tuple_fraction > 0);

And for a joinrel:

joinrel->consider_startup = (root->tuple_fraction > 0);

root->tuple_fraction is 0 when we expect all tuples to be retrieved,
and parallel query can currently only be used when all tuples will be
retrieved.

> B. I think the partial path is an important concept and deserves some
> explanation in src/backend/optimizer/README.
> There is already a good explanation about Paths, so it seems better
> to add an explanation about partial paths as well.

Good idea.  In the attached, revised version of the patch, I've added
a large new section to that README.

> 2.
> + *  costs: parallelism is only used for plans that will be run to completion.
> + *Therefore, this routine is much simpler than add_path: it needs to
> + *consider only pathkeys and total cost.
>
> There seems to be some spacing issue in the last two lines.

Fixed.

> A.
> This means that for inheritance child relations for which rel pages are
> less than parallel_threshold, it will always consider the cost shared
> between 1 worker and leader as per below calc in cost_seqscan:
> if (path->parallel_degree > 0)
> run_cost = run_cost / (path->parallel_degree + 0.5);
>
> I think this might not be the appropriate cost model even for
> non-inheritance relations which have more pages than parallel_threshold,
> but it seems to be even worse for inheritance children which have
> fewer pages than parallel_threshold.

Why?  I'm certainly open to patches to improve the cost model, but I
don't see why this patch needs to do that.

> B.
> If none of the inheritance child rels (or very few of them) are big enough
> for a parallel scan, is considering an Append node for parallelism of any
> use?  In other words, won't it be better to generate the plan as it is done
> now, without this patch, for such cases, considering the current execution
> model of the Gather node?

I think we should instead extend the Append node as suggested by
KaiGai, so that it can have both partial and non-partial children.  I
think we can leave that to another patch, though.  Aside from
questions of what the fastest plan are, it's very bad to have Gather
nodes under the Append, because the Append could have many children
and we could end up with a ton of Gather nodes, each using a DSM and a
bunch of workers.  That's important to avoid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 012c14b..fe07176 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1588,6 +1588,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1764,7 +1765,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1887,6 +1887,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 916a518..5019804 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  Academic literature on parallel query suggests that
+that parallel 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-18 Thread Amit Kapila
On Sat, Nov 14, 2015 at 3:39 AM, Robert Haas  wrote:
>
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.
>

A couple of comments and questions regarding this patch:

1.
+/*
+ * add_partial_path
..
+ *  produce the same number of rows.  Neither do we need to consider startup
+ *  costs: parallelism is only used for plans that will be run to completion.

A.
Don't we need the startup cost in case we need to build partial paths for
joinpaths like mergepath?
Also, I think there are other cases for a single-relation scan where the
startup cost can matter, like when there are pseudoconstants in the
qualification (refer cost_qual_eval_walker()) or, let us say, if someone
has disabled seq scan (disable_cost is considered as startup cost).

B. I think the partial path is an important concept and deserves some
explanation in src/backend/optimizer/README.
There is already a good explanation about Paths, so it seems better
to add an explanation about partial paths as well.

2.
+ *  costs: parallelism is only used for plans that will be run to completion.
+ *Therefore, this routine is much simpler than add_path: it needs to
+ *consider only pathkeys and total cost.

There seems to be some spacing issue in the last two lines.

3.
+static void
+create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+ int parallel_threshold = 1000;
+ int parallel_degree = 1;
+
+ /*
+ * If this relation is too small to be worth a parallel scan, just return
+ * without doing anything ... unless it's an inheritance child.  In that case,
+ * we want to generate a parallel path here anyway.  It might not be worthwhile
+ * just for this relation, but when combined with all of its inheritance
+ * siblings it may well pay off.
+ */
+ if (rel->pages < parallel_threshold && rel->reloptkind == RELOPT_BASEREL)
+ return;

A.
This means that for inheritance child relations for which rel pages are
less than parallel_threshold, it will always consider the cost shared
between 1 worker and leader as per below calc in cost_seqscan:
if (path->parallel_degree > 0)
run_cost = run_cost / (path->parallel_degree + 0.5);

I think this might not be the appropriate cost model even for
non-inheritance relations which have more pages than parallel_threshold,
but it seems to be even worse for inheritance children which have
fewer pages than parallel_threshold.

B.
If none of the inheritance child rels (or very few of them) are big enough
for a parallel scan, is considering an Append node for parallelism of any
use?  In other words, won't it be better to generate the plan as it is done
now, without this patch, for such cases, considering the current execution
model of the Gather node?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Robert Haas
On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
> Okay, I've tried this patch.

Thanks!

> Yes, it's working!

Woohoo.

> However, the first parallel seq scan shows it getting 170314 rows.
> Another run shows it getting 194165 rows.  The final result is
> correct, but as you can see from the rows on the Append node (59094295
> rows), it doesn't match the number of rows on the Gather node
> (3000).

Is this the same issue reported in
http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
and not yet fixed?  I am inclined to think it probably is.

> And also, for some reason, I can no longer get this using more than 2
> workers, even with max_worker_processes = 16 and max_parallel_degree =
> 12.  I don't know if that's anything to do with this patch though.

The number of workers is limited based on the size of the largest
table involved in the Append.  That probably needs considerable
improvement, of course, but this patch is still a step forward over
not-this-patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Thom Brown
On 13 November 2015 at 22:09, Robert Haas  wrote:
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai  wrote:
>> I'm now designing the parallel feature of Append...
>>
>> Here is one challenge. How do we determine whether each sub-plan
>> allows execution in the background worker context?
>
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.  Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
>
> Append
> -> Gather
>   -> Parallel Seq Scan
> -> Gather
>   -> Parallel Seq Scan
> -> Seq Scan
>
> This is a crappy plan.  Each Gather node will try to launch its own
> bunch of workers, which sucks.  The attached patch lets you get this
> plan instead:
>
> Gather
> -> Append
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>
> That's a much better plan.
>
> To make this work, this plan introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan.  Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path; or we can use those partial paths
> to build more partial paths.  The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel.  But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path to form a partial path for the joinrel.
>
> This requires some way of figuring out how many workers to request for
> the append-path, so this patch also adds a parallel_degree field to
> the path object, which is 0 for non-parallel things and the number of
> workers that the path believes to be ideal otherwise.  Right now, it
> just percolates the highest worker count of any child up to the
> appendrel, which might not be right, especially once the append node
> knows how to balance workers, but it seems like a reasonable place to
> start.
>
>> Type-A) It can be performed on background worker context and
>>picked up by multiple worker processes concurrently.
>>   (e.g: Parallel SeqScan)
>> Type-B) It can be performed on background worker context but
>>   cannot be picked up by multiple worker processes.
>>   (e.g: non-parallel aware node)
>> Type-C) It should not be performed on background worker context.
>>(e.g: plan/path node with volatile functions)
>
> At the time that we're forming an AppendPath, we can identify what
> you're calling type-A paths very easily: they're the ones that are
> coming from the partial_pathlist.  We can distinguish between type-B
> and type-C paths coming from the childrel's pathlist based on the
> childrel's consider_parallel flag: if it's false, it's type-C, else
> type-B.  At some point, we might need a per-path flag to distinguish
> cases where a particular path is type-C even though some other plan
> for that relation might be type-B.  But right now that case doesn't
> arise.
>
> Now, of course, it's not good enough to have this information
> available when we're generating the AppendPath; it has to survive
> until execution time.  But that doesn't mean the paths need to be
> self-identifying.  We could, of course, decide to make them so, and
> maybe that's the best design, but I'm thinking about another approach:
> suppose the append node itself, instead of having one list of child
> plans, has a list of type-A plans, a list of type-B plans, and a list
> of type-C plans.  This actually seems quite convenient, because at
> execution time, you presumably want the leader to prioritize type-C
> plans over any of the others, since it's the only one that can execute
> them, and the workers to prioritize type-B plans, since they can only
> take one worker each and are thus prone to be the limiting factor on
> finishing the whole Append.  Having the plans divided in advance
> between these three lists (say, restricted_plans, safe_plans,
> parallel_plans) makes that easy to implement.
>
> Incidentally, I think it's subtly wrong to think of the parallel_aware
> flag as telling you whether the plan can absorb multiple workers.
> That's not really what it's for.  It's to tell you whether the plan is
> doing *something* parallel aware - that is, whether its Estimate,
> InitializeDSM, and InitializeWorker callbacks should do anything.  For
> SeqScan, flipping parallel_aware actually does split the input among
> all the workers, but for Append it's probably just load balances and
> for other nodes it might be something else again.  The term I'm using
> to indicate a path/plan that returns only a subset of the results in

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-17 Thread Thom Brown
On 17 November 2015 at 20:08, Robert Haas  wrote:
> On Tue, Nov 17, 2015 at 4:26 AM, Thom Brown  wrote:
>
>> However, the first parallel seq scan shows it getting 170314 rows.
>> Another run shows it getting 194165 rows.  The final result is
>> correct, but as you can see from the rows on the Append node (59094295
>> rows), it doesn't match the number of rows on the Gather node
>> (3000).
>
> Is this the same issue reported in
> http://www.postgresql.org/message-id/CAFj8pRBF-i=qdg9b5nzrxyfchzbezwmthxyphidqvwomojh...@mail.gmail.com
> and not yet fixed?  I am inclined to think it probably is.

Yes, that seems to be the same issue.

>> And also, for some reason, I can no longer get this using more than 2
>> workers, even with max_worker_processes = 16 and max_parallel_degree =
>> 12.  I don't know if that's anything to do with this patch though.
>
> The number of workers is limited based on the size of the largest
> table involved in the Append.  That probably needs considerable
> improvement, of course, but this patch is still a step forward over
> not-this-patch.

Ah, okay.  I wasn't aware of this.  I'll bear that in mind in future.

Thom




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-16 Thread Kouhei Kaigai
> On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai  wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
> 
> I've been thinking about these questions for a bit now, and I think we
> can work on improving Append in multiple phases.  The attached patch
> shows what I have in mind for phase 1.  Currently, if you set up an
> inheritance hierarchy, you might get an Append some of whose children
> are Gather nodes with Parallel Seq Scans under them, like this:
> 
> Append
> -> Gather
>   -> Parallel Seq Scan
> -> Gather
>   -> Parallel Seq Scan
> -> Seq Scan
> 
> This is a crappy plan.  Each Gather node will try to launch its own
> bunch of workers, which sucks.  The attached patch lets you get this
> plan instead:
> 
> Gather
> -> Append
>   -> Partial Seq Scan
>   -> Partial Seq Scan
>   -> Partial Seq Scan
> 
> That's a much better plan.
> 
> To make this work, this plan introduces a couple of new concepts.
> Each RelOptInfo gets a new partial_pathlist field, which stores paths
> that need to be gathered to produce a workable plan.  Once we've
> populated the partial_pathlist with whatever partial paths we know how
> to generate, we can either push a Gather node on top of one of those
> partial paths to create a real path; or we can use those partial paths
> to build more partial paths.  The current patch handles only the
> simplest case: we can build a partial path for an appendrel by
> appending a partial path for each member rel.  But eventually I hope
> to handle joinrels this way as well: you can join a partial path with
> an ordinary path to form a partial path for the joinrel.
>
This idea will solve my concern gracefully.
The new partial_pathlist keeps candidate path nodes to be gathered at this
level or above.  Unlike path nodes already in the pathlist, we don't need
to rip off a GatherPath later.

Can we expect that path nodes in the partial_pathlist will not contain an
underlying GatherPath, even if and when we apply this design to joinrels
as well?

If we can ensure that the path nodes in the partial_pathlist are safe to
run under a Gather node - that they never contain a Gather themselves, or
anything else that should not run in the background-worker context - it
will make path consideration much simpler than I expected.

> This requires some way of figuring out how many workers to request for
> the append-path, so this patch also adds a parallel_degree field to
> the path object, which is 0 for non-parallel things and the number of
> workers that the path believes to be ideal otherwise.  Right now, it
> just percolates the highest worker count of any child up to the
> appendrel, which might not be right, especially once the append node
> knows how to balance workers, but it seems like a reasonable place to
> start.
>
I agree.  The new parallel_degree will give useful information,
and one worker per child relation is a good start.

> > Type-A) It can be performed on background worker context and
> >picked up by multiple worker processes concurrently.
> >   (e.g: Parallel SeqScan)
> > Type-B) It can be performed on background worker context but
> >   cannot be picked up by multiple worker processes.
> >   (e.g: non-parallel aware node)
> > Type-C) It should not be performed on background worker context.
> >(e.g: plan/path node with volatile functions)
> 
> At the time that we're forming an AppendPath, we can identify what
> you're calling type-A paths very easily: they're the ones that are
> coming from the partial_pathlist.  We can distinguish between type-B
> and type-C paths coming from the childrel's pathlist based on the
> childrel's consider_parallel flag: if it's false, it's type-C, else
> type-B.  At some point, we might need a per-path flag to distinguish
> cases where a particular path is type-C even though some other plan
> for that relation might be type-B.  But right now that case doesn't
> arise.
>
I also think we will eventually have to have a per-path flag when we support
parallel capability on joinrels, although it is not an immediate action.

> Now, of course, it's not good enough to have this information
> available when we're generating the AppendPath; it has to survive
> until execution time.  But that doesn't mean the paths need to be
> self-identifying.  We could, of course, decide to make them so, and
> maybe that's the best design, but I'm thinking about another approach:
> suppose the append node itself, instead of having one list of child
> plans, has a list of type-A plans, a list of type-B plans, and a list
> of type-C plans.  This actually seems quite convenient, because at
> execution time, you presumably want the leader to prioritize type-C
> plans over any of the others, since it's the only one that can execute
> them, and the workers to prioritize type-B plans, since they can only
> take one worker each and are 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-16 Thread Robert Haas
On Mon, Nov 16, 2015 at 10:10 AM, Kouhei Kaigai  wrote:
> This idea will solve my concern gracefully.
> The new partial_pathlist keeps candidate path nodes to be gathered at this
> level or above.  Unlike path nodes already in the pathlist, we don't need
> to rip off a GatherPath later.

Cool, yes.

> Can we expect that path nodes in the partial_pathlist will not contain an
> underlying GatherPath, even if and when we apply this design to joinrels
> as well?

Yes.  A path that's already been gathered is not partial any more.
However, to create a partial path for a joinrel, we must join a
partial path to a complete path.  The complete path mustn't be one
which internally contains a Gather.   This is where we need another
per-path flag, I think.
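
In other words, very roughly (illustrative only; contains_gather is the
hypothetical per-path flag, and create_partial_join_path() stands in for
whatever join-path constructor would actually be used):

	/*
	 * Illustrative: a partial path for a joinrel joins a partial outer path
	 * to a complete inner path, and the inner path must not hide a Gather.
	 */
	if (!inner_path->contains_gather)
		add_partial_path(joinrel,
						 create_partial_join_path(outer_partial_path,
												  inner_path));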

> I'd like to agree with this idea.  If Append can handle restricted_plans
> concurrently with safe_plans and parallel_plans, we don't need to give
> up parallelism even if some child relation has neither safe nor
> parallel plans.

Right.

> One thing we need to pay attention to: we have to inform the Gather node
> to kick off local sub-plans if the underlying Append node has any restricted
> plans.  It also needs to distinguish the case when the Gather node cannot
> launch any background workers, because the first case runs only the type-C
> sub-plans but the second case has to run all the sub-plans in the local context.

I don't think that Gather needs to know anything about what's under
the Append.  What I think we want is that when we execute the Append:

(1) If we're the leader or not in parallel mode, run restricted plans,
then parallel plans, then safe plans.
(2) If we're a worker, run safe plans, then parallel plans.
(3) Either way, never run a safe plan if the leader or some other
worker has already begun to execute it.

The reason to have the leader prefer parallel plans to safe plans is
that it is more likely to become a bottleneck than the workers.  Thus
it should prefer to do work which can be split up rather than claiming
a whole plan for itself.  But in the case of restricted plans it has
no choice, since no one else can execute those, and it should do them
first, since they may be the limiting factor in finishing the whole
plan.
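
In other words, something like this (purely illustrative pseudo-code;
claim_if_unstarted(), any_unfinished(), and the three per-Append lists are
hypothetical bookkeeping over shared memory, not anything in the patch):

	static Plan *
	choose_next_subplan(AppendState *node, bool am_leader)
	{
		Plan	   *plan;

		if (am_leader)
		{
			/* Only the leader may run restricted plans, so do those first. */
			if ((plan = claim_if_unstarted(node->restricted_plans)) != NULL)
				return plan;
			/* Then prefer work that can be shared among processes... */
			if ((plan = any_unfinished(node->parallel_plans)) != NULL)
				return plan;
			/* ...and only then claim a whole safe plan for ourselves. */
			return claim_if_unstarted(node->safe_plans);
		}

		/* Workers: claim a whole safe plan first, then help with parallel ones. */
		if ((plan = claim_if_unstarted(node->safe_plans)) != NULL)
			return plan;
		return any_unfinished(node->parallel_plans);
	}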

>> Incidentally, I think it's subtly wrong to think of the parallel_aware
>> flag as telling you whether the plan can absorb multiple workers.
>> That's not really what it's for.  It's to tell you whether the plan is
>> doing *something* parallel aware - that is, whether its Estimate,
>> InitializeDSM, and InitializeWorker callbacks should do anything.  For
>> SeqScan, flipping parallel_aware actually does split the input among
>> all the workers, but for Append it's probably just load balances and
>> for other nodes it might be something else again.  The term I'm using
>> to indicate a path/plan that returns only a subset of the results in
>> each worker is "partial".
>>
> Therefore, a NestLoop that takes an underlying ParallelSeqScan and IndexScan
> may not be parallel-aware by itself; however, it is exactly partial.

Right.

> This NestLoop will likely have a parallel_degree larger than "1", won't it?

Larger than 0.

> It seems to me that "partial" is a clearer concept for describing how a
> sub-plan will perform.

Good.

>> Whether or not a path is partial is, in the
>> design embodied in this patch, indicated both by whether
>> path->parallel_degree > 0 and whether the path is in rel->pathlist or
>> rel->partial_pathlist.
>>
> Should we have an Assert to detect paths with parallel_degree == 0 that are
> in rel->partial_pathlist, or with parallel_degree > 1 that do not appear in
> rel->partial_pathlist?

parallel_degree==0 in the partial_pathlist is bad, but
parallel_degree>0 in the regular pathlist is OK, at least if it's a
Gather node.
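
For example (illustrative; not code from the patch), add_partial_path()
could simply do:

	/* Every partial path must be able to use at least one worker. */
	Assert(new_path->parallel_degree > 0);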

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-13 Thread Robert Haas
On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai  wrote:
> I'm now designing the parallel feature of Append...
>
> Here is one challenge. How do we determine whether each sub-plan
> allows execution in the background worker context?

I've been thinking about these questions for a bit now, and I think we
can work on improving Append in multiple phases.  The attached patch
shows what I have in mind for phase 1.  Currently, if you set up an
inheritance hierarchy, you might get an Append some of whose children
are Gather nodes with Parallel Seq Scans under them, like this:

Append
-> Gather
  -> Parallel Seq Scan
-> Gather
  -> Parallel Seq Scan
-> Seq Scan

This is a crappy plan.  Each Gather node will try to launch its own
bunch of workers, which sucks.  The attached patch lets you get this
plan instead:

Gather
-> Append
  -> Partial Seq Scan
  -> Partial Seq Scan
  -> Partial Seq Scan

That's a much better plan.

To make this work, this plan introduces a couple of new concepts.
Each RelOptInfo gets a new partial_pathlist field, which stores paths
that need to be gathered to produce a workable plan.  Once we've
populated the partial_pathlist with whatever partial paths we know how
to generate, we can either push a Gather node on top of one of those
partial paths to create a real path; or we can use those partial paths
to build more partial paths.  The current patch handles only the
simplest case: we can build a partial path for an appendrel by
appending a partial path for each member rel.  But eventually I hope
to handle joinrels this way as well: you can join a partial path with
an ordinary path to form a partial path for the joinrel.

This requires some way of figuring out how many workers to request for
the append-path, so this patch also adds a parallel_degree field to
the path object, which is 0 for non-parallel things and the number of
workers that the path believes to be ideal otherwise.  Right now, it
just percolates the highest worker count of any child up to the
appendrel, which might not be right, especially once the append node
knows how to balance workers, but it seems like a reasonable place to
start.
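
The "percolates the highest worker count" part amounts to roughly this
(illustrative, not the exact code from the patch):

	/*
	 * Illustrative: the appendrel's parallel_degree is simply the maximum
	 * parallel_degree of any of its children.
	 */
	int			parallel_degree = 0;
	ListCell   *l;

	foreach(l, partial_subpaths)
		parallel_degree = Max(parallel_degree,
							  ((Path *) lfirst(l))->parallel_degree);
	pathnode->path.parallel_degree = parallel_degree;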

> Type-A) It can be performed on background worker context and
>picked up by multiple worker processes concurrently.
>   (e.g: Parallel SeqScan)
> Type-B) It can be performed on background worker context but
>   cannot be picked up by multiple worker processes.
>   (e.g: non-parallel aware node)
> Type-C) It should not be performed on background worker context.
>(e.g: plan/path node with volatile functions)

At the time that we're forming an AppendPath, we can identify what
you're calling type-A paths very easily: they're the ones that are
coming from the partial_pathlist.  We can distinguish between type-B
and type-C paths coming from the childrel's pathlist based on the
childrel's consider_parallel flag: if it's false, it's type-C, else
type-B.  At some point, we might need a per-path flag to distinguish
cases where a particular path is type-C even though some other plan
for that relation might be type-B.  But right now that case doesn't
arise.

Now, of course, it's not good enough to have this information
available when we're generating the AppendPath; it has to survive
until execution time.  But that doesn't mean the paths need to be
self-identifying.  We could, of course, decide to make them so, and
maybe that's the best design, but I'm thinking about another approach:
suppose the append node itself, instead of having one list of child
plans, has a list of type-A plans, a list of type-B plans, and a list
of type-C plans.  This actually seems quite convenient, because at
execution time, you presumably want the leader to prioritize type-C
plans over any of the others, since it's the only one that can execute
them, and the workers to prioritize type-B plans, since they can only
take one worker each and are thus prone to be the limiting factor on
finishing the whole Append.  Having the plans divided in advance
between these three lists (say, restricted_plans, safe_plans,
parallel_plans) makes that easy to implement.

Incidentally, I think it's subtly wrong to think of the parallel_aware
flag as telling you whether the plan can absorb multiple workers.
That's not really what it's for.  It's to tell you whether the plan is
doing *something* parallel aware - that is, whether its Estimate,
InitializeDSM, and InitializeWorker callbacks should do anything.  For
SeqScan, flipping parallel_aware actually does split the input among
all the workers, but for Append it's probably just load balances and
for other nodes it might be something else again.  The term I'm using
to indicate a path/plan that returns only a subset of the results in
each worker is "partial".  Whether or not a path is partial is, in the
design embodied in this patch, indicated both by whether
path->parallel_degree > 0 and whether the path is in rel->pathlist or
rel->partial_pathlist.

-- 
Robert Haas

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-12 Thread Kouhei Kaigai
> On 2015/11/12 14:09, Kouhei Kaigai wrote:
> > I'm now designing the parallel feature of Append...
> >
> > Here is one challenge. How do we determine whether each sub-plan
> > allows execution in the background worker context?
> >
> > The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added
> > 'parallel_aware' flag on Path and Plan structure.
> > It tells us whether the plan/path node can be performed by multiple
> > background worker processes concurrently, but it tells us nothing about
> > whether the plan/path node is safe to run in the background-worker
> > process context.
> 
> When I was looking at the recent parallelism related commits, I noticed a
> RelOptInfo.consider_parallel flag. That and the function
> set_rel_consider_parallel() may be of interest in this regard.
> set_append_rel_size() passes the parent's state of this flag down to child
> relations but I guess that's not  what you're after.
>
Thanks for this information.  Indeed, it tells us which base relations
are valid for parallel execution.
In the case of parallel Append, we can give up parallelism if any of the
underlying base relations is not parallel-aware.  We can potentially use
the same logic for join relations as well.

A challenge is how to count up the optimal number of background worker
processes.  I assume the number of workers for a parallel Append shall be
the sum of the numbers of workers required by the sub-plans, as long as it
does not exceed max_parallel_degree.
Probably we need a pathnode_tree_walker() to count up the number of workers
required by the sub-plans, roughly as sketched below.
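
(Illustrative only; pathnode_tree_walker() does not exist yet, and the
num_workers field name is just for the sake of the example.)

	/*
	 * Illustrative: sum up the workers wanted by the sub-plans and clamp
	 * the result to max_parallel_degree.
	 */
	int			total_workers = 0;
	ListCell   *lc;

	foreach(lc, subpaths)
		total_workers += ((Path *) lfirst(lc))->num_workers;
	total_workers = Min(total_workers, max_parallel_degree);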

BTW, is the idea of the consider_parallel flag in RelOptInfo workable for
join relations also?  In the case of A LEFT JOIN B, for example, the join
can be parallel-aware if the join path has A as the outer input, but it
cannot be if B is the outer input.
I suspect that this kind of information belongs to the Path, not the
RelOptInfo.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-11 Thread Kouhei Kaigai
I'm now designing the parallel feature of Append...

Here is one challenge. How do we determine whether each sub-plan
allows execution in the background worker context?

The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added
'parallel_aware' flag on Path and Plan structure.
It tells us whether the plan/path node can be performed by multiple
background worker processes concurrently, but it tells us nothing about
whether the plan/path node is safe to run in the background-worker
process context.

From the standpoint of parallel execution, I understand there are
three types of plan/path nodes.

Type-A) It can be performed on background worker context and
   picked up by multiple worker processes concurrently.
   (e.g: Parallel SeqScan)
Type-B) It can be performed on background worker context but
   cannot be picked up by multiple worker processes.
   (e.g: non-parallel aware node)
Type-C) It should not be performed on background worker context.
   (e.g: plan/path node with volatile functions)

The 'parallel_aware' flag allows us to distinguish type-A from the
others; however, we cannot distinguish type-B from type-C.
From the standpoint of parallel Append, it makes sense to launch
background workers even if all the sub-plans are type-B, with no
type-A node.

Sorry for the late response.  I'd like to suggest having 'int num_workers'
in the Path and Plan nodes as a common field.
We give this field the following three meanings:
- If num_workers > 1, it is a type-A node; thus, a parallel-aware
  Append node may assign more than one worker to this node.
- If num_workers == 1, it is a type-B node; thus, more than one
  background worker process shall never be assigned to it.
- If num_workers == 0, it is a type-C node; thus, the planner shall
  give up running this node in the background-worker context.

The num_workers state shall propagate to the upper node.
For example, I expect a HashJoin node that takes a Parallel
SeqScan with num_workers == 4 as its outer input will also have
num_workers == 4, as long as the join clauses are safe to run on
the background-worker side.
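
In terms of code, the consumer of this field might do something like the
following (illustrative only; the list names are made up for the example):

	/* Illustrative: how a parallel-aware Append might interpret num_workers. */
	if (subpath->num_workers == 0)
	{
		/* type-C: must run in the leader, never in a background worker */
		leader_only_subplans = lappend(leader_only_subplans, subpath);
	}
	else if (subpath->num_workers == 1)
	{
		/* type-B: worker-safe, but only one process may execute it */
		single_copy_subplans = lappend(single_copy_subplans, subpath);
	}
	else
	{
		/* type-A: parallel-aware; up to num_workers processes may share it */
		shared_subplans = lappend(shared_subplans, subpath);
	}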

How about the idea?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Saturday, October 31, 2015 1:35 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> 
> > On Wed, Oct 28, 2015 at 3:55 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > > At PGconf.EU, I could have a talk with Robert about this topic,
> > > then it became clear we have same idea.
> > >
> > >> ++
> > >> |sub-plan |   * Sub-Plan 1 ... Index Scan on p1
> > >> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
> > >> |shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
> > >> |memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
> > >> +-+   * Sub-Plan 3 ... Index Scan on p3
> > >>
> > > In the above example, I put non-parallel sub-plan to use only
> > > 1 slot of the array, even though a PartialSeqScan takes 3 slots.
> > > It is a strict rule; non-parallel aware sub-plan can be picked
> > > up once.
> > > The index of sub-plan array is initialized to 0, then increased
> > > to 5 by each workers when it processes the parallel-aware Append.
> > > So, once a worker takes non-parallel sub-plan, other worker can
> > > never take the same slot again, thus, no duplicated rows will be
> > > produced by non-parallel sub-plan in the parallel aware Append.
> > > Also, this array structure will prevent too large number of
> > > workers pick up a particular parallel aware sub-plan, because
> > > PartialSeqScan occupies 3 slots; that means at most three workers
> > > can pick up this sub-plan. If 1st worker took the IndexScan on
> > > p1, and 2nd-4th worker took the PartialSeqScan on p2, then the
> > > 5th worker (if any) will pick up the IndexScan on p3 even if
> > > PartialSeqScan on p2 was not completed.
> >
> > Actually, this is not exactly what I had in mind.  I was thinking that
> > we'd have a single array whose length is equal to the number of Append
> > subplans, and each element of the array would be a count of the number
> > of workers executing that subplan.  So there wouldn't be multiple
> > entries for the same subplan, as you propose here.  To distinguish
> > between parallel-aware and non-parallel-aware plans, I plan to put a
> > Boolean flag in the plan itself.
> >
> I don't have strong preference here. Both of design can implement the
> requirement; none-parallel 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-11 Thread Amit Langote
On 2015/11/12 14:09, Kouhei Kaigai wrote:
> I'm now designing the parallel feature of Append...
> 
> Here is one challenge. How do we determine whether each sub-plan
> allows execution in the background worker context?
> 
> The commit f0661c4e8c44c0ec7acd4ea7c82e85b265447398 added
> 'parallel_aware' flag on Path and Plan structure.
> It tells us whether the plan/path-node can perform by multiple
> background worker processes concurrently, but also tells us
> nothing whether the plan/path-node are safe to run on background
> worker process context.

When I was looking at the recent parallelism-related commits, I noticed a
RelOptInfo.consider_parallel flag. That and the function
set_rel_consider_parallel() may be of interest in this regard.
set_append_rel_size() passes the parent's state of this flag down to child
relations, but I guess that's not what you're after.

Thanks,
Amit





Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-30 Thread Robert Haas
On Wed, Oct 28, 2015 at 3:55 PM, Kouhei Kaigai  wrote:
> At PGconf.EU, I could have a talk with Robert about this topic,
> then it became clear we have same idea.
>
>> ++
>> |sub-plan |   * Sub-Plan 1 ... Index Scan on p1
>> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
>> |shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
>> |memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
>> +-+   * Sub-Plan 3 ... Index Scan on p3
>>
> In the above example, I put non-parallel sub-plan to use only
> 1 slot of the array, even though a PartialSeqScan takes 3 slots.
> It is a strict rule; non-parallel aware sub-plan can be picked
> up once.
> The index of sub-plan array is initialized to 0, then increased
> to 5 by each workers when it processes the parallel-aware Append.
> So, once a worker takes non-parallel sub-plan, other worker can
> never take the same slot again, thus, no duplicated rows will be
> produced by non-parallel sub-plan in the parallel aware Append.
> Also, this array structure will prevent too large number of
> workers pick up a particular parallel aware sub-plan, because
> PartialSeqScan occupies 3 slots; that means at most three workers
> can pick up this sub-plan. If 1st worker took the IndexScan on
> p1, and 2nd-4th worker took the PartialSeqScan on p2, then the
> 5th worker (if any) will pick up the IndexScan on p3 even if
> PartialSeqScan on p2 was not completed.

Actually, this is not exactly what I had in mind.  I was thinking that
we'd have a single array whose length is equal to the number of Append
subplans, and each element of the array would be a count of the number
of workers executing that subplan.  So there wouldn't be multiple
entries for the same subplan, as you propose here.  To distinguish
between parallel-aware and non-parallel-aware plans, I plan to put a
Boolean flag in the plan itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-30 Thread Kouhei Kaigai
> On Wed, Oct 28, 2015 at 3:55 PM, Kouhei Kaigai  wrote:
> > At PGconf.EU, I could have a talk with Robert about this topic,
> > then it became clear we have same idea.
> >
> >> ++
> >> |sub-plan |   * Sub-Plan 1 ... Index Scan on p1
> >> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
> >> |shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
> >> |memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
> >> +-+   * Sub-Plan 3 ... Index Scan on p3
> >>
> > In the above example, I put non-parallel sub-plan to use only
> > 1 slot of the array, even though a PartialSeqScan takes 3 slots.
> > It is a strict rule; non-parallel aware sub-plan can be picked
> > up once.
> > The index of sub-plan array is initialized to 0, then increased
> > to 5 by each workers when it processes the parallel-aware Append.
> > So, once a worker takes non-parallel sub-plan, other worker can
> > never take the same slot again, thus, no duplicated rows will be
> > produced by non-parallel sub-plan in the parallel aware Append.
> > Also, this array structure will prevent too large number of
> > workers pick up a particular parallel aware sub-plan, because
> > PartialSeqScan occupies 3 slots; that means at most three workers
> > can pick up this sub-plan. If 1st worker took the IndexScan on
> > p1, and 2nd-4th worker took the PartialSeqScan on p2, then the
> > 5th worker (if any) will pick up the IndexScan on p3 even if
> > PartialSeqScan on p2 was not completed.
> 
> Actually, this is not exactly what I had in mind.  I was thinking that
> we'd have a single array whose length is equal to the number of Append
> subplans, and each element of the array would be a count of the number
> of workers executing that subplan.  So there wouldn't be multiple
> entries for the same subplan, as you propose here.  To distinguish
> between parallel-aware and non-parallel-aware plans, I plan to put a
> Boolean flag in the plan itself.
>
I don't have a strong preference here. Both designs can implement the
requirement: non-parallel sub-plans are never picked up twice, and
parallel-aware sub-plans can be picked up multiple times.
So, I'll start with your suggestion above.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-28 Thread Kouhei Kaigai
At PGconf.EU, I had a talk with Robert about this topic, and it
became clear we have the same idea.

> ++
> |sub-plan |   * Sub-Plan 1 ... Index Scan on p1
> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
> |shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
> |memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
> +-+   * Sub-Plan 3 ... Index Scan on p3
>
In the above example, I made the non-parallel sub-plans use only
1 slot of the array, even though the PartialSeqScan takes 3 slots.
It is a strict rule: a non-parallel-aware sub-plan can be picked
up only once.
The index into the sub-plan array is initialized to 0, then increased
up to 5 by each worker as it processes the parallel-aware Append.
So, once a worker takes a non-parallel sub-plan, no other worker can
ever take the same slot again; thus, no duplicated rows will be
produced by a non-parallel sub-plan under the parallel-aware Append.
Also, this array structure will prevent too many workers from picking
up a particular parallel-aware sub-plan, because the PartialSeqScan
occupies 3 slots; that means at most three workers can pick up that
sub-plan. If the 1st worker took the IndexScan on p1, and the 2nd-4th
workers took the PartialSeqScan on p2, then the 5th worker (if any)
will pick up the IndexScan on p3 even if the PartialSeqScan on p2 has
not completed.



One other thought experiment: what happens if a parallel-aware
Append underlies another parallel-aware Append?
Literally, a parallel-aware Append is parallel aware; thus, it
can occupy multiple slots in the array of sub-plans, like:

subplans[0] ... SeqScan on p1
subplans[1] ... Parallel Append on p2+p3+p4
subplans[2] ... Parallel Append on p2+p3+p4
subplans[3] ... Parallel Append on p2+p3+p4
subplans[4] ... Parallel Append on p2+p3+p4
subplans[5] ... IndexScan on p5

Also, assume the child parallel-aware Append has the following
array.
subplans[0] ... SeqScan on p2
subplans[1] ... PartialSeqScan on p3
subplans[2] ... PartialSeqScan on p3
subplans[3] ... SeqScan on p4

The Gather node located on top of the upper Append node will
launch (at most) 6 workers according to the requirement of
the upper Append node.
Each worker picks up a particular sub-plan in the array of the
upper Append node, so some of them (4 workers at most) will
execute the child Append node.
Then, these 4 workers will also pick up a particular sub-plan
in the array of the child Append node.
It will work even if the number of workers is less than the
optimum, so I believe the overall design is reasonable.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Tuesday, October 27, 2015 6:46 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> 
> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> > Sent: Monday, October 26, 2015 8:53 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI
> > Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> >
> > On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > > I entirely agree with your suggestion.
> > >
> > > We may be able to use an analogy between PartialSeqScan and the
> > > parallel- aware Append node.
> > > PartialSeqScan fetches blocks pointed by the index on shared memory
> > > segment, thus multiple workers eventually co-operate to scan a table
> > > using round-robin manner even though individual worker fetches comb-
> > > shaped blocks.
> > > If we assume individual blocks are individual sub-plans on the parallel
> > > aware Append, it performs very similar. A certain number of workers
> > > (more than zero) is launched by Gather node, then the parallel aware
> > > Append node fetches one of the sub-plans if any.
> >
> > Exactly, except for the part about the blocks being "comb-shaped",
> > which doesn't seem to make sense.
> >
> > > I think, this design also gives additional flexibility according to
> > > the required parallelism by the underlying sub-plans.
> > > Please assume the "PartialSeqScan on p2" in the above example wants
> > > 3 workers to process the scan, we can expand the virtual array of
> > > the sub-plans as follows. Then, if Gather node kicks 5 workers,
> > > individual workers are assigned on some of plans. If Gather node
> > > could kick less than 5 workers, the first exit worker picks the
> > 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-26 Thread Robert Haas
On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai  wrote:
> I entirely agree with your suggestion.
>
> We may be able to use an analogy between PartialSeqScan and the
> parallel- aware Append node.
> PartialSeqScan fetches blocks pointed by the index on shared memory
> segment, thus multiple workers eventually co-operate to scan a table
> using round-robin manner even though individual worker fetches comb-
> shaped blocks.
> If we assume individual blocks are individual sub-plans on the parallel
> aware Append, it performs very similar. A certain number of workers
> (more than zero) is launched by Gather node, then the parallel aware
> Append node fetches one of the sub-plans if any.

Exactly, except for the part about the blocks being "comb-shaped",
which doesn't seem to make sense.

> I think, this design also gives additional flexibility according to
> the required parallelism by the underlying sub-plans.
> Please assume the "PartialSeqScan on p2" in the above example wants
> 3 workers to process the scan, we can expand the virtual array of
> the sub-plans as follows. Then, if Gather node kicks 5 workers,
> individual workers are assigned on some of plans. If Gather node
> could kick less than 5 workers, the first exit worker picks the
> second sub-plan, then it eventually provides the best parallelism.
>
> ++
> |sub-plan |   * Sub-Plan 1 ... Index Scan on p1
> |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
> |shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
> |memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
> +-+   * Sub-Plan 3 ... Index Scan on p3

I don't think the shared memory chunk should be indexed by worker, but
by sub-plan.  So with 3 subplans, we would initially have [0,0,0].
The first worker would grab the first subplan, and we get [1,0,0].
The second worker grabs the third subplan, so now we have [1,0,1].
The remaining workers can't join the execution of those plans because
they are not truly parallel, so they all gang up on the second
subplan.  At 5 workers we have [1,3,1].  Workers need not ever
decrement the array entries because they only pick a new sub-plan when
the one they picked previously is exhausted; thus, at the end of the
plan, each element in the array shows the total number of workers that
touched it at some point during its execution.
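
Just to make the bookkeeping concrete, here is a minimal sketch of how a
worker might pick its next subplan from such an array (names are
illustrative only; locking and the skipping of already-finished subplans
are left out, and the caller would increment the chosen counter):

static int
choose_next_subplan(int nplans, const int *pa_workers,
                    const bool *parallel_aware)
{
    int     best = -1;
    int     i;

    for (i = 0; i < nplans; i++)
    {
        if (!parallel_aware[i])
        {
            /* a non-parallel-aware subplan may be claimed by one worker only */
            if (pa_workers[i] == 0)
                return i;
        }
        else if (best < 0 || pa_workers[i] < pa_workers[best])
            best = i;       /* least-crowded parallel-aware subplan so far */
    }
    return best;            /* -1: nothing left for this worker to run */
}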

> For more generic plan construction, Plan node may have a field for
> number of "desirable" workers even though most of plan-nodes are not
> parallel aware, and it is not guaranteed.
> In above case, the parallel aware Append will want 5 workers in total
> (2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
> of Gather node how many workers are actually launched, however, it
> will give clear information how many workers are best.

Yeah, maybe.  I haven't thought about this deeply just yet, but I
agree it needs more consideration.

>> First, we can teach Append that, when running in parallel,
>> it should initialize a chunk of dynamic shared memory with an array
>> indicating how many workers are currently working on each subplan.
> Can the parallel-aware Append can recognize the current mode using
> MyBgworkerEntry whether it is valid or not?

No - that would be quite wrong.  What it needs to do is define
ExecAppendEstimate and ExecAppendInitializeDSM and call those
functions from ExecParallelEstimate and ExecParallelInitializeDSM.  It
also needs to define a third callback ExecAppendInitializeWorker which
will be called from the ExecParallelInitializeWorker function added by
the latest partial seq scan patch.  ExecAppendEstimate must estimate
required shared memory usage for the shared memory state;
ExecAppendInitializeDSM must initialize that state, store a pointer to
it in the planstate node, and add a TOC entry; ExecAppendInitializeWorker
will run in the worker and should look up the TOC entry and store the
result in the same planstate node that ExecAppendInitializeDSM
populated in the leader.  Then ExecAppend can decide what to do based
on whether that pointer is set, and based on the data to which it
points.
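
To make the shape of that concrete, here is a rough sketch of the shared
state and the three callbacks (PARALLEL_KEY_APPEND, the ParallelAppendInfo
layout, and the pappend_len/painfo fields in AppendState are assumptions
for illustration, not the actual patch):

typedef struct ParallelAppendInfo
{
    slock_t     pa_mutex;       /* protects the per-subplan counters */
    int         pa_workers[FLEXIBLE_ARRAY_MEMBER]; /* workers per subplan */
} ParallelAppendInfo;

void
ExecAppendEstimate(AppendState *node, ParallelContext *pcxt)
{
    /* reserve room for one counter per subplan, plus the lock */
    node->pappend_len = offsetof(ParallelAppendInfo, pa_workers) +
        sizeof(int) * node->as_nplans;
    shm_toc_estimate_chunk(&pcxt->estimator, node->pappend_len);
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

void
ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt)
{
    ParallelAppendInfo *painfo;

    painfo = shm_toc_allocate(pcxt->toc, node->pappend_len);
    memset(painfo, 0, node->pappend_len);
    SpinLockInit(&painfo->pa_mutex);
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_APPEND, painfo);
    node->painfo = painfo;      /* leader keeps a pointer in its planstate */
}

void
ExecAppendInitializeWorker(AppendState *node, shm_toc *toc)
{
    /* the worker looks up the same chunk and remembers it in its planstate */
    node->painfo = shm_toc_lookup(toc, PARALLEL_KEY_APPEND);
}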

Are you going to look at implementing this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-26 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Monday, October 26, 2015 8:53 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Amit Kapila; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> 
> On Sun, Oct 25, 2015 at 9:23 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I entirely agree with your suggestion.
> >
> > We may be able to use an analogy between PartialSeqScan and the
> > parallel- aware Append node.
> > PartialSeqScan fetches blocks pointed by the index on shared memory
> > segment, thus multiple workers eventually co-operate to scan a table
> > using round-robin manner even though individual worker fetches comb-
> > shaped blocks.
> > If we assume individual blocks are individual sub-plans on the parallel
> > aware Append, it performs very similar. A certain number of workers
> > (more than zero) is launched by Gather node, then the parallel aware
> > Append node fetches one of the sub-plans if any.
> 
> Exactly, except for the part about the blocks being "comb-shaped",
> which doesn't seem to make sense.
> 
> > I think, this design also gives additional flexibility according to
> > the required parallelism by the underlying sub-plans.
> > Please assume the "PartialSeqScan on p2" in the above example wants
> > 3 workers to process the scan, we can expand the virtual array of
> > the sub-plans as follows. Then, if Gather node kicks 5 workers,
> > individual workers are assigned on some of plans. If Gather node
> > could kick less than 5 workers, the first exit worker picks the
> > second sub-plan, then it eventually provides the best parallelism.
> >
> > ++
> > |sub-plan |   * Sub-Plan 1 ... Index Scan on p1
> > |index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
> > |shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
> > |memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
> > +-+   * Sub-Plan 3 ... Index Scan on p3
> 
> I don't think the shared memory chunk should be indexed by worker, but
> by sub-plan.  So with 3 subplans, we would initially have [0,0,0].
> The first worker would grab the first subplan, and we get [1,0,0].
> The second worker grabs the third subplan, so now we have [1,0,1].
> The remaining workers can't join the execution of those plans because
> they are not truly parallel, so they all gang up on the second
> subplan.  At 5 workers we have [1,3,1].  Workers need not ever
> decrement the array entries because they only pick a new sub-plan when
> the one they picked previously is exhausted; thus, at the end of the
> plan, each element in the array shows the total number of workers that
> touched it at some point during its execution.
>
Sorry, I could not get the point of the above explanation.
The 1st worker would grab the first subplan, then [1,0,0]. That's OK.
The 2nd worker would grab the last subplan, then [1,0,1]. That's
understandable, even though I'm uncertain why it does not pick up
the 2nd one.
Why do the remaining workers have to gang up on the 2nd one
(PartialSeqScan; that is, the parallel-aware execution node)?

Even if only one worker is kicked towards the PartialSeqScan, it tries
to scan the relation sequentially (because there are no other parallel
workers yet). Then, once another worker is additionally launched to scan
the same relation, both of the workers begin to cooperate using a common
block index kept in shared memory.
So, do we need to wait for the completion of the non-parallel-aware
nodes here?

I assume that it is better to launch the PartialSeqScan, even with one
worker, than to synchronize, because other workers can join the execution
later.

> > For more generic plan construction, Plan node may have a field for
> > number of "desirable" workers even though most of plan-nodes are not
> > parallel aware, and it is not guaranteed.
> > In above case, the parallel aware Append will want 5 workers in total
> > (2 by 2 index-scans, plus 3 by a partial-seq-scan). It is a discretion
> > of Gather node how many workers are actually launched, however, it
> > will give clear information how many workers are best.
> 
> Yeah, maybe.  I haven't thought about this deeply just yet, but I
> agree it needs more consideration.
>
OK, I'll build a patch including the concept.

> >> First, we can teach Append that, when running in parallel,
> >> it should initialize a chunk of dynamic shared memory with an array
> >> indicating how many workers are currently working on each subplan.
> > Can the parallel-aware Append can recognize the current mode using
> > MyBg

Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-25 Thread Kouhei Kaigai
> On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai  wrote:
> > I'm recently working/investigating on ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > --
> > Its concept is quite simple anybody might consider more than once.
> > ParallelAppend node kicks background worker process to execute
> > child nodes in parallel / asynchronous.
> > It intends to improve the performance to scan a large partitioned
> > tables from standpoint of entire throughput, however, latency of
> > the first multi-hundred rows are not scope of this project.
> > From standpoint of technology trend, it primarily tries to utilize
> > multi-cores capability within a system, but also enables to expand
> > distributed database environment using foreign-tables inheritance
> > features.
> > Its behavior is very similar to Funnel node except for several
> > points, thus, we can reuse its infrastructure we have had long-
> > standing discussion through the v9.5 development cycle.
> 
> Now that I've got more of the parallel infrastructure committed, I've
> been starting to have a little time to think about what we might want
> to do after we get PartialSeqScan committed.  I'm positive on doing
> something with the Append node and parallelism, but I'm negative on
> doing what you've described here.
> 
> I don't think the Append node has any business launching workers.
> That's the job of Gather.  Append may need to do something
> parallel-aware, but starting workers is not that thing.  Without
> making any changes at all to Append, we can use it like this:
> 
> Gather
> -> Append
>   -> Partial Seq Scan on p1
>   -> Partial Seq Scan on p2
>   -> Partial Seq Scan on p3
> 
> The Gather node will spin up workers and in each worker we will ask
> the Append nodes for tuples.  Append will ask the first
> not-yet-completed child for tuples, so the workers will cooperatively
> scan first p1, then p2, then p3.  This is great: instead of merely
> doing a parallel seq scan of a single table, we can do a parallel seq
> scan of a partitioned table.  However, there are two improvements we
> can make.  First, we can teach Append that, when running in parallel,
> it should initialize a chunk of dynamic shared memory with an array
> indicating how many workers are currently working on each subplan.
> Each new worker should join a subplan with the minimum number of
> workers, work on that one until it's completely, and then pick a new
> subplan.  This minimizes contention.  Second, we can teach the Append
> to serve as a parent not only for intrinsically parallel nodes like
> Partial Seq Scan, but also for other nodes, say, an Index Scan.  When
> an Append is running in parallel but with non-parallel-aware children,
> each such child can be claimed by at most one worker process and will
> be run to completion by that worker process.  For example:
> 
> Gather
> -> Append
>   -> Index Scan on p1
>   -> Partial Seq Scan on p2
>   -> Index Scan on p3
> 
> The first worker which executes the Append should begin the index scan
> on p1 and the second should begin the index scan on p3.  The remaining
> workers, and those two once they're finished, can work on p2.
> 
> Proceeding in this way, I don't think we need a separate Parallel
> Append node.  Rather, we can just give the existing Append node some
> extra smarts that are used only when it's running in parallel.
>
I entirely agree with your suggestion.

We may be able to use an analogy between PartialSeqScan and the
parallel-aware Append node.
PartialSeqScan fetches blocks pointed to by an index in a shared memory
segment, so multiple workers eventually co-operate to scan a table
in a round-robin manner even though each individual worker fetches
comb-shaped blocks.
If we assume the individual blocks are individual sub-plans of the
parallel-aware Append, it performs very similarly. A certain number of
workers (more than zero) is launched by the Gather node, then the
parallel-aware Append node fetches one of the sub-plans, if any.

I think this design also gives additional flexibility according to
the parallelism required by the underlying sub-plans.
Assume the "PartialSeqScan on p2" in the above example wants
3 workers to process the scan; we can expand the virtual array of
the sub-plans as follows. Then, if the Gather node kicks off 5 workers,
individual workers are assigned to some of the plans. If the Gather node
can kick off fewer than 5 workers, the first worker to finish its
sub-plan picks up the second sub-plan, so it eventually provides the
best parallelism.

+---------+
|sub-plan |   * Sub-Plan 1 ... Index Scan on p1
|index on *-> * Sub-Plan 2 ... PartialSeqScan on p2
|shared   |   * Sub-Plan 2 ... PartialSeqScan on p2
|memory   |   * Sub-Plan 2 ... PartialSeqScan on p2
+---------+   * Sub-Plan 3 ... Index Scan on p3
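
For illustration only, building such an expanded slot array might look like
this (the function and parameter names are mine, not from any posted patch;
num_workers[i] is the parallelism each sub-plan wants):

static int *
build_subplan_slots(int nplans, const int *num_workers, int *nslots_out)
{
    int     nslots = 0;
    int    *slots;
    int     i, j, k = 0;

    for (i = 0; i < nplans; i++)
        nslots += Max(num_workers[i], 1);   /* non-parallel plans get 1 slot */

    slots = (int *) palloc(sizeof(int) * nslots);
    for (i = 0; i < nplans; i++)
    {
        for (j = 0; j < Max(num_workers[i], 1); j++)
            slots[k++] = i;     /* each slot points back to sub-plan i */
    }

    *nslots_out = nslots;
    return slots;
}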

It does not matter even if the Append node has multiple parallel-aware
sub-plans. When a "PartialSeqScan on p4" is added, all we need 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-10-23 Thread Robert Haas
On Sat, Jul 25, 2015 at 11:13 PM, Kouhei Kaigai  wrote:
> I'm recently working/investigating on ParallelAppend feature
> towards the next commit fest. Below is my design proposal.
>
> 1. Concept
> --
> Its concept is quite simple anybody might consider more than once.
> ParallelAppend node kicks background worker process to execute
> child nodes in parallel / asynchronous.
> It intends to improve the performance to scan a large partitioned
> tables from standpoint of entire throughput, however, latency of
> the first multi-hundred rows are not scope of this project.
> From standpoint of technology trend, it primarily tries to utilize
> multi-cores capability within a system, but also enables to expand
> distributed database environment using foreign-tables inheritance
> features.
> Its behavior is very similar to Funnel node except for several
> points, thus, we can reuse its infrastructure we have had long-
> standing discussion through the v9.5 development cycle.

Now that I've got more of the parallel infrastructure committed, I've
been starting to have a little time to think about what we might want
to do after we get PartialSeqScan committed.  I'm positive on doing
something with the Append node and parallelism, but I'm negative on
doing what you've described here.

I don't think the Append node has any business launching workers.
That's the job of Gather.  Append may need to do something
parallel-aware, but starting workers is not that thing.  Without
making any changes at all to Append, we can use it like this:

Gather
-> Append
  -> Partial Seq Scan on p1
  -> Partial Seq Scan on p2
  -> Partial Seq Scan on p3

The Gather node will spin up workers and in each worker we will ask
the Append nodes for tuples.  Append will ask the first
not-yet-completed child for tuples, so the workers will cooperatively
scan first p1, then p2, then p3.  This is great: instead of merely
doing a parallel seq scan of a single table, we can do a parallel seq
scan of a partitioned table.  However, there are two improvements we
can make.  First, we can teach Append that, when running in parallel,
it should initialize a chunk of dynamic shared memory with an array
indicating how many workers are currently working on each subplan.
Each new worker should join a subplan with the minimum number of
workers, work on that one until it's complete, and then pick a new
subplan.  This minimizes contention.  Second, we can teach the Append
to serve as a parent not only for intrinsically parallel nodes like
Partial Seq Scan, but also for other nodes, say, an Index Scan.  When
an Append is running in parallel but with non-parallel-aware children,
each such child can be claimed by at most one worker process and will
be run to completion by that worker process.  For example:

Gather
-> Append
  -> Index Scan on p1
  -> Partial Seq Scan on p2
  -> Index Scan on p3

The first worker which executes the Append should begin the index scan
on p1 and the second should begin the index scan on p3.  The remaining
workers, and those two once they're finished, can work on p2.

Proceeding in this way, I don't think we need a separate Parallel
Append node.  Rather, we can just give the existing Append node some
extra smarts that are used only when it's running in parallel.

We can also push other things in between the Gather and the Append,
which wouldn't be possible in your design.  For example, consider a
join between a partitioned table p and an unpartitioned table q.  We
could do this:

Gather
  -> Nested Loop
-> Append
  -> Index Scan on p1
  -> Partial Seq Scan on p2
  -> Index Scan on p3
  -> Index Scan on q
  Index Cond q.x = p.x

That's a pretty sweet plan.  Assuming p1, p2, and p3 are all
reasonably large, we could probably benefit from throwing at least 3
processes at this plan tree - maybe more, if p2 is really big.  Only
one process can work on each of p1 and p3, but p2, since it has a
truly parallel plan, can soak up as many as we want to throw at it
(although performance may top out at some point if we're I/O-bound).

Sorry for taking so long to give a really substantive reply on this,
but it wasn't until this last week or so that I really had time to
think about this in detail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-24 Thread Kouhei Kaigai
 On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas robertmh...@gmail.com wrote:
 
  On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com 
  wrote:
   Here is one other issue I found. Existing code assumes a TOC segment has
   only one contents per node type, so it uses pre-defined key (like
   PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
   multiple PlannedStmt or PartialSeqScan node on a TOC segment.
  
   We have few keys in parallel-seq-scan patch
   (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
   which multiple structures are shared between master and worker backends.
  
   Check if something similar can work for your use case.
 
  I think you are possibly missing the point.
 
 It could be possible, but let me summarize what I thought would be required
 for above use case.  For Parallel Append, we need to push multiple
 planned statements in contrast to one planned statement as is done for
 current patch and then one or more parallel workers needs to work on each
 planned statement. So if we know in advance how many planned statements
 are we passing down (which we should), then using ParallelWorkerNumber
 (ParallelWorkerNumber % num_planned_statements or some other similar
 way), workers can find the the planned statement on which they need to work
 and similarly information for PartialSeqScan (which currently is parallel heap
 scan descriptor information).

My problem is that we have no identifier to point to a particular element
on the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can
have multiple items.
Please assume a situation where ExecPartialSeqScan() has to look up
a particular item on the TOC but multiple PartialSeqScan nodes exist.

Currently, it does:
pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

However, ExecPartialSeqScan() cannot know which index is its own, and
it is not reasonable to pay attention to other nodes at this level.
Even if PARALLEL_KEY_SCAN has multiple items, the PartialSeqScan node still
needs to have an identifier.

   I think KaiGai's correct,
  and I pointed out the same problem to you before.  The parallel key
  for the Partial Seq Scan needs to be allocated on the fly and carried
  in the node, or we'll never be able to push multiple things below the
  funnel.
 
 Okay, immediately I don't see what is the best way to achieve this but
 let us discuss this separately on Parallel Seq Scan thread and let me
 know if you have something specific in your mind.  I will also give this
 a more thought.

I want to have a 'node_id' in the Plan node, so a unique identifier is
assigned to the field prior to serialization. It is a property of the
Plan node, so we can reproduce this identifier on the background worker
side using stringToNode(); then ExecPartialSeqScan can pull out the proper
field from the TOC segment by this node_id.
Probably, this structure can coexist with the current one without big changes.

1. Define PARALLEL_KEY_DYNAMIC_LEAST as the least value that is larger
   than any static TOC key (like PARALLEL_KEY_TUPLE_QUEUE).
2. Run a plan-tree node walker in InitializeParallelWorkers, just before
   nodeToString(), to assign each node a node_id larger than the above
   label, increasing for each node.
3. Use node_id instead of the static PARALLEL_KEY_SCAN in
   ExecPartialSeqScan.

Even though some more trivial fixes are needed, it seems to me the above
approach should work; a rough sketch follows.
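
A rough sketch of steps 1 and 2 (plan_tree_walker, the node_id field, and
the key value are assumptions for illustration, not existing APIs):

#define PARALLEL_KEY_DYNAMIC_LEAST      UINT64CONST(0x10000)

static bool
assign_plan_node_id(Plan *plan, void *context)
{
    uint64 *next_id = (uint64 *) context;

    if (plan == NULL)
        return false;
    plan->node_id = (*next_id)++;   /* 'node_id' is the proposed new field */
    /* plan_tree_walker is hypothetical; recurse into the child plan nodes */
    return plan_tree_walker(plan, assign_plan_node_id, context);
}

On the worker side, ExecPartialSeqScan would then do
pscan = shm_toc_lookup(node->ss.ps.toc, node->ss.ps.plan->node_id)
instead of using the static key.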
Also, please note that I don't assume only PartialSeqScan wants to
have its own field on the TOC segment; some CustomScan nodes also want
to have their own shared field when co-working under a Funnel node.


On the other hand, I think it is too aggressive to try to complete the
initial work of this patch by the starting day of the next commit
fest, so I think the target is the middle of October.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-24 Thread Amit Kapila
On Tue, Aug 25, 2015 at 6:19 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas robertmh...@gmail.com
wrote:
 
  It could be possible, but let me summarize what I thought would be
required
  for above use case.  For Parallel Append, we need to push multiple
  planned statements in contrast to one planned statement as is done for
  current patch and then one or more parallel workers needs to work on
each
  planned statement. So if we know in advance how many planned statements
  are we passing down (which we should), then using ParallelWorkerNumber
  (ParallelWorkerNumber % num_planned_statements or some other similar
  way), workers can find the the planned statement on which they need to
work
  and similarly information for PartialSeqScan (which currently is
parallel heap
  scan descriptor information).
 
 My problem is that we have no identifier to point a particular element on
 the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can
 have multiple items.
 Please assume a situation when ExecPartialSeqScan() has to lookup
 a particular item on TOC but multiple PartialSeqScan nodes can exist.

 Currently, it does:
 pscan = shm_toc_lookup(node-ss.ps.toc, PARALLEL_KEY_SCAN);

 However, ExecPartialSeqScan() cannot know which is the index of mine,
 or it is not reasonable to pay attention on other node in this level.
 Even if PARALLEL_KEY_SCAN has multiple items, PartialSeqScan node also
 needs to have identifier.


Yes, that's right, and I think we can find out the same.  Basically, we need
to know the planned statement number on which the current worker is working,
and we have to do that anyway before the worker can start the work.  One way,
as I have explained above, is to use ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements) to find it, or we might need
some more sophisticated way to find it out, but definitely we need to know
that before the worker starts execution, and once we know that we can use it
to find the PARALLEL_KEY_SCAN or whatever key applies for this worker (as the
position of PARALLEL_KEY_SCAN will be the same as that of the planned stmt
for a worker).
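
Just as an illustration of that arithmetic (num_planned_stmts and the
positional key layout are assumptions, not code from the patch;
ParallelWorkerNumber is the existing worker-side global):

static int
my_planned_stmt_index(int num_planned_stmts)
{
    /* ParallelWorkerNumber is 0-based within the current parallel context */
    return ParallelWorkerNumber % num_planned_stmts;
}

The worker would then use the returned index to locate the matching
PARALLEL_KEY_SCAN-style entry, laid out in the same order as the planned
statements.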


I think KaiGai's correct,
   and I pointed out the same problem to you before.  The parallel key
   for the Partial Seq Scan needs to be allocated on the fly and carried
   in the node, or we'll never be able to push multiple things below the
   funnel.
 
  Okay, immediately I don't see what is the best way to achieve this but
  let us discuss this separately on Parallel Seq Scan thread and let me
  know if you have something specific in your mind.  I will also give this
  a more thought.
 
 I want to have 'node_id' in the Plan node, then unique identifier is
 assigned on the field prior to serialization. It is a property of the
 Plan node, so we can reproduce this identifier on the background worker
 side using stringToNode(), then ExecPartialSeqScan can pull out a proper
 field from the TOC segment by this node_id.


Okay, this can also work, but why introduce an identifier in the plan node
if it can work without it?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-22 Thread Amit Kapila
On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  Here is one other issue I found. Existing code assumes a TOC segment
has
  only one contents per node type, so it uses pre-defined key (like
  PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
  multiple PlannedStmt or PartialSeqScan node on a TOC segment.
 
  We have few keys in parallel-seq-scan patch
  (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
  which multiple structures are shared between master and worker backends.
 
  Check if something similar can work for your use case.

 I think you are possibly missing the point.

It could be possible, but let me summarize what I thought would be required
for the above use case.  For Parallel Append, we need to push multiple
planned statements, in contrast to the one planned statement pushed by the
current patch, and then one or more parallel workers need to work on each
planned statement. So if we know in advance how many planned statements
we are passing down (which we should), then using ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements or some other similar
way), workers can find the planned statement on which they need to work,
and similarly the information for PartialSeqScan (which currently is the
parallel heap scan descriptor information).


  I think KaiGai's correct,
 and I pointed out the same problem to you before.  The parallel key
 for the Partial Seq Scan needs to be allocated on the fly and carried
 in the node, or we'll never be able to push multiple things below the
 funnel.

Okay, immediately I don't see the best way to achieve this, but
let us discuss this separately on the Parallel Seq Scan thread and let me
know if you have something specific in mind.  I will also give this
some more thought.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-21 Thread Robert Haas
On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 Here is one other issue I found. Existing code assumes a TOC segment has
 only one contents per node type, so it uses pre-defined key (like
 PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
 multiple PlannedStmt or PartialSeqScan node on a TOC segment.

 We have few keys in parallel-seq-scan patch
 (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
 which multiple structures are shared between master and worker backends.

 Check if something similar can work for your use case.

I think you are possibly missing the point.  I think KaiGai's correct,
and I pointed out the same problem to you before.  The parallel key
for the Partial Seq Scan needs to be allocated on the fly and carried
in the node, or we'll never be able to push multiple things below the
funnel.  I'm not quite sure what you're trying to explain with this
response.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-18 Thread Amit Kapila
On Thu, Aug 13, 2015 at 5:26 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com
 wrote:
  
 
  Sure, that is what we should do, however the tricky part would be when
  the path for doing local scan is extremely cheaper than path for parallel
  scan for one of the child nodes.  For such cases, pulling up Funnel-node
  can incur more cost.  I think some of the other possible ways to make
 this
  work could be to extend Funnel so that it is capable of executing both
 parallel
  and non-parallel nodes, have a new Funnel like node which has such a
  capability.
 
 I think it is job of (more intelligent) planner but not in the first
 version. If subplans of Append are mixture of nodes which has or does
 not have worth of parallel execution, we will be able to arrange the
 original form:

   Append
+ Scan on rel1 (large)
+ Scan on rel2 (large)
+ Scan on rel3 (middle)
+ Scan on rel4 (tiny)
+ Scan on rel5 (tiny)

 to Funnel aware form, but partially:

   Append
+ Funnel
|  + Scan on rel1 (large)
|  + Scan on rel2 (large)
|  + Scan on rel3 (large)
+ Scan on rel4 (tiny)
+ Scan on rel5 (tiny)


This is exactly what I have in mind.



 Here is one other issue I found. Existing code assumes a TOC segment has
 only one contents per node type, so it uses pre-defined key (like
 PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
 multiple PlannedStmt or PartialSeqScan node on a TOC segment.


We have a few keys in the parallel-seq-scan patch
(PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
which multiple structures are shared between master and worker backends.

Check if something similar can work for your use case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-13 Thread Kouhei Kaigai
 On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
   On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com 
   wrote:
   
 
  If we pull Funnel here, I think the plan shall be as follows:
Funnel
 -- SeqScan on rel1
 -- PartialSeqScan on rel2
 -- IndexScan on rel3
 

 So if we go this route, then Funnel should have capability
 to execute non-parallel part of plan as well, like in this
 case it should be able to execute non-parallel IndexScan on
 rel3 as well and then it might need to distinguish between
 parallel and non-parallel part of plans.  I think this could
 make Funnel node complex.

It is difference from what I plan now. In the above example,
Funnel node has two non-parallel aware node (rel1 and rel3)
and one parallel aware node, then three PlannedStmt for each
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 and rel3, however, rel2
can be executed by multiple workers simultaneously.
  
   Okay, now I got your point, but I think the cost of execution
   of non-parallel node by additional worker is not small considering
   the communication cost and setting up an addional worker for
   each sub-plan (assume the case where out of 100-child nodes
   only few (2 or 3) nodes actually need multiple workers).
  
  It is a competition between traditional Append that takes Funnel
  children and the new appendable Funnel that takes parallel and
  non-parallel children. Probably, key factors are cpu_tuple_comm_cost,
  parallel_setup_cost and degree of selectivity of sub-plans.
  Both cases has advantage and disadvantage depending on the query,
  so we can never determine which is better without path consideration.
 
 Sure, that is what we should do, however the tricky part would be when
 the path for doing local scan is extremely cheaper than path for parallel
 scan for one of the child nodes.  For such cases, pulling up Funnel-node
 can incur more cost.  I think some of the other possible ways to make this
 work could be to extend Funnel so that it is capable of executing both 
 parallel
 and non-parallel nodes, have a new Funnel like node which has such a
 capability.

I think it is the job of a (more intelligent) planner, but not in the first
version. If the subplans of an Append are a mixture of nodes which are or
are not worth executing in parallel, we will be able to rearrange the
original form:

  Append
   + Scan on rel1 (large)
   + Scan on rel2 (large)
   + Scan on rel3 (middle)
   + Scan on rel4 (tiny)
   + Scan on rel5 (tiny)

to a Funnel-aware form, but only partially:

  Append
   + Funnel
   |  + Scan on rel1 (large)
   |  + Scan on rel2 (large)
   |  + Scan on rel3 (large)  
   + Scan on rel4 (tiny)
   + Scan on rel5 (tiny)

It does not require special functionality of Append/Funnel beyond
what we have discussed, as long as the planner is intelligent enough.
One downside of this approach is that the plan tree tends to become more
complicated, which also makes the logic to push down joins more complicated.


Here is one other issue I found. The existing code assumes a TOC segment has
only one entry per node type, so it uses a pre-defined key (like
PARALLEL_KEY_SCAN) per node type; however, this is problematic if we put
multiple PlannedStmt or PartialSeqScan nodes on a TOC segment.
My idea is an enhancement of the Plan node to have a unique identifier within
a particular plan tree. Once a unique identifier is assigned, we can
put individual information on the TOC segment, even if multiple
PartialSeqScan nodes are packed.
Did we have a discussion about this topic in the past?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-07 Thread Amit Kapila
On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com
wrote:
  

 If we pull Funnel here, I think the plan shall be as follows:
   Funnel
-- SeqScan on rel1
-- PartialSeqScan on rel2
-- IndexScan on rel3

   
So if we go this route, then Funnel should have capability
to execute non-parallel part of plan as well, like in this
case it should be able to execute non-parallel IndexScan on
rel3 as well and then it might need to distinguish between
parallel and non-parallel part of plans.  I think this could
make Funnel node complex.
   
   It is difference from what I plan now. In the above example,
   Funnel node has two non-parallel aware node (rel1 and rel3)
   and one parallel aware node, then three PlannedStmt for each
   shall be put on the TOC segment. Even though the background
   workers pick up a PlannedStmt from the three, only one worker
   can pick up the PlannedStmt for rel1 and rel3, however, rel2
   can be executed by multiple workers simultaneously.
 
  Okay, now I got your point, but I think the cost of execution
  of non-parallel node by additional worker is not small considering
  the communication cost and setting up an addional worker for
  each sub-plan (assume the case where out of 100-child nodes
  only few (2 or 3) nodes actually need multiple workers).
 
 It is a competition between traditional Append that takes Funnel
 children and the new appendable Funnel that takes parallel and
 non-parallel children. Probably, key factors are cpu_tuple_comm_cost,
 parallel_setup_cost and degree of selectivity of sub-plans.
 Both cases has advantage and disadvantage depending on the query,
 so we can never determine which is better without path consideration.

Sure, that is what we should do; however, the tricky part would be when
the path for doing a local scan is much cheaper than the path for a parallel
scan for one of the child nodes.  For such cases, pulling up the Funnel node
can incur more cost.  I think some of the other possible ways to make this
work could be to extend Funnel so that it is capable of executing both
parallel and non-parallel nodes, or to have a new Funnel-like node which has
such a capability.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-07 Thread Kouhei Kaigai
 On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
   On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com 
   wrote:
  
   I am not sure, but what problem do you see in putting Funnel node
   for one of the relation scans and not for the others.
  
  At this moment, I'm not certain whether background worker can/ought
  to launch another background workers.
  If sub-Funnel node is executed by 10-processes then it also launch
  10-processes for each, 100-processes will run for each?
 
 
 Yes, that could be more work than current, but what I had in mind
 is not that way, rather I was thinking that master backend will only
 kick of workers for Funnel nodes in plan.

I agree; it is a fair enough approach, so I'll mention the pull-up
of the Funnel node.

If we pull Funnel here, I think the plan shall be as follows:
  Funnel
   -- SeqScan on rel1
   -- PartialSeqScan on rel2
   -- IndexScan on rel3
   
  
   So if we go this route, then Funnel should have capability
   to execute non-parallel part of plan as well, like in this
   case it should be able to execute non-parallel IndexScan on
   rel3 as well and then it might need to distinguish between
   parallel and non-parallel part of plans.  I think this could
   make Funnel node complex.
  
  It is difference from what I plan now. In the above example,
  Funnel node has two non-parallel aware node (rel1 and rel3)
  and one parallel aware node, then three PlannedStmt for each
  shall be put on the TOC segment. Even though the background
  workers pick up a PlannedStmt from the three, only one worker
  can pick up the PlannedStmt for rel1 and rel3, however, rel2
  can be executed by multiple workers simultaneously.
 
 Okay, now I got your point, but I think the cost of execution
 of non-parallel node by additional worker is not small considering
 the communication cost and setting up an addional worker for
 each sub-plan (assume the case where out of 100-child nodes
 only few (2 or 3) nodes actually need multiple workers).

It is a competition between the traditional Append that takes Funnel
children and the new appendable Funnel that takes parallel and
non-parallel children. Probably, the key factors are cpu_tuple_comm_cost,
parallel_setup_cost and the degree of selectivity of the sub-plans.
Both cases have advantages and disadvantages depending on the query,
so we can never determine which is better without path consideration.

   I think for a particular PlannedStmt, number of workers must have
   been decided before start of execution, so if those many workers are
   available to work on that particular PlannedStmt, then next/new
   worker should work on next PlannedStmt.
  
  My concern about pre-determined number of workers is, it depends on the
  run-time circumstances of concurrent sessions. Even if planner wants to
  assign 10-workers on a particular sub-plan, only 4-workers may be
  available on the run-time because of consumption by side sessions.
  So, I expect only maximum number of workers is meaningful configuration.
 
 
 In that case, there is possibility that many of the workers are just
 working on one or two of the nodes and other nodes execution might
 get starved.  I understand this is tricky problem to allocate number
 of workers for different nodes, however we should try to develop any
 algorithm where there is some degree of fairness in allocation of workers
 for different nodes.

I'd like to agree; however, I also want to keep the first version as
simple as we can. We can develop alternative logic to assign a
suitable number of workers later.

   2. Execution of work by workers and Funnel node and then pass
   the results back to upper node.  I think this needs some more
   work in addition to ParallelSeqScan patch.
  
  I expect we can utilize existing infrastructure here. It just picks
  up the records come from the underlying workers, then raise it to
  the upper node.
 
 
 
 Sure, but still you need some work atleast in the area of making
 workers understand different node types, I am guessing you need
 to modify readfuncs.c to support new plan node if any for this
 work.
 
Yes, it was not a creative work. :-)
https://github.com/kaigai/sepgsql/blob/fappend/src/backend/nodes/readfuncs.c#L1479

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-05 Thread Amit Kapila
On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com
wrote:
 
  I am not sure, but what problem do you see in putting Funnel node
  for one of the relation scans and not for the others.
 
 At this moment, I'm not certain whether background worker can/ought
 to launch another background workers.
 If sub-Funnel node is executed by 10-processes then it also launch
 10-processes for each, 100-processes will run for each?


Yes, that could be more work than now, but what I had in mind
is not that way; rather, I was thinking that the master backend will only
kick off workers for Funnel nodes in the plan.

   If we pull Funnel here, I think the plan shall be as follows:
 Funnel
  -- SeqScan on rel1
  -- PartialSeqScan on rel2
  -- IndexScan on rel3
  
 
  So if we go this route, then Funnel should have capability
  to execute non-parallel part of plan as well, like in this
  case it should be able to execute non-parallel IndexScan on
  rel3 as well and then it might need to distinguish between
  parallel and non-parallel part of plans.  I think this could
  make Funnel node complex.
 
 It is difference from what I plan now. In the above example,
 Funnel node has two non-parallel aware node (rel1 and rel3)
 and one parallel aware node, then three PlannedStmt for each
 shall be put on the TOC segment. Even though the background
 workers pick up a PlannedStmt from the three, only one worker
 can pick up the PlannedStmt for rel1 and rel3, however, rel2
 can be executed by multiple workers simultaneously.

Okay, now I got your point, but I think the cost of execution
of a non-parallel node by an additional worker is not small considering
the communication cost and setting up an additional worker for
each sub-plan (assume the case where out of 100 child nodes
only a few (2 or 3) nodes actually need multiple workers).

 
  I think for a particular PlannedStmt, number of workers must have
  been decided before start of execution, so if those many workers are
  available to work on that particular PlannedStmt, then next/new
  worker should work on next PlannedStmt.
 
 My concern about a pre-determined number of workers is that it depends on
 the run-time circumstances of concurrent sessions. Even if the planner wants
 to assign 10 workers to a particular sub-plan, only 4 workers may be
 available at run time because of consumption by other sessions.
 So, I expect only the maximum number of workers is a meaningful configuration.


In that case, there is a possibility that many of the workers are just
working on one or two of the nodes while the execution of other nodes
gets starved.  I understand that allocating the number of workers for
different nodes is a tricky problem; however, we should try to develop an
algorithm with some degree of fairness in the allocation of workers
across the different nodes.


  2. Execution of work by workers and Funnel node and then pass
  the results back to upper node.  I think this needs some more
  work in addition to ParallelSeqScan patch.
 
 I expect we can utilize existing infrastructure here. It just picks
 up the records come from the underlying workers, then raise it to
 the upper node.


Sure, but you still need some work at least in the area of making
workers understand different node types; I am guessing you need
to modify readfuncs.c to support any new plan node for this
work.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-08-01 Thread Kouhei Kaigai
 On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
   On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com 
   wrote:
   
 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
 Sent: Monday, July 27, 2015 11:07 PM
 To: Amit Kapila
 
  Is there a real need to have new node like ParallelAppendPath?
  Can't we have Funnel node beneath AppendNode and then each
  worker will be responsible to have SeqScan on each inherited child
  relation.  Something like
 
  Append
 --- Funnel
-- SeqScan rel1
-- SeqScan rel2
 
 If Funnel can handle both of horizontal and vertical parallelism,
 it is a great simplification. I never stick a new node.

 Once Funnel get a capability to have multiple child nodes, probably,
 Append node above will have gone. I expect set_append_rel_pathlist()
 add two paths based on Append and Funnel, then planner will choose
 the cheaper one according to its cost.

In the latest v16 patch, Funnel is declared as follows:
   
  typedef struct Funnel
  {
  Scanscan;
  int num_workers;
  } Funnel;
   
If we try to add Append capability here, I expects the structure will
be adjusted as follows, for example:
   
  typedef struct Funnel
  {
  Scanscan;
  List   *funnel_plans;
  List   *funnel_num_workers;
  } Funnel;
   
As literal, funnel_plans saves underlying Plan nodes instead of the
lefttree. Also, funnel_num_workers saves number of expected workers
to be assigned on individual child plans.
   
  
   or shall we have a node like above and name it as FunnelAppend or
   AppenFunnel?
  
  It is better to have smaller number of node types which are capable to
  kick background workers because of simplification of path construction.
 
  Let's assume the case below. When planner considers a path to append
  child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is
  Funnel+PartialSeqScan, we cannot put Funnel here unless we don't pull
  up Funnel of rel2, can we?
 
(Append? or Funnel)
 -- SeqScan on rel1
 -- Funnel
  -- PartialSeqScan on rel2
 -- IndexScan on rel3
 
 
 I am not sure, but what problem do you see in putting Funnel node
 for one of the relation scans and not for the others.

At this moment, I'm not certain whether a background worker can or ought
to launch other background workers.
If a sub-Funnel node is executed by 10 processes and each of them also
launches 10 processes, will 100 processes end up running?

  If we pull Funnel here, I think the plan shall be as follows:
Funnel
 -- SeqScan on rel1
 -- PartialSeqScan on rel2
 -- IndexScan on rel3
 
 
 So if we go this route, then Funnel should have capability
 to execute non-parallel part of plan as well, like in this
 case it should be able to execute non-parallel IndexScan on
 rel3 as well and then it might need to distinguish between
 parallel and non-parallel part of plans.  I think this could
 make Funnel node complex.

That is different from what I plan now. In the above example, the
Funnel node has two non-parallel-aware nodes (rel1 and rel3)
and one parallel-aware node, so three PlannedStmts, one for each,
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 or rel3; however, rel2
can be executed by multiple workers simultaneously.
(Note: if the number of workers is less than three in this case, the
PlannedStmt for rel3 shall not be picked up until some other
worker completes its plan on rel1 or rel2.)

From the standpoint of the Funnel node, it just kicks off background
workers with:
 - multiple PlannedStmt nodes
 - a maximum number of workers for each plan
in addition to the current form.

Then, it continues to fetch records from the shm_mq.
Probably, it does not change the current form very much.
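
As a rough sketch of that idea (assumptions only, not the v16 code; both
PARALLEL_KEY_SUBPLAN() and store_subplans() are made-up names), the leader
could serialize each sub-plan and store it under its own TOC key so that any
worker can claim one later:

    #include "postgres.h"
    #include "nodes/nodes.h"
    #include "nodes/pg_list.h"
    #include "storage/shm_toc.h"

    /* hypothetical key scheme: one TOC entry per sub-plan */
    #define PARALLEL_KEY_SUBPLAN(i)    (UINT64CONST(0x10000) + (i))

    static void
    store_subplans(shm_toc *toc, List *subplans)
    {
        int         i = 0;
        ListCell   *lc;

        foreach(lc, subplans)
        {
            /* serialize the sub-plan and copy it into the DSM segment */
            char   *str = nodeToString(lfirst(lc));
            char   *dst = shm_toc_allocate(toc, strlen(str) + 1);

            strcpy(dst, str);
            shm_toc_insert(toc, PARALLEL_KEY_SUBPLAN(i), dst);
            i++;
        }
    }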

  If all we have to pay attention is Funnel node, it makes the code
  around path construction and pull-up logic much simpler, rather than
  multiple node types can kick background workers.
 
 
 Okay, but I think pulling-up Funnel node makes sense only when all
 nodes beneath it needs to be executed parallely.

I think that decision should be based on the cost, which includes the
additional startup_cost to launch background workers, as long as the
non-parallel node is also capable of running on the worker side.

Even though create_parallelscan_paths() in v16 sets num_workers no
larger than parallel_seqscan_degree, the total number of concurrent
background workers may exceed this configuration if more than two
PartialSeqScan nodes are underlying.
It is a different configuration from max_worker_processes, so it is
not a problem as long as we

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-31 Thread Amit Kapila
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com
wrote:
  
-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei
Kaigai
Sent: Monday, July 27, 2015 11:07 PM
To: Amit Kapila

 Is there a real need to have new node like ParallelAppendPath?
 Can't we have Funnel node beneath AppendNode and then each
 worker will be responsible to have SeqScan on each inherited child
 relation.  Something like

 Append
--- Funnel
   -- SeqScan rel1
   -- SeqScan rel2

If Funnel can handle both of horizontal and vertical parallelism,
it is a great simplification. I never stick a new node.
   
Once Funnel get a capability to have multiple child nodes, probably,
Append node above will have gone. I expect set_append_rel_pathlist()
add two paths based on Append and Funnel, then planner will choose
the cheaper one according to its cost.
   
   In the latest v16 patch, Funnel is declared as follows:
  
 typedef struct Funnel
 {
 Scanscan;
 int num_workers;
 } Funnel;
  
   If we try to add Append capability here, I expects the structure will
   be adjusted as follows, for example:
  
 typedef struct Funnel
 {
 Scanscan;
 List   *funnel_plans;
 List   *funnel_num_workers;
 } Funnel;
  
   As literal, funnel_plans saves underlying Plan nodes instead of the
   lefttree. Also, funnel_num_workers saves number of expected workers
   to be assigned on individual child plans.
  
 
  or shall we have a node like above and name it as FunnelAppend or
  AppenFunnel?
 
 It is better to have smaller number of node types which are capable to
 kick background workers because of simplification of path construction.

 Let's assume the case below. When planner considers a path to append
 child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is
 Funnel+PartialSeqScan, we cannot put Funnel here unless we don't pull
 up Funnel of rel2, can we?

   (Append? or Funnel)
-- SeqScan on rel1
-- Funnel
 -- PartialSeqScan on rel2
-- IndexScan on rel3


I am not sure, but what problem do you see in putting Funnel node
for one of the relation scans and not for the others.

 If we pull Funnel here, I think the plan shall be as follows:
   Funnel
-- SeqScan on rel1
-- PartialSeqScan on rel2
-- IndexScan on rel3


So if we go this route, then Funnel should have the capability
to execute the non-parallel part of the plan as well; in this
case it should be able to execute the non-parallel IndexScan on
rel3, and it might then need to distinguish between the
parallel and non-parallel parts of plans.  I think this could
make the Funnel node complex.


 If all we have to pay attention is Funnel node, it makes the code
 around path construction and pull-up logic much simpler, rather than
 multiple node types can kick background workers.


Okay, but I think pulling up the Funnel node makes sense only when all
nodes beneath it need to be executed in parallel.

   Even though create_parallelscan_paths() in v16 set num_workers not
   larger than parallel_seqscan_degree, total number of the concurrent
   background workers may exceed this configuration if more than two
   PartialSeqScan nodes are underlying.
   It is a different configuration from max_worker_processes, so it is
   not a matter as long as we have another restriction.
   However, how do we control the cap of number of worker processes per
   appendable Funnel node? For example, if a parent table has 200
   child tables but max_worker_processes are configured to 50.
   It is obviously impossible to launch all the background workers
   simultaneously. One idea I have is to suspend launch of some plans
   until earlier ones are completed.
  
 
  Okay, but I think in that idea you need to re-launch the workers again for
  a new set of relation scans, which could turn out to be costly. How about
  designing some way where workers, after completing their assigned work,
  check for a new set of tasks (which in this case would be to scan a new
  relation) and then execute the same?  I think in this way we can achieve
  dynamic allocation of work and achieve maximum parallelism with the
  available set of workers.  We have achieved this in ParallelSeqScan by
  scanning at the block level; once a worker finishes a block, it checks for
  a new block to scan.
 
 Is it possible to put multiple PlannedStmt on TOC, isn't it?

Yes, I don't see any problem in doing it that way.  So here, for
each different (child) relation, do you want to create a separate
PlannedStmt, or do you have something else in mind?

 If background worker picks up an uncompleted PlannedStmt first
 (based on round-robin likely?), it may achieve the maximum
 parallelism.

I think this can work well for 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread David Rowley
On 27 July 2015 at 21:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp
 wrote:

 Hello, can I ask some questions?

 I suppose we can take this as the analog of ParalleSeqScan.  I
 can see not so distinction between Append(ParalleSeqScan) and
 ParallelAppend(SeqScan). What difference is there between them?

 If other nodes will have the same functionality as you mention at
 the last of this proposal, it might be better that some part of
 this feature is implemented as a part of existing executor
 itself, but not as a deidicated additional node, just as my
 asynchronous fdw execution patch patially does. (Although it
 lacks planner part and bg worker launching..) If that is the
 case, it might be better that ExecProcNode is modified so that it
 supports both in-process and inter-bgworker cases by the single
 API.

 What do you think about this?


I have to say that I really like the thought of us having parallel enabled
stuff in Postgres, but I also have to say that I don't think inventing all
these special parallel node types is a good idea. If we think about
everything that we can parallelise...

Perhaps sort, hash join, seqscan, hash, bitmap heap scan, nested loop.
I don't want to debate that, but perhaps there's more, perhaps less.
Are we really going to duplicate all of the code and add in the parallel
stuff as new node types?

My other concern here is that I seldom hear people talk about the planner's
architectural lack of ability to make a good choice about how many parallel
workers to choose. Surely to properly calculate costs you need to know the
exact number of parallel workers that will be available at execution time,
but you need to know this at planning time!? I can't see how this works,
apart from just being very conservative about parallel workers, which I
think is really bad, as many databases have busy times in the day, and also
quiet times, generally quiet time is when large batch stuff gets done, and
that's the time that parallel stuff is likely most useful. Remember queries
are not always planned just before they're executed. We could have a
PREPAREd query, or we could have better plan caching in the future, or if
we build some intelligence into the planner to choose a good number of
workers based on the current server load, then what's to say that the
server will be under this load at exec time? If we plan during a quiet
time, and exec in a busy time all hell may break loose.

I really do think that existing nodes should just be initialized in a
parallel mode, and each node type can have a function to state if it
supports parallelism or not.
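
Purely as an illustration of that idea (node_supports_parallelism() is not an
existing API, and the set of node tags below is illustrative only), the
per-node capability check could be as small as:

    #include "postgres.h"
    #include "nodes/nodes.h"

    /* Hypothetical helper: can this executor node type be initialized in
     * parallel mode?  The cases listed are an illustration, not a statement
     * about which nodes are actually parallel safe. */
    static bool
    node_supports_parallelism(NodeTag tag)
    {
        switch (tag)
        {
            case T_SeqScan:
            case T_Append:
            case T_Agg:
                return true;
            default:
                return false;
        }
    }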

I'd really like to hear more opinions in the ideas I discussed here:

http://www.postgresql.org/message-id/CAApHDvp2STf0=pqfpq+e7wa4qdympfm5qu_ytupe7r0jlnh...@mail.gmail.com


This design makes use of the Funnel node that Amit has already made and
allows more than 1 node to be executed in parallel at once.

It appears that parallel enabling the executor node by node is
fundamentally locked into just 1 node being executed in parallel, then
perhaps a Funnel node gathering up the parallel worker buffers and
streaming those back in serial mode. I believe by design, this does not
permit a whole plan branch from executing in parallel and I really feel
like doing things this way is going to be very hard to undo and improve
later. I might be too stupid to figure it out, but how would parallel hash
join work if it can't gather tuples from the inner and outer nodes in
parallel?

Sorry for the rant, but I just feel like we're painting ourselves into a
corner by parallel enabling the executor node by node.
Apologies if I've completely misunderstood things.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Ashutosh Bapat
On Tue, Jul 28, 2015 at 12:59 PM, David Rowley david.row...@2ndquadrant.com
 wrote:


 On 27 July 2015 at 21:09, Kyotaro HORIGUCHI 
 horiguchi.kyot...@lab.ntt.co.jp wrote:

 Hello, can I ask some questions?

 I suppose we can take this as the analog of ParalleSeqScan.  I
 can see not so distinction between Append(ParalleSeqScan) and
 ParallelAppend(SeqScan). What difference is there between them?

 If other nodes will have the same functionality as you mention at
 the last of this proposal, it might be better that some part of
 this feature is implemented as a part of existing executor
 itself, but not as a deidicated additional node, just as my
 asynchronous fdw execution patch patially does. (Although it
 lacks planner part and bg worker launching..) If that is the
 case, it might be better that ExecProcNode is modified so that it
 supports both in-process and inter-bgworker cases by the single
 API.

 What do you think about this?


 I have to say that I really like the thought of us having parallel enabled
 stuff in Postgres, but I also have to say that I don't think inventing all
 these special parallel node types is a good idea. If we think about
 everything that we can parallelise...

 Perhaps sort, hash join, seqscan, hash, bitmap heap scan, nested loop.
 I don't want to debate that, but perhaps there's more, perhaps less.
 Are we really going to duplicate all of the code and add in the parallel
 stuff as new node types?

 My other concern here is that I seldom hear people talk about the
 planner's architectural lack of ability to make a good choice about how
 many parallel workers to choose. Surely to properly calculate costs you
 need to know the exact number of parallel workers that will be available at
 execution time, but you need to know this at planning time!? I can't see
 how this works, apart from just being very conservative about parallel
 workers, which I think is really bad, as many databases have busy times in
 the day, and also quiet times, generally quiet time is when large batch
 stuff gets done, and that's the time that parallel stuff is likely most
 useful. Remember queries are not always planned just before they're
 executed. We could have a PREPAREd query, or we could have better plan
 caching in the future, or if we build some intelligence into the planner to
 choose a good number of workers based on the current server load, then
 what's to say that the server will be under this load at exec time? If we
 plan during a quiet time, and exec in a busy time all hell may break loose.

 I really do think that existing nodes should just be initialized in a
 parallel mode, and each node type can have a function to state if it
 supports parallelism or not.

 I'd really like to hear more opinions in the ideas I discussed here:


 http://www.postgresql.org/message-id/CAApHDvp2STf0=pqfpq+e7wa4qdympfm5qu_ytupe7r0jlnh...@mail.gmail.com


 This design makes use of the Funnel node that Amit has already made and
 allows more than 1 node to be executed in parallel at once.

 It appears that parallel enabling the executor node by node is
 fundamentally locked into just 1 node being executed in parallel, then
 perhaps a Funnel node gathering up the parallel worker buffers and
 streaming those back in serial mode. I believe by design, this does not
 permit a whole plan branch from executing in parallel and I really feel
 like doing things this way is going to be very hard to undo and improve
 later. I might be too stupid to figure it out, but how would parallel hash
 join work if it can't gather tuples from the inner and outer nodes in
 parallel?

 Sorry for the rant, but I just feel like we're painting ourselves into a
 corner by parallel enabling the executor node by node.
 Apologies if I've completely misunderstood things.


+1, well articulated.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Amit Langote
On 2015-07-29 AM 11:02, Kouhei Kaigai wrote:

 ...
 synchronous Append path vs. parallel asynchronous Append with Funnel
 (below/above?) it. I guess the asynchronous version would always be
 cheaper. So, even if we end up with non-parallel sub-plans do we still add
 a Funnel to make Append asynchronous? Am I missing something?

 I expect Funnel itself will get Append capability but run sub-plans in
 background workers, to simplify path constructions. So, if Funnel with
 multiple sub-plans have cheaper cost than Append, it will replace the
 AppendPath by FunnelPath.
 
 Regarding to the cost estimation, I don't think parallel version is always
 cheaper than traditional Append, because of the cost to launch background
 workers. It increases startup cost to process the relation, thus, if upper
 node prefers small startup cost (like Limit), traditional Append still has
 advantages.
 

Right, I almost forgot about the start-up cost.

Thanks,
Amit





Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Kouhei Kaigai
 On 2015-07-28 PM 09:58, Kouhei Kaigai wrote:
 
  From my understanding of parallel seqscan patch, each worker's
  PartialSeqScan asks for a block to scan using a shared parallel heap scan
  descriptor that effectively keeps track of division of work among
  PartialSeqScans in terms of blocks. What if we invent a PartialAppend
  which each worker would run in case of a parallelized Append. It would use
  some kind of shared descriptor to pick a relation (Append member) to scan.
  The shared structure could be the list of subplans including the mutex for
  concurrency. It doesn't sound as effective as proposed
  ParallelHeapScanDescData does for PartialSeqScan but any more granular
  might be complicated. For example, consider (current_relation,
  current_block) pair. If there are more workers than subplans/partitions,
  then multiple workers might start working on the same relation after a
  round-robin assignment of relations (but of course, a later worker would
  start scanning from a later block in the same relation). I imagine that
  might help with parallelism across volumes if that's the case.
 
  I initially thought ParallelAppend kicks fixed number of background workers
  towards sub-plans, according to the estimated cost on the planning stage.
  However, I'm now inclined that background worker picks up an uncompleted
  PlannedStmt first. (For more details, please see the reply to Amit Kapila)
  It looks like less less-grained worker's job distribution.
  Once number of workers gets larger than number of volumes / partitions,
  it means more than two workers begin to assign same PartialSeqScan, thus
  it takes fine-grained job distribution using shared parallel heap scan.
 
 
 I like your idea of using round-robin assignment of partial/non-partial
 sub-plans to workers. Do you think there are two considerations of cost
 here: sub-plans themselves could have parallel paths to consider and (I
 think) your proposal introduces a new consideration - a plain old
 synchronous Append path vs. parallel asynchronous Append with Funnel
 (below/above?) it. I guess the asynchronous version would always be
 cheaper. So, even if we end up with non-parallel sub-plans do we still add
 a Funnel to make Append asynchronous? Am I missing something?

I expect Funnel itself will get Append capability and run sub-plans in
background workers, to simplify path construction. So, if a Funnel with
multiple sub-plans has a cheaper cost than Append, it will replace the
AppendPath with a FunnelPath.

Regarding the cost estimation, I don't think the parallel version is always
cheaper than the traditional Append, because of the cost to launch background
workers. It increases the startup cost to process the relation; thus, if the
upper node prefers a small startup cost (like Limit), the traditional Append
still has advantages.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Amit Langote

KaiGai-san,

On 2015-07-28 PM 09:58, Kouhei Kaigai wrote:

 From my understanding of parallel seqscan patch, each worker's
 PartialSeqScan asks for a block to scan using a shared parallel heap scan
 descriptor that effectively keeps track of division of work among
 PartialSeqScans in terms of blocks. What if we invent a PartialAppend
 which each worker would run in case of a parallelized Append. It would use
 some kind of shared descriptor to pick a relation (Append member) to scan.
 The shared structure could be the list of subplans including the mutex for
 concurrency. It doesn't sound as effective as proposed
 ParallelHeapScanDescData does for PartialSeqScan but any more granular
 might be complicated. For example, consider (current_relation,
 current_block) pair. If there are more workers than subplans/partitions,
 then multiple workers might start working on the same relation after a
 round-robin assignment of relations (but of course, a later worker would
 start scanning from a later block in the same relation). I imagine that
 might help with parallelism across volumes if that's the case.

 I initially thought ParallelAppend kicks fixed number of background workers
 towards sub-plans, according to the estimated cost on the planning stage.
 However, I'm now inclined that background worker picks up an uncompleted
 PlannedStmt first. (For more details, please see the reply to Amit Kapila)
 It looks like less less-grained worker's job distribution.
 Once number of workers gets larger than number of volumes / partitions,
 it means more than two workers begin to assign same PartialSeqScan, thus
 it takes fine-grained job distribution using shared parallel heap scan.
 

I like your idea of using round-robin assignment of partial/non-partial
sub-plans to workers. Do you think there are two considerations of cost
here: sub-plans themselves could have parallel paths to consider and (I
think) your proposal introduces a new consideration - a plain old
synchronous Append path vs. parallel asynchronous Append with Funnel
(below/above?) it. I guess the asynchronous version would always be
cheaper. So, even if we end up with non-parallel sub-plans do we still add
a Funnel to make Append asynchronous? Am I missing something?

 MergeAppend
 parallelization might involve a bit more complication but may be feasible
 with a PartialMergeAppend with slightly different kind of coordination
 among workers. What do you think of such an approach?

 Do we need to have something special in ParallelMergeAppend?
 If individual child nodes are designed to return sorted results,
 what we have to do seems to me same.
 

Sorry, I was needlessly worried; I did not realize that
MergeAppend uses a binaryheap to store tuples before returning them.

Thanks,
Amit





Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Kouhei Kaigai
 On 27 July 2015 at 21:09, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp
 wrote:
 
 
   Hello, can I ask some questions?
 
   I suppose we can take this as the analog of ParalleSeqScan.  I
   can see not so distinction between Append(ParalleSeqScan) and
   ParallelAppend(SeqScan). What difference is there between them?
 
   If other nodes will have the same functionality as you mention at
   the last of this proposal, it might be better that some part of
   this feature is implemented as a part of existing executor
   itself, but not as a deidicated additional node, just as my
   asynchronous fdw execution patch patially does. (Although it
   lacks planner part and bg worker launching..) If that is the
   case, it might be better that ExecProcNode is modified so that it
   supports both in-process and inter-bgworker cases by the single
   API.
 
   What do you think about this?
 
 
 
 I have to say that I really like the thought of us having parallel enabled 
 stuff
 in Postgres, but I also have to say that I don't think inventing all these 
 special
 parallel node types is a good idea. If we think about everything that we can
 parallelise...
 
 Perhaps sort, hash join, seqscan, hash, bitmap heap scan, nested loop. I 
 don't
 want to debate that, but perhaps there's more, perhaps less.
 Are we really going to duplicate all of the code and add in the parallel stuff
 as new node types?

 My other concern here is that I seldom hear people talk about the planner's
 architectural lack of ability to make a good choice about how many parallel 
 workers
 to choose. Surely to properly calculate costs you need to know the exact 
 number
 of parallel workers that will be available at execution time, but you need to
 know this at planning time!? I can't see how this works, apart from just being
 very conservative about parallel workers, which I think is really bad, as many
 databases have busy times in the day, and also quiet times, generally quiet 
 time
 is when large batch stuff gets done, and that's the time that parallel stuff 
 is
 likely most useful. Remember queries are not always planned just before 
 they're
 executed. We could have a PREPAREd query, or we could have better plan caching
 in the future, or if we build some intelligence into the planner to choose a 
 good
 number of workers based on the current server load, then what's to say that 
 the
 server will be under this load at exec time? If we plan during a quiet time, 
 and
 exec in a busy time all hell may break loose.

Even though it is not easy to estimate the number of available workers at
planning time, it might be possible to define a target number of workers to run.
If a Funnel cannot get its target number of workers, my preference is to tell
other workers (via a scoreboard?) not to pick up the next PlannedStmt and to
exit, when another Funnel cannot launch enough workers.

 I really do think that existing nodes should just be initialized in a parallel
 mode, and each node type can have a function to state if it supports 
 parallelism
 or not.
 
 I'd really like to hear more opinions in the ideas I discussed here:
 
 http://www.postgresql.org/message-id/CAApHDvp2STf0=pQfpq+e7WA4QdYmpFM5qu_YtU
 pe7r0jlnh...@mail.gmail.com
 
 This design makes use of the Funnel node that Amit has already made and allows
 more than 1 node to be executed in parallel at once.
 
 It appears that parallel enabling the executor node by node is fundamentally 
 locked
 into just 1 node being executed in parallel, then perhaps a Funnel node 
 gathering
 up the parallel worker buffers and streaming those back in serial mode. I 
 believe
 by design, this does not permit a whole plan branch from executing in parallel
 and I really feel like doing things this way is going to be very hard to undo
 and improve later. I might be too stupid to figure it out, but how would 
 parallel
 hash join work if it can't gather tuples from the inner and outer nodes in 
 parallel?

Hash-Join and Nest-Loop should not have a PartialSeqScan on the inner side, but
the outer side can be a PartialSeqScan under the Funnel node.
In the case of Hash-Join, the inner-side SeqScan loads all tuples (*1) into the
hash table once, then the records coming from the outer side are combined with
the hash table.
Even though the inner side is read redundantly by each worker, the advantage of
the parallel join will win as long as the inner side is small enough; this
assumption holds for the usual pair of master tables (small) and a fact table (big).


(*1) A colleague of ours is now working on this feature. It makes it possible
to drop unnecessary rows under the partitioned tables, so we may not need to
build the entire hash table for each background worker.
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f6...@bpxm15gp.gisp.nec.co.jp
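
To put the intended shape concretely (a sketch of the plan tree under the
assumptions above, with made-up table names):

  Funnel
   -- HashJoin
        -- PartialSeqScan on fact_table    (outer; blocks shared among workers)
        -- Hash
             -- SeqScan on master_table    (inner; loaded by every worker)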

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Kouhei Kaigai
 KaiGai-san,
 
 On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:
 
Append
 -- Funnel
  -- PartialSeqScan on rel1 (num_workers = 4)
 -- Funnel
  -- PartialSeqScan on rel2 (num_workers = 8)
 -- SeqScan on rel3
 
   shall be rewritten to
Funnel
  -- PartialSeqScan on rel1 (num_workers = 4)
  -- PartialSeqScan on rel2 (num_workers = 8)
  -- SeqScan on rel3(num_workers = 1)
 
 
 In the rewritten plan, are respective scans (PartialSeq or Seq) on rel1,
 rel2 and rel3 asynchronous w.r.t each other? Or does each one wait for the
 earlier one to finish? I would think the answer is no because then it
 would not be different from the former case, right? Because the original
 premise seems that (partitions) rel1, rel2, rel3 may be on different
 volumes so parallelism across volumes seems like a goal of parallelizing
 Append.
 
 From my understanding of parallel seqscan patch, each worker's
 PartialSeqScan asks for a block to scan using a shared parallel heap scan
 descriptor that effectively keeps track of division of work among
 PartialSeqScans in terms of blocks. What if we invent a PartialAppend
 which each worker would run in case of a parallelized Append. It would use
 some kind of shared descriptor to pick a relation (Append member) to scan.
 The shared structure could be the list of subplans including the mutex for
 concurrency. It doesn't sound as effective as proposed
 ParallelHeapScanDescData does for PartialSeqScan but any more granular
 might be complicated. For example, consider (current_relation,
 current_block) pair. If there are more workers than subplans/partitions,
 then multiple workers might start working on the same relation after a
 round-robin assignment of relations (but of course, a later worker would
 start scanning from a later block in the same relation). I imagine that
 might help with parallelism across volumes if that's the case.

I initially thought ParallelAppend kicks a fixed number of background workers
towards sub-plans, according to the cost estimated at the planning stage.
However, I'm now inclined to have a background worker pick up an uncompleted
PlannedStmt first. (For more details, please see the reply to Amit Kapila.)
It is a coarser-grained distribution of the workers' jobs.
Once the number of workers gets larger than the number of volumes / partitions,
two or more workers begin to be assigned to the same PartialSeqScan, and at that
point the fine-grained job distribution using the shared parallel heap scan takes over.
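
That fine-grained distribution is essentially the block-claiming logic of the
parallel seqscan patch; a simplified sketch (the structure and field names
here are assumed, not taken from the patch) looks like:

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/spin.h"

    /* shared-memory descriptor: workers claim the next block to scan */
    typedef struct SharedScanDesc
    {
        slock_t     mutex;          /* protects next_block */
        BlockNumber next_block;     /* next block to hand out */
        BlockNumber nblocks;        /* total blocks in the relation */
    } SharedScanDesc;

    static BlockNumber
    claim_next_block(SharedScanDesc *desc)
    {
        BlockNumber blkno;

        SpinLockAcquire(&desc->mutex);
        if (desc->next_block < desc->nblocks)
            blkno = desc->next_block++;
        else
            blkno = InvalidBlockNumber;     /* relation fully dispatched */
        SpinLockRelease(&desc->mutex);

        return blkno;
    }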

 MergeAppend
 parallelization might involve a bit more complication but may be feasible
 with a PartialMergeAppend with slightly different kind of coordination
 among workers. What do you think of such an approach?

Do we need anything special for ParallelMergeAppend?
If the individual child nodes are designed to return sorted results,
what we have to do seems to me to be the same.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Kouhei Kaigai
 On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
   -Original Message-
   From: pgsql-hackers-ow...@postgresql.org
   [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
   Sent: Monday, July 27, 2015 11:07 PM
   To: Amit Kapila
   
Is there a real need to have new node like ParallelAppendPath?
Can't we have Funnel node beneath AppendNode and then each
worker will be responsible to have SeqScan on each inherited child
relation.  Something like
   
Append
   --- Funnel
  -- SeqScan rel1
  -- SeqScan rel2
   
   If Funnel can handle both of horizontal and vertical parallelism,
   it is a great simplification. I never stick a new node.
  
   Once Funnel get a capability to have multiple child nodes, probably,
   Append node above will have gone. I expect set_append_rel_pathlist()
   add two paths based on Append and Funnel, then planner will choose
   the cheaper one according to its cost.
  
  In the latest v16 patch, Funnel is declared as follows:
 
typedef struct Funnel
{
Scanscan;
int num_workers;
} Funnel;
 
  If we try to add Append capability here, I expects the structure will
  be adjusted as follows, for example:
 
typedef struct Funnel
{
Scanscan;
List   *funnel_plans;
List   *funnel_num_workers;
} Funnel;
 
  As literal, funnel_plans saves underlying Plan nodes instead of the
  lefttree. Also, funnel_num_workers saves number of expected workers
  to be assigned on individual child plans.
 
 
 or shall we have a node like above and name it as FunnelAppend or
 AppenFunnel?

It is better to have a smaller number of node types capable of kicking
background workers, for simplification of path construction.

Let's assume the case below. When the planner considers a path to append
child scans on rel1, rel2 and rel3, but the cheapest path for rel2 is
Funnel+PartialSeqScan, we cannot put a Funnel here unless we pull
up the Funnel of rel2, can we?

  (Append? or Funnel)
   -- SeqScan on rel1
   -- Funnel
-- PartialSeqScan on rel2
   -- IndexScan on rel3

If we pull Funnel here, I think the plan shall be as follows:
  Funnel
   -- SeqScan on rel1
   -- PartialSeqScan on rel2
   -- IndexScan on rel3

If all we have to pay attention to is the Funnel node, it makes the code
around path construction and the pull-up logic much simpler than if
multiple node types can kick background workers.

  Even though create_parallelscan_paths() in v16 set num_workers not
  larger than parallel_seqscan_degree, total number of the concurrent
  background workers may exceed this configuration if more than two
  PartialSeqScan nodes are underlying.
  It is a different configuration from max_worker_processes, so it is
  not a matter as long as we have another restriction.
  However, how do we control the cap of number of worker processes per
  appendable Funnel node? For example, if a parent table has 200
  child tables but max_worker_processes are configured to 50.
  It is obviously impossible to launch all the background workers
  simultaneously. One idea I have is to suspend launch of some plans
  until earlier ones are completed.
 
 
 Okay, but I think in that idea you need to re-launch the workers again for
 a new set of relation scans, which could turn out to be costly. How about
 designing some way where workers, after completing their assigned work,
 check for a new set of tasks (which in this case would be to scan a new
 relation) and then execute the same?  I think in this way we can achieve
 dynamic allocation of work and achieve maximum parallelism with the
 available set of workers.  We have achieved this in ParallelSeqScan by
 scanning at the block level; once a worker finishes a block, it checks for
 a new block to scan.

It is possible to put multiple PlannedStmts on the TOC, isn't it?
If a background worker picks up an uncompleted PlannedStmt first
(likely based on round-robin?), it may achieve the maximum
parallelism. Yep, it seems to me a good idea which I want to try.
If (num of workers) > (num of sub-plans), some of the sub-plans can
have multiple workers from the beginning; then, other workers
also help to execute the heavy plans later.
It may be better to put the PlannedStmts in order of total_cost to
bias multi-worker execution towards the heavy plans from the beginning.
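
For example (a sketch only; subplan_cost_cmp() is a made-up name), the
sub-plans could be sorted by descending total_cost before being stored, so
the most expensive ones are claimed, and joined by extra workers, first:

    #include "postgres.h"
    #include "nodes/plannodes.h"

    /* comparator: order sub-plans by descending total_cost */
    static int
    subplan_cost_cmp(const void *a, const void *b)
    {
        const Plan *pa = *(const Plan *const *) a;
        const Plan *pb = *(const Plan *const *) b;

        if (pa->total_cost > pb->total_cost)
            return -1;
        if (pa->total_cost < pb->total_cost)
            return 1;
        return 0;
    }

    /* usage: qsort(subplan_array, nplans, sizeof(Plan *), subplan_cost_cmp); */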

TODO: Even if a heavy query occupies most of the available worker slots,
another session may want to use parallel execution later, while the
primary query is still executing. We may need to have a 'scoreboard'
in shared memory to know how many workers are potentially needed
and how many are overused by somebody. If someone has overconsumed
background workers, it should exit first, rather than picking up
the next PlannedStmt.

   We will need to pay attention another issues we will look at when Funnel
   kicks background worker towards asymmetric relations.
  
   If number of rows of individual child nodes are various, 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Amit Langote

KaiGai-san,

On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:
 
   Append
-- Funnel
 -- PartialSeqScan on rel1 (num_workers = 4)
-- Funnel
 -- PartialSeqScan on rel2 (num_workers = 8)
-- SeqScan on rel3
 
  shall be rewritten to
   Funnel
 -- PartialSeqScan on rel1 (num_workers = 4)
 -- PartialSeqScan on rel2 (num_workers = 8)
 -- SeqScan on rel3(num_workers = 1)
 

In the rewritten plan, are respective scans (PartialSeq or Seq) on rel1,
rel2 and rel3 asynchronous w.r.t each other? Or does each one wait for the
earlier one to finish? I would think the answer is no because then it
would not be different from the former case, right? Because the original
premise seems that (partitions) rel1, rel2, rel3 may be on different
volumes so parallelism across volumes seems like a goal of parallelizing
Append.

From my understanding of parallel seqscan patch, each worker's
PartialSeqScan asks for a block to scan using a shared parallel heap scan
descriptor that effectively keeps track of division of work among
PartialSeqScans in terms of blocks. What if we invent a PartialAppend
which each worker would run in case of a parallelized Append. It would use
some kind of shared descriptor to pick a relation (Append member) to scan.
The shared structure could be the list of subplans including the mutex for
concurrency. It doesn't sound as effective as proposed
ParallelHeapScanDescData does for PartialSeqScan but any more granular
might be complicated. For example, consider (current_relation,
current_block) pair. If there are more workers than subplans/partitions,
then multiple workers might start working on the same relation after a
round-robin assignment of relations (but of course, a later worker would
start scanning from a later block in the same relation). I imagine that
might help with parallelism across volumes if that's the case. MergeAppend
parallelization might involve a bit more complication but may be feasible
with a PartialMergeAppend with slightly different kind of coordination
among workers. What do you think of such an approach?

Thanks,
Amit





Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Amit Kapila
On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 Hello,

 I'm recently working/investigating on ParallelAppend feature
 towards the next commit fest. Below is my design proposal.

 1. Concept
 --
 Its concept is quite simple anybody might consider more than once.
 ParallelAppend node kicks background worker process to execute
 child nodes in parallel / asynchronous.
 It intends to improve the performance to scan a large partitioned
 tables from standpoint of entire throughput, however, latency of
 the first multi-hundred rows are not scope of this project.
 From standpoint of technology trend, it primarily tries to utilize
 multi-cores capability within a system, but also enables to expand
 distributed database environment using foreign-tables inheritance
 features.
 Its behavior is very similar to Funnel node except for several
 points, thus, we can reuse its infrastructure we have had long-
 standing discussion through the v9.5 development cycle.

 2. Problems to be solved
 -
 Typical OLAP workloads takes tons of tables join and scan on large
 tables which are often partitioned, and its KPI is query response
 time but very small number of sessions are active simultaneously.
 So, we are required to run a single query as rapid as possible even
 if it consumes larger computing resources than typical OLTP workloads.

 Current implementation to scan heap is painful when we look at its
 behavior from the standpoint - how many rows we can read within a
 certain time, because of synchronous manner.
 In the worst case, when SeqScan node tries to fetch the next tuple,
 heap_getnext() looks up a block on shared buffer, then ReadBuffer()
 calls storage manager to read the target block from the filesystem
 if not on the buffer. Next, operating system makes the caller
 process slept until required i/o get completed.
 Most of the cases are helped in earlier stage than the above worst
 case, however, the best scenario we can expect is: the next tuple
 already appear on top of the message queue (of course visibility
 checks are already done also) with no fall down to buffer manager
 or deeper.
 If we can run multiple scans in parallel / asynchronous, CPU core
 shall be assigned to another process by operating system, thus,
 it eventually improves the i/o density and enables higher processing
 throughput.
 Append node is an ideal point to be parallelized because
 - child nodes can have physically different location by tablespace,
   so further tuning is possible according to the system landscape.
 - it can control whether subplan is actually executed on background
   worker, per subplan basis. If subplan contains large tables and
   small tables, ParallelAppend may kick background worker to scan
   large tables only, but scan on small tables are by itself.
 - Like as Funnel node, we don't need to care about enhancement of
   individual node types. SeqScan, IndexScan, ForeignScan or others
   can perform as usual, but actually in parallel.


 3. Implementation
 --
 * Plan & Cost

 ParallelAppend shall appear where Append can appear except for the
 usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
 both of AppendPath and ParallelAppendPath with cost for each.


Is there a real need to have a new node like ParallelAppendPath?
Can't we have a Funnel node beneath the Append node, and then each
worker will be responsible for a SeqScan on each inherited child
relation?  Something like

Append
   --- Funnel
  -- SeqScan rel1
  -- SeqScan rel2


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kouhei Kaigai
 On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
  Hello,
 
  I'm recently working/investigating on ParallelAppend feature
  towards the next commit fest. Below is my design proposal.
 
  1. Concept
  --
  Its concept is quite simple anybody might consider more than once.
  ParallelAppend node kicks background worker process to execute
  child nodes in parallel / asynchronous.
  It intends to improve the performance to scan a large partitioned
  tables from standpoint of entire throughput, however, latency of
  the first multi-hundred rows are not scope of this project.
  From standpoint of technology trend, it primarily tries to utilize
  multi-cores capability within a system, but also enables to expand
  distributed database environment using foreign-tables inheritance
  features.
  Its behavior is very similar to Funnel node except for several
  points, thus, we can reuse its infrastructure we have had long-
  standing discussion through the v9.5 development cycle.
 
  2. Problems to be solved
  -
  Typical OLAP workloads takes tons of tables join and scan on large
  tables which are often partitioned, and its KPI is query response
  time but very small number of sessions are active simultaneously.
  So, we are required to run a single query as rapid as possible even
  if it consumes larger computing resources than typical OLTP workloads.
 
  Current implementation to scan heap is painful when we look at its
  behavior from the standpoint - how many rows we can read within a
  certain time, because of synchronous manner.
  In the worst case, when SeqScan node tries to fetch the next tuple,
  heap_getnext() looks up a block on shared buffer, then ReadBuffer()
  calls storage manager to read the target block from the filesystem
  if not on the buffer. Next, operating system makes the caller
  process slept until required i/o get completed.
  Most of the cases are helped in earlier stage than the above worst
  case, however, the best scenario we can expect is: the next tuple
  already appear on top of the message queue (of course visibility
  checks are already done also) with no fall down to buffer manager
  or deeper.
  If we can run multiple scans in parallel / asynchronous, CPU core
  shall be assigned to another process by operating system, thus,
  it eventually improves the i/o density and enables higher processing
  throughput.
  Append node is an ideal point to be parallelized because
  - child nodes can have physically different location by tablespace,
so further tuning is possible according to the system landscape.
  - it can control whether subplan is actually executed on background
worker, per subplan basis. If subplan contains large tables and
small tables, ParallelAppend may kick background worker to scan
large tables only, but scan on small tables are by itself.
  - Like as Funnel node, we don't need to care about enhancement of
individual node types. SeqScan, IndexScan, ForeignScan or others
can perform as usual, but actually in parallel.
 
 
  3. Implementation
  --
  * Plan & Cost
 
  ParallelAppend shall appear where Append can appear except for the
  usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
  both of AppendPath and ParallelAppendPath with cost for each.
 
 
 Is there a real need to have new node like ParallelAppendPath?
 Can't we have Funnel node beneath AppendNode and then each
 worker will be responsible to have SeqScan on each inherited child
 relation.  Something like
 
 Append
--- Funnel
   -- SeqScan rel1
   -- SeqScan rel2

If Funnel can handle both horizontal and vertical parallelism,
it is a great simplification. I do not insist on a new node.

Once Funnel gets the capability to have multiple child nodes, the
Append node above will probably be gone. I expect set_append_rel_pathlist()
to add two paths, based on Append and Funnel, and the planner will then choose
the cheaper one according to its cost.
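
A sketch of that idea inside set_append_rel_pathlist() could look roughly like
the following (create_funnel_append_path() is a made-up name used only to
illustrate the point; the Append path call mirrors what the planner does today):

    /*
     * Illustration only: build both an ordinary Append path and a
     * Funnel-based path over the same child paths, and let add_path()
     * keep whichever is cheaper.
     */
    add_path(rel, (Path *) create_append_path(rel, subpaths, NULL));
    add_path(rel, (Path *) create_funnel_append_path(root, rel, subpaths,
                                                     num_workers_list));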

We will need to pay attention to other issues when Funnel
kicks background workers towards asymmetric relations.

If the numbers of rows of the individual child nodes vary, we may
want to assign 10 background workers to scan rel1 with a PartialSeqScan.
On the other hand, rel2 may have a very small number of rows, so
its total_cost may be smaller than the cost to launch a worker.
In this case, Funnel has child nodes to be executed asynchronously and
synchronously.

If the cheapest path of a child relation is a pair of Funnel and
PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a
Funnel node that performs like Append needs to pull up the underlying
Funnel and assign an equivalent number of workers, as follows:

  Append
   -- Funnel
-- PartialSeqScan on rel1 (num_workers = 4)
   -- Funnel
-- PartialSeqScan on rel2 (num_workers = 8)
   -- SeqScan on rel3

 shall be rewritten to
  Funnel
-- PartialSeqScan on rel1 (num_workers = 4)
-- 

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kouhei Kaigai
 Hello, can I ask some questions?

 I suppose we can take this as the analog of ParalleSeqScan.  I
 can see not so distinction between Append(ParalleSeqScan) and
 ParallelAppend(SeqScan). What difference is there between them?

Append does not start to execute the second or later nodes until
the first node reaches the end of its scan.
On the other hand, ParallelAppend will kick off all the child nodes
(almost) simultaneously.
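
For reference, this is roughly what today's synchronous Append does (a
simplified sketch of the loop in nodeAppend.c, forward scans only): the next
child is only started once the current one is exhausted.

    for (;;)
    {
        PlanState      *subnode = node->appendplans[node->as_whichplan];
        TupleTableSlot *slot = ExecProcNode(subnode);

        if (!TupIsNull(slot))
            return slot;                /* a row from the current child */

        /* current child is exhausted; move on to the next one, if any */
        if (++node->as_whichplan >= node->as_nplans)
            return ExecClearTuple(node->ps.ps_ResultTupleSlot);
    }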

 If other nodes will have the same functionality as you mention at
 the last of this proposal, it might be better that some part of
 this feature is implemented as a part of existing executor
 itself, but not as a deidicated additional node, just as my
 asynchronous fdw execution patch patially does. (Although it
 lacks planner part and bg worker launching..) If that is the
 case, it might be better that ExecProcNode is modified so that it
 supports both in-process and inter-bgworker cases by the single
 API.
 
 What do you think about this?

Its downside is that we need to adjust all the existing nodes to
follow the new executor capability. At this moment, we have 38
node types derived from Plan. I think it is not an easy job to
review a patch that changes multiple dozens of files.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 regards,
 
  Hello,
 
  I'm recently working/investigating on ParallelAppend feature
  towards the next commit fest. Below is my design proposal.
 
  1. Concept
  --
  Its concept is quite simple anybody might consider more than once.
  ParallelAppend node kicks background worker process to execute
  child nodes in parallel / asynchronous.
  It intends to improve the performance to scan a large partitioned
  tables from standpoint of entire throughput, however, latency of
  the first multi-hundred rows are not scope of this project.
  From standpoint of technology trend, it primarily tries to utilize
  multi-cores capability within a system, but also enables to expand
  distributed database environment using foreign-tables inheritance
  features.
  Its behavior is very similar to Funnel node except for several
  points, thus, we can reuse its infrastructure we have had long-
  standing discussion through the v9.5 development cycle.
 
  2. Problems to be solved
  -
  Typical OLAP workloads takes tons of tables join and scan on large
  tables which are often partitioned, and its KPI is query response
  time but very small number of sessions are active simultaneously.
  So, we are required to run a single query as rapid as possible even
  if it consumes larger computing resources than typical OLTP workloads.
 
  Current implementation to scan heap is painful when we look at its
  behavior from the standpoint - how many rows we can read within a
  certain time, because of synchronous manner.
  In the worst case, when SeqScan node tries to fetch the next tuple,
  heap_getnext() looks up a block on shared buffer, then ReadBuffer()
  calls storage manager to read the target block from the filesystem
  if not on the buffer. Next, operating system makes the caller
  process slept until required i/o get completed.
  Most of the cases are helped in earlier stage than the above worst
  case, however, the best scenario we can expect is: the next tuple
  already appear on top of the message queue (of course visibility
  checks are already done also) with no fall down to buffer manager
  or deeper.
  If we can run multiple scans in parallel / asynchronous, CPU core
  shall be assigned to another process by operating system, thus,
  it eventually improves the i/o density and enables higher processing
  throughput.
  Append node is an ideal point to be parallelized because
  - child nodes can have physically different location by tablespace,
so further tuning is possible according to the system landscape.
  - it can control whether subplan is actually executed on background
worker, per subplan basis. If subplan contains large tables and
small tables, ParallelAppend may kick background worker to scan
large tables only, but scan on small tables are by itself.
  - Like as Funnel node, we don't need to care about enhancement of
individual node types. SeqScan, IndexScan, ForeignScan or others
can perform as usual, but actually in parallel.
 
 
  3. Implementation
  --
   * Plan & Cost
 
   ParallelAppend shall appear where Append can appear except for the
  usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
  both of AppendPath and ParallelAppendPath with cost for each.
  Cost estimation logic shall take further discussions, however,
  I expect the logic below to estimate the cost for ParallelAppend.
1. Sum startup_cost and run_cost for each child pathnode, but
   distinguish according to synchronous or asynchronous.
   Probably, total cost of pathnode is less than:
(parallel_setup_cost + its total cost / parallel_append_degree

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kouhei Kaigai
 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
 Sent: Monday, July 27, 2015 11:07 PM
 To: Amit Kapila
 Cc: pgsql-hackers@postgresql.org; Robert Haas; Kyotaro HORIGUCHI
 Subject: Re: [HACKERS] [DESIGN] ParallelAppend
 
  On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  
   Hello,
  
   I'm recently working/investigating on ParallelAppend feature
   towards the next commit fest. Below is my design proposal.
  
   1. Concept
   --
   Its concept is quite simple anybody might consider more than once.
   ParallelAppend node kicks background worker process to execute
   child nodes in parallel / asynchronous.
   It intends to improve the performance to scan a large partitioned
   tables from standpoint of entire throughput, however, latency of
   the first multi-hundred rows are not scope of this project.
   From standpoint of technology trend, it primarily tries to utilize
   multi-cores capability within a system, but also enables to expand
   distributed database environment using foreign-tables inheritance
   features.
   Its behavior is very similar to Funnel node except for several
   points, thus, we can reuse its infrastructure we have had long-
   standing discussion through the v9.5 development cycle.
  
   2. Problems to be solved
   -
    Typical OLAP workloads involve joins over many tables and scans on
    large tables which are often partitioned; their KPI is query
    response time, yet only a very small number of sessions are active
    simultaneously.  So, we are required to run a single query as
    rapidly as possible, even if it consumes more computing resources
    than typical OLTP workloads.
   
    The current implementation of heap scans is painful when we look at
    its behavior from the standpoint of how many rows we can read
    within a certain time, because of its synchronous manner.
    In the worst case, when a SeqScan node tries to fetch the next
    tuple, heap_getnext() looks up a block in shared buffers, then
    ReadBuffer() calls the storage manager to read the target block
    from the filesystem if it is not already buffered.  Next, the
    operating system puts the caller process to sleep until the
    required I/O completes.
    Most cases are resolved at an earlier stage than the above worst
    case; even so, the best scenario we can expect is that the next
    tuple already appears at the top of the message queue (with
    visibility checks already done) without falling down to the buffer
    manager or deeper.
    If we can run multiple scans in parallel / asynchronously, the
    operating system can assign CPU cores to other processes, which
    eventually improves the I/O density and enables higher processing
    throughput.
    The Append node is an ideal point to parallelize because
    - child nodes can reside in physically different locations by
      tablespace, so further tuning is possible according to the system
      landscape.
    - it can control, per subplan, whether a subplan is actually
      executed on a background worker.  If the plan contains both large
      and small tables, ParallelAppend may kick off background workers
      to scan only the large tables, and scan the small tables itself.
    - Like the Funnel node, we don't need to care about enhancements to
      individual node types.  SeqScan, IndexScan, ForeignScan and
      others can perform as usual, but actually in parallel.
  
  
   3. Implementation
   --
    * Plan & Cost
   
    ParallelAppend shall appear wherever Append can appear, except for
    its usage as a dummy.  So, I'll enhance set_append_rel_pathlist()
    to add both AppendPath and ParallelAppendPath, with a cost for each.
  
 
   Is there a real need to have a new node like ParallelAppendPath?
   Can't we have a Funnel node beneath the Append node, and then each
   worker will be responsible for a SeqScan on each inherited child
   relation?  Something like
 
  Append
 --- Funnel
-- SeqScan rel1
-- SeqScan rel2
 
  If Funnel can handle both horizontal and vertical parallelism,
  that is a great simplification; I don't insist on a new node.
 
  Once Funnel gets the capability to have multiple child nodes, the
  Append node above will probably be gone.  I expect
  set_append_rel_pathlist() to add two paths, one based on Append and
  one on Funnel, and the planner will then choose the cheaper one
  according to its cost.
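
A rough sketch of what that planner-side idea could look like is below.  It
is only an illustration: create_funnel_append_path() and
parallel_append_allowed() are hypothetical names, while add_path() and
create_append_path() are the existing planner routines (their exact
signatures vary by release).

  /*
   * Hypothetical sketch (assumes the usual PostgreSQL planner headers):
   * generate both a plain Append path and a Funnel-based parallel path,
   * and let add_path() keep whichever turns out cheaper.
   */
  static void
  add_append_and_funnel_paths(PlannerInfo *root, RelOptInfo *rel,
                              List *subpaths, int num_workers)
  {
      /* The ordinary Append path is always generated. */
      add_path(rel, (Path *) create_append_path(rel, subpaths, NULL));

      /* The parallel variant is added only when it is safe to do so;
       * add_path() discards whichever of the two is more expensive. */
      if (parallel_append_allowed(root, rel))
          add_path(rel, (Path *)
                   create_funnel_append_path(root, rel, subpaths,
                                             num_workers));
  }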

In the latest v16 patch, Funnel is declared as follows:

  typedef struct Funnel
  {
      Scan        scan;
      int         num_workers;
  } Funnel;

If we try to add Append capability here, I expect the structure will
be adjusted as follows, for example:

  typedef struct Funnel
  {
      Scan        scan;
      List       *funnel_plans;
      List       *funnel_num_workers;
  } Funnel;

As the names suggest, funnel_plans saves the underlying Plan nodes instead
of the lefttree, and funnel_num_workers saves the number of workers
expected to be assigned to each individual child plan.
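
To make the pairing concrete, here is a rough sketch (not from any posted
patch) of how the two lists could be kept in lockstep when an Append's
children are folded into such a Funnel; make_append_funnel() is a made-up
helper, while makeNode(), forboth(), lappend() and lappend_int() are the
usual node/list primitives.

  /*
   * Illustrative sketch only (assumes the usual PostgreSQL node/list
   * headers and the adjusted Funnel struct above): build a multi-child
   * Funnel whose i-th funnel_plans entry is paired with the i-th
   * funnel_num_workers entry.
   */
  static Funnel *
  make_append_funnel(List *child_plans, List *child_num_workers)
  {
      Funnel     *funnel = makeNode(Funnel);
      ListCell   *lc1;
      ListCell   *lc2;

      funnel->funnel_plans = NIL;
      funnel->funnel_num_workers = NIL;

      /* Walk both input lists together so plan and worker count stay
       * aligned by position. */
      forboth(lc1, child_plans, lc2, child_num_workers)
      {
          Plan   *subplan = (Plan *) lfirst(lc1);
          int     nworkers = lfirst_int(lc2);

          funnel->funnel_plans = lappend(funnel->funnel_plans, subplan);
          funnel->funnel_num_workers =
              lappend_int(funnel->funnel_num_workers, nworkers);
      }

      return funnel;
  }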

Even

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Amit Kapila
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

  -Original Message-
  From: pgsql-hackers-ow...@postgresql.org
  [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
  Sent: Monday, July 27, 2015 11:07 PM
  To: Amit Kapila
  
    Is there a real need to have a new node like ParallelAppendPath?
    Can't we have a Funnel node beneath the Append node, and then each
    worker will be responsible for a SeqScan on each inherited child
    relation?  Something like
  
   Append
  --- Funnel
 -- SeqScan rel1
 -- SeqScan rel2
  
   If Funnel can handle both horizontal and vertical parallelism,
   that is a great simplification; I don't insist on a new node.
  
   Once Funnel gets the capability to have multiple child nodes, the
   Append node above will probably be gone.  I expect
   set_append_rel_pathlist() to add two paths, one based on Append and
   one on Funnel, and the planner will then choose the cheaper one
   according to its cost.
 
 In the latest v16 patch, Funnel is declared as follows:

   typedef struct Funnel
   {
       Scan        scan;
       int         num_workers;
   } Funnel;

 If we try to add Append capability here, I expect the structure will
 be adjusted as follows, for example:

   typedef struct Funnel
   {
       Scan        scan;
       List       *funnel_plans;
       List       *funnel_num_workers;
   } Funnel;

 As the names suggest, funnel_plans saves the underlying Plan nodes instead
 of the lefttree, and funnel_num_workers saves the number of workers
 expected to be assigned to each individual child plan.


Or shall we have a node like the one above and name it FunnelAppend or
AppendFunnel?

 Even though create_parallelscan_paths() in v16 sets num_workers no
 larger than parallel_seqscan_degree, the total number of concurrent
 background workers may exceed this configuration if two or more
 PartialSeqScan nodes are underlying.
 It is a different configuration from max_worker_processes, so it is
 not a problem as long as we have another restriction.
 However, how do we control the cap on the number of worker processes
 per appendable Funnel node?  For example, suppose a parent table has
 200 child tables but max_worker_processes is configured to 50.
 It is obviously impossible to launch all the background workers
 simultaneously.  One idea I have is to suspend the launch of some
 plans until earlier ones are completed.


Okay, but I think with that idea you need to re-launch the workers for each
new set of relation scans, which could turn out to be costly.  How about
designing some way where workers, after completing their assigned work,
check for a new set of tasks (which in this case would be to scan a new
relation) and then execute it?  I think in this way we can achieve dynamic
allocation of work and maximum parallelism with the available set of
workers.  We have achieved this in ParallelSeqScan by scanning at the block
level: once a worker finishes a block, it checks for a new block to scan.
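
A minimal sketch of that idea follows, assuming a counter in the shared
parallel state hands out child plans to whichever worker becomes free.
ParallelAppendState, pa_next_plan and execute_child_plan() are illustrative
names only, not code from any posted patch.

  /*
   * Sketch only (assumes the usual PostgreSQL atomics headers): dynamic
   * assignment of Append children to workers.  Each worker atomically
   * claims the next unscanned child plan, runs it to completion, and
   * loops, so no worker has to be re-launched between relations.
   */
  typedef struct ParallelAppendState
  {
      pg_atomic_uint32    pa_next_plan;   /* next unclaimed child index */
      uint32              pa_num_plans;   /* total number of child plans */
  } ParallelAppendState;

  /* Hypothetical placeholder: runs the plan_no-th subplan to completion. */
  extern void execute_child_plan(uint32 plan_no);

  static void
  parallel_append_worker(ParallelAppendState *pstate)
  {
      for (;;)
      {
          uint32  plan_no = pg_atomic_fetch_add_u32(&pstate->pa_next_plan, 1);

          if (plan_no >= pstate->pa_num_plans)
              break;                      /* all children already claimed */

          /* Run the claimed subplan and push its tuples into the shared
           * queue toward the leader, then loop for the next one. */
          execute_child_plan(plan_no);
      }
  }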


  We will need to pay attention to other issues that arise when Funnel
  kicks off background workers toward asymmetric relations.
 
  If the number of rows in the individual child nodes varies, we may
  want to assign 10 background workers to scan rel1 with PartialSeqScan.
  On the other hand, rel2 may have a very small number of rows, so its
  total_cost may be smaller than the cost to launch a worker.
  In this case, Funnel has some child nodes to be executed asynchronously
  and others synchronously.
 

I think this might turn out to be slightly tricky; for example, how do we
know how many workers are sufficient for a relation of a given size?
Another way to divide the work in this case could be in terms of chunks of
blocks: once a worker finishes its current set of blocks, it should be able
to get a new set of blocks to scan.  So, for example, if we decide on a
chunk size of 32 and the total number of blocks in the whole inheritance
hierarchy is 3200, then the maximum number of workers we should allocate to
this scan is 100; if parallel_seqscan_degree is lower than that, we can use
that many workers and then let them scan 32 blocks at a time.
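
The sizing arithmetic above can be written down directly; a small
standalone sketch with the numbers assumed from this example
(parallel_seqscan_degree arbitrarily set to 64) is:

  /* Worked example of the chunk-based worker sizing described above. */
  #include <stdio.h>

  int
  main(void)
  {
      int     total_blocks = 3200;          /* whole inheritance hierarchy */
      int     chunk_size = 32;              /* blocks handed out at a time */
      int     parallel_seqscan_degree = 64; /* assumed setting */

      int     max_useful_workers = total_blocks / chunk_size;    /* = 100 */
      int     workers = (max_useful_workers < parallel_seqscan_degree)
                        ? max_useful_workers
                        : parallel_seqscan_degree;

      printf("workers to launch: %d\n", workers); /* 64, capped by degree */
      return 0;
  }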


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kyotaro HORIGUCHI
Hello, can I ask some questions?

I suppose we can take this as the analog of ParallelSeqScan.  I
see little distinction between Append(ParallelSeqScan) and
ParallelAppend(SeqScan).  What difference is there between them?

If other nodes will have the same functionality as you mention at
the end of this proposal, it might be better for some part of
this feature to be implemented as a part of the existing executor
itself rather than as a dedicated additional node, just as my
asynchronous FDW execution patch partially does.  (Although it
lacks the planner part and background worker launching.)  If that
is the case, it might be better to modify ExecProcNode so that it
supports both the in-process and inter-bgworker cases through a
single API.

What do you think about this?



regards,

 Hello,
 
 I've recently been working on / investigating the ParallelAppend
 feature towards the next commit fest.  Below is my design proposal.
 
 1. Concept
 --
 Its concept is quite simple; anybody might have considered it more
 than once.  The ParallelAppend node kicks off background worker
 processes to execute child nodes in parallel / asynchronously.
 It intends to improve the performance of scanning large partitioned
 tables from the standpoint of overall throughput; however, the
 latency of the first few hundred rows is not in the scope of this
 project.
 From the standpoint of technology trends, it primarily tries to
 utilize multi-core capability within a system, but it also makes it
 possible to extend to a distributed database environment using
 foreign-table inheritance features.
 Its behavior is very similar to the Funnel node except for several
 points, so we can reuse the infrastructure we have discussed at
 length through the v9.5 development cycle.
 
 2. Problems to be solved
 -
 Typical OLAP workloads involve joins over many tables and scans on
 large tables which are often partitioned; their KPI is query response
 time, yet only a very small number of sessions are active
 simultaneously.  So, we are required to run a single query as rapidly
 as possible, even if it consumes more computing resources than typical
 OLTP workloads.
 
 The current implementation of heap scans is painful when we look at
 its behavior from the standpoint of how many rows we can read within
 a certain time, because of its synchronous manner.
 In the worst case, when a SeqScan node tries to fetch the next tuple,
 heap_getnext() looks up a block in shared buffers, then ReadBuffer()
 calls the storage manager to read the target block from the filesystem
 if it is not already buffered.  Next, the operating system puts the
 caller process to sleep until the required I/O completes.
 Most cases are resolved at an earlier stage than the above worst case;
 even so, the best scenario we can expect is that the next tuple
 already appears at the top of the message queue (with visibility
 checks already done) without falling down to the buffer manager or
 deeper.
 If we can run multiple scans in parallel / asynchronously, the
 operating system can assign CPU cores to other processes, which
 eventually improves the I/O density and enables higher processing
 throughput.
 The Append node is an ideal point to parallelize because
 - child nodes can reside in physically different locations by
   tablespace, so further tuning is possible according to the system
   landscape.
 - it can control, per subplan, whether a subplan is actually executed
   on a background worker.  If the plan contains both large and small
   tables, ParallelAppend may kick off background workers to scan only
   the large tables, and scan the small tables itself.
 - Like the Funnel node, we don't need to care about enhancements to
   individual node types.  SeqScan, IndexScan, ForeignScan and others
   can perform as usual, but actually in parallel.
 
 
 3. Implementation
 --
 * Plan & Cost
 
 ParallelAppend shall appear wherever Append can appear, except for its
 usage as a dummy.  So, I'll enhance set_append_rel_pathlist() to add
 both AppendPath and ParallelAppendPath, with a cost for each.
 The cost estimation logic will need further discussion; however, I
 expect the logic below to estimate the cost for ParallelAppend.
   1. Sum startup_cost and run_cost for each child pathnode, but
      distinguish between synchronous and asynchronous ones.
      Probably, a pathnode whose total cost is less than:
        (parallel_setup_cost + its total cost / parallel_append_degree
         + number of rows * cpu_tuple_comm_cost)
      is nonsense to run on a background worker.
   2. parallel_setup_cost * (# of asynchronous nodes) is added to the
      sum of the startup_cost of the asynchronous nodes.
   3. The sum of the run_cost of the asynchronous nodes is divided by
      parallel_append_degree, then cpu_tuple_comm_cost * (total # of
      rows from the asynchronous nodes) is added.
   4. Both the synchronous and asynchronous costs are added together;
      that becomes the cost of ParallelAppend.
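
(Written as code, the four steps amount to roughly the sketch below.  This
is only an editorial illustration, not part of the proposal or any patch;
ChildCost and cost_parallel_append() are made-up names, and
parallel_append_degree / cpu_tuple_comm_cost stand for the parameters
proposed in the text.)

  #include <stdbool.h>

  typedef double Cost;

  typedef struct ChildCost
  {
      Cost    startup_cost;
      Cost    run_cost;
      double  rows;
      bool    is_async;       /* planned to run on a background worker? */
  } ChildCost;

  static void
  cost_parallel_append(const ChildCost *child, int nchildren,
                       int parallel_append_degree,
                       Cost parallel_setup_cost, Cost cpu_tuple_comm_cost,
                       Cost *startup_cost, Cost *total_cost)
  {
      Cost    sync_startup = 0, sync_run = 0;
      Cost    async_startup = 0, async_run = 0;
      double  async_rows = 0;
      int     nasync = 0;

      /* Step 1: sum costs per child, keeping sync and async separate. */
      for (int i = 0; i < nchildren; i++)
      {
          if (child[i].is_async)
          {
              async_startup += child[i].startup_cost;
              async_run += child[i].run_cost;
              async_rows += child[i].rows;
              nasync++;
          }
          else
          {
              sync_startup += child[i].startup_cost;
              sync_run += child[i].run_cost;
          }
      }

      /* Step 2: one worker setup cost per asynchronous child. */
      async_startup += parallel_setup_cost * nasync;

      /* Step 3: async run cost is shared across the degree of parallelism,
       * plus the per-row cost of shipping tuples back to the parent. */
      if (nasync > 0)
          async_run = async_run / parallel_append_degree
                      + cpu_tuple_comm_cost * async_rows;

      /* Step 4: synchronous and asynchronous parts add up. */
      *startup_cost = sync_startup + async_startup;
      *total_cost = *startup_cost + sync_run + async_run;
  }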
 Obviously, this stands on the viewpoint that cost reflects the response
 time of the underlying plan.  So,