RE: Determine parallel-safety of partition relations for Inserts

2021-02-21 Thread houzj.f...@fujitsu.com
Hi,
 
Attaching v7 patches with the changes:
   * rebase the code on the greg's latest parallel insert patch.

 Please consider it for further review.
 
 Best regards,
 houzj



v7_0004-reloption-parallel_dml-test-and-doc.patch
Description: v7_0004-reloption-parallel_dml-test-and-doc.patch


v7_0001-guc-option-enable_parallel_dml-src.patch
Description: v7_0001-guc-option-enable_parallel_dml-src.patch


v7_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v7_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v7_0003-reloption-parallel_dml-src.patch.patch
Description: v7_0003-reloption-parallel_dml-src.patch.patch


RE: Determine parallel-safety of partition relations for Inserts

2021-02-17 Thread Hou, Zhijie
Hi,

Attaching v6 patches with the changes:
  * rebase the code on the greg's latest parallel insert patch.
  * fix some code comment.
  * add some test to cover the partitioned table.

Please consider it for further review.

Best regards,
Houzj





v6_0004-reloption-parallel_dml-test-and-doc.patch
Description: v6_0004-reloption-parallel_dml-test-and-doc.patch


v6_0001-guc-option-enable_parallel_dml-src.patch
Description: v6_0001-guc-option-enable_parallel_dml-src.patch


v6_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v6_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v6_0003-reloption-parallel_dml-src.patch.patch
Description: v6_0003-reloption-parallel_dml-src.patch.patch


RE: Determine parallel-safety of partition relations for Inserts

2021-02-04 Thread Huang, Qiuyan
Hi,

I notice the comment 5) about is_parallel_possible_for_modify() seems to be 
inaccurate in clauses.c.
*  5) the reloption parallel_dml_enabled is not set for the target table

Because you have set parallel_dml_enabled to 'true' as default. 
{
{
"parallel_dml_enabled",
"Enables \"parallel dml\" feature for this table",
RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
ShareUpdateExclusiveLock
},
true
}

So even user doesn't set parallel_dml_enabled explicit, its value is 'on', 
Parallel is also possible for this rel(when enable_parallel_dml is on). 

Maybe we can just comment like this or a better one you'd like if you agree 
with above:
*  5) the reloption parallel_dml_enabled is set to off

Regards
Huang

> -Original Message-
> From: Hou, Zhijie 
> Sent: Wednesday, February 3, 2021 9:01 AM
> To: Greg Nancarrow 
> Cc: Amit Kapila ; PostgreSQL Hackers
> ; vignesh C ;
> Amit Langote ; David Rowley
> ; Tom Lane ; Tsunakawa,
> Takayuki/綱川 貴之 
> Subject: RE: Determine parallel-safety of partition relations for Inserts
> 
> Hi,
> 
> Attaching v5 patches with the changes:
>   * rebase the code on the greg's latest parallel insert patch
>   * fix some code style.
> 
> Please consider it for further review.
> 
> Best regards,
> Houzj
> 
> 
> 
> 





RE: Determine parallel-safety of partition relations for Inserts

2021-02-02 Thread Hou, Zhijie
> For v3_0003-reloption-parallel_dml-src.patch :
> +   table_close(rel, NoLock);

> Since the rel would always be closed, it seems the return value from 
> RelationGetParallelDML() can be assigned to a variable, followed by call to 
> table_close(), then the return statement.

Thanks for the comment. Fixed in the latest patch.

Best regards,
houzj




RE: Determine parallel-safety of partition relations for Inserts

2021-02-02 Thread Hou, Zhijie
Hi,

Attaching v5 patches with the changes:
  * rebase the code on the greg's latest parallel insert patch
  * fix some code style.

Please consider it for further review.

Best regards,
Houzj
 
 





v5_0004-reloption-parallel_dml-test-and-doc.patch
Description: v5_0004-reloption-parallel_dml-test-and-doc.patch


v5_0001-guc-option-enable_parallel_dml-src.patch
Description: v5_0001-guc-option-enable_parallel_dml-src.patch


v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v5_0003-reloption-parallel_dml-src.patch.patch
Description: v5_0003-reloption-parallel_dml-src.patch.patch


RE: Determine parallel-safety of partition relations for Inserts

2021-02-01 Thread Hou, Zhijie
> > IMO, It seems more readable to extract all the check that we can do
> > before the safety-check and put them in the new function.
> >
> > Please consider it for further review.
> >
> 
> I've updated your v2 patches and altered some comments and documentation
> changes (but made no code changes) - please compare against your v2 patches,
> and see whether you agree with the changes to the wording.
> In the documentation, you will also notice that in your V2 patch, it says
> that the "parallel_dml_enabled" table option defaults to false.
> As it actually defaults to true, I changed that in the documentation too.

Hi greg,

Thanks a lot for the document update, LGTM.

Attaching v4 patches with the changes:
 * fix some typos in the code.
 * store partitioned reloption in a separate struct PartitionedOptions, 
   making it more standard and easier to expand in the future.

Please consider it for further review.

Best regards,
Houzj





v4_0001-guc-option-enable_parallel_dml-src.patch
Description: v4_0001-guc-option-enable_parallel_dml-src.patch


v4_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v4_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v4_0003-reloption-parallel_dml-src.patch.patch
Description: v4_0003-reloption-parallel_dml-src.patch.patch


v4_0004-reloption-parallel_dml-test-and-doc.patch
Description: v4_0004-reloption-parallel_dml-test-and-doc.patch


Re: Determine parallel-safety of partition relations for Inserts

2021-02-01 Thread Zhihong Yu
Hi,
For v3_0003-reloption-parallel_dml-src.patch :

+* Check if parallel_dml_enabled is enabled for the target table,
+* if not, skip the safety checks and return PARALLEL_UNSAFE.

Looks like the return value is true / false. So the above comment should be
adjusted.

+   if (!RelationGetParallelDML(rel, true))
+   {
+   table_close(rel, NoLock);
+   return false;
+   }
+
+   table_close(rel, NoLock);

Since the rel would always be closed, it seems the return value
from RelationGetParallelDML() can be assigned to a variable, followed by
call to table_close(), then the return statement.

Cheers

On Mon, Feb 1, 2021 at 3:56 PM Greg Nancarrow  wrote:

> On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie 
> wrote:
> >
> > Attatching v2 patch which addressed the comments above.
> >
> > Some further refactor:
> >
> > Introducing a new function is_parallel_possible_for_modify() which
> decide whether to do safety check.
> >
> > IMO, It seems more readable to extract all the check that we can do
> before the safety-check and put them
> > in the new function.
> >
> > Please consider it for further review.
> >
>
> I've updated your v2 patches and altered some comments and
> documentation changes (but made no code changes) - please compare
> against your v2 patches, and see whether you agree with the changes to
> the wording.
> In the documentation, you will also notice that in your V2 patch, it
> says that the "parallel_dml_enabled" table option defaults to false.
> As it actually defaults to true, I changed that in the documentation
> too.
>
> Regards,
> Greg Nancarrow
> Fujitsu Australia
>


Re: Determine parallel-safety of partition relations for Inserts

2021-02-01 Thread Greg Nancarrow
On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie  wrote:
>
> Attatching v2 patch which addressed the comments above.
>
> Some further refactor:
>
> Introducing a new function is_parallel_possible_for_modify() which decide 
> whether to do safety check.
>
> IMO, It seems more readable to extract all the check that we can do before 
> the safety-check and put them
> in the new function.
>
> Please consider it for further review.
>

I've updated your v2 patches and altered some comments and
documentation changes (but made no code changes) - please compare
against your v2 patches, and see whether you agree with the changes to
the wording.
In the documentation, you will also notice that in your V2 patch, it
says that the "parallel_dml_enabled" table option defaults to false.
As it actually defaults to true, I changed that in the documentation
too.

Regards,
Greg Nancarrow
Fujitsu Australia


v3_0004-reloption-parallel_dml-test-and-doc.patch
Description: Binary data


v3_0001-guc-option-enable_parallel_dml-src.patch
Description: Binary data


v3_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: Binary data


v3_0003-reloption-parallel_dml-src.patch
Description: Binary data


RE: Determine parallel-safety of partition relations for Inserts

2021-01-31 Thread Hou, Zhijie
Hi greg,

Thanks for the review !

> Personally, I think a table's "parallel_dml" option should be ON by default.
> It's annoying to have to separately enable it for each and every table being
> used, when I think the need to turn it selectively OFF should be fairly
> rare.

Yes, I agreed.
Changed.

> And I'm not sure that "parallel_dml" is the best name for the table option
> - because it sort of implies parallel dml WILL be used - but that isn't
> true, it depends on other factors too.
> So I think (to be consistent with other table option naming) it would have
> to be something like "parallel_dml_enabled".

Agreed.
Changed to parallel_dml_enabled.

Attatching v2 patch which addressed the comments above.

Some further refactor:

Introducing a new function is_parallel_possible_for_modify() which decide 
whether to do safety check.

IMO, It seems more readable to extract all the check that we can do before the 
safety-check and put them
in the new function.

Please consider it for further review.

Best regards,
houzj




v2_0003-reloption-parallel_dml-src.patch
Description: v2_0003-reloption-parallel_dml-src.patch


v2_0004-reloption-parallel_dml-test-and-doc.patch
Description: v2_0004-reloption-parallel_dml-test-and-doc.patch


v2_0001-guc-option-enable_parallel_dml-src.patch
Description: v2_0001-guc-option-enable_parallel_dml-src.patch


v2_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v2_0002-guc-option-enable_parallel_dml-doc-and-test.patch


Re: Determine parallel-safety of partition relations for Inserts

2021-01-31 Thread Greg Nancarrow
On Fri, Jan 29, 2021 at 5:44 PM Hou, Zhijie  wrote:
>
>
> Attatching v1 patches, introducing options which let user manually control 
> whether to use parallel dml.
>
> About the patch:
> 1. add a new guc option: enable_parallel_dml (boolean)
> 2. add a new tableoption: parallel_dml (boolean)
>
>The default of both is off(false).
>
> User can set enable_parallel_dml in postgresql.conf or seesion to enable 
> parallel dml.
> If user want to choose some specific to use parallel insert, they can set 
> table.parallel_dml to on.
>
> Some attention:
> (1)
> Currently if guc option enable_parallel_dml is set to on but table's 
> parallel_dml is off,
> planner still do not consider parallel for dml.
>
> In this way, If user want to use parallel dml , they have to set 
> enable_parallel_dml=on and set parallel_dml = on.
> If someone dislike this, I think we can also let tableoption. parallel_dml's 
> default value to on ,with this user only need
> to set enable_parallel_dml=on
>
> (2)
> For the parallel_dml.
> If target table is partitioned, and it's parallel_dml is set to on, planner 
> will ignore
> The option value of it's child.
> This is beacause we can not divide dml plan to separate table, so we just 
> check the target table itself.
>
>
> Thoughts and comments are welcome.
>

Personally, I think a table's "parallel_dml" option should be ON by default.
It's annoying to have to separately enable it for each and every table
being used, when I think the need to turn it selectively OFF should be
fairly rare.
And I'm not sure that "parallel_dml" is the best name for the table
option - because it sort of implies parallel dml WILL be used - but
that isn't true, it depends on other factors too.
So I think (to be consistent with other table option naming) it would
have to be something like "parallel_dml_enabled".
I'm still looking at the patches, but have so far noticed that there
are some issues in the documentation updates (grammatical issues and
consistency issues with current documentation), and also some of the
explanations are not clear. I guess I can address these separately.

Regards,
Greg Nancarrow
Fujitsu Australia




RE: Determine parallel-safety of partition relations for Inserts

2021-01-28 Thread Hou, Zhijie
> Hi,
> 
> Attatching v1 patches, introducing options which let user manually control
> whether to use parallel dml.
> 
> About the patch:
> 1. add a new guc option: enable_parallel_dml (boolean) 2. add a new
> tableoption: parallel_dml (boolean)
> 
>The default of both is off(false).
> 
> User can set enable_parallel_dml in postgresql.conf or seesion to enable
> parallel dml.
> If user want to choose some specific to use parallel insert, they can set
> table.parallel_dml to on.
> 
> Some attention:
> (1)
> Currently if guc option enable_parallel_dml is set to on but table's
> parallel_dml is off, planner still do not consider parallel for dml.
> 
> In this way, If user want to use parallel dml , they have to set
> enable_parallel_dml=on and set parallel_dml = on.
> If someone dislike this, I think we can also let tableoption. parallel_dml's
> default value to on ,with this user only need to set enable_parallel_dml=on
> 
> (2)
> For the parallel_dml.
> If target table is partitioned, and it's parallel_dml is set to on, planner
> will ignore The option value of it's child.
> This is beacause we can not divide dml plan to separate table, so we just
> check the target table itself.
> 
> 
> Thoughts and comments are welcome.

The patch is based on v13 patch(parallel insert select) in [1]

[1] 
https://www.postgresql.org/message-id/CAJcOf-ejV8iU%2BYpuV4qbYEY-2vCG1QF2g3Gxn%3DZ%2BPyUH_5f84A%40mail.gmail.com

Best regards,
houzj




RE: Determine parallel-safety of partition relations for Inserts

2021-01-28 Thread Hou, Zhijie
Hi,

Attatching v1 patches, introducing options which let user manually control 
whether to use parallel dml.

About the patch:
1. add a new guc option: enable_parallel_dml (boolean)
2. add a new tableoption: parallel_dml (boolean)

   The default of both is off(false).

User can set enable_parallel_dml in postgresql.conf or seesion to enable 
parallel dml.
If user want to choose some specific to use parallel insert, they can set 
table.parallel_dml to on.

Some attention:
(1)
Currently if guc option enable_parallel_dml is set to on but table's 
parallel_dml is off,
planner still do not consider parallel for dml.

In this way, If user want to use parallel dml , they have to set 
enable_parallel_dml=on and set parallel_dml = on.
If someone dislike this, I think we can also let tableoption. parallel_dml's 
default value to on ,with this user only need
to set enable_parallel_dml=on

(2)
For the parallel_dml.
If target table is partitioned, and it's parallel_dml is set to on, planner 
will ignore
The option value of it's child. 
This is beacause we can not divide dml plan to separate table, so we just check 
the target table itself.


Thoughts and comments are welcome.

Best regards,
houzj




v1_0004-reloption-parallel_dml-test-and-doc.patch
Description: v1_0004-reloption-parallel_dml-test-and-doc.patch


v1_0001-guc-option-enable_parallel_dml-src.patch
Description: v1_0001-guc-option-enable_parallel_dml-src.patch


v1_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Description: v1_0002-guc-option-enable_parallel_dml-doc-and-test.patch


v1_0003-reloption-parallel_dml-src.patch
Description: v1_0003-reloption-parallel_dml-src.patch


RE: Determine parallel-safety of partition relations for Inserts

2021-01-28 Thread Hou, Zhijie
> > Hi
> >
> > I have an issue about the parameter for DML.
> >
> > If we define the parameter as a tableoption.
> >
> > For a partitioned table, if we set the parent table's parallel_dml=on,
> > and set one of its partition parallel_dml=off, it seems we can not divide
> the plan for the separate table.
> >
> > For this case, should we just check the parent's parameter ?
> >
> 
> I think so. IIUC, this means the Inserts where target table is parent table,
> those will just check the option of the parent table and then ignore the
> option value for child tables whereas we will consider childrel's option
> for Inserts where target table is one of the child table, right?
> 

Yes, I think we can just check the target table itself.

Best regards,
houzj





Re: Determine parallel-safety of partition relations for Inserts

2021-01-28 Thread Amit Kapila
On Thu, Jan 28, 2021 at 5:00 PM Hou, Zhijie  wrote:
>
> > From: Amit Kapila 
> > > Good question. I think if we choose to have a separate parameter for
> > > DML, it can probably a boolean to just indicate whether to enable
> > > parallel DML for a specified table and use the parallel_workers
> > > specified in the table used in SELECT.
> >
> > Agreed.
>
> Hi
>
> I have an issue about the parameter for DML.
>
> If we define the parameter as a tableoption.
>
> For a partitioned table, if we set the parent table's parallel_dml=on,
> and set one of its partition parallel_dml=off, it seems we can not divide the 
> plan for the separate table.
>
> For this case, should we just check the parent's parameter ?
>

I think so. IIUC, this means the Inserts where target table is parent
table, those will just check the option of the parent table and then
ignore the option value for child tables whereas we will consider
childrel's option for Inserts where target table is one of the child
table, right?

-- 
With Regards,
Amit Kapila.




RE: Determine parallel-safety of partition relations for Inserts

2021-01-28 Thread Hou, Zhijie
> From: Amit Kapila 
> > Good question. I think if we choose to have a separate parameter for
> > DML, it can probably a boolean to just indicate whether to enable
> > parallel DML for a specified table and use the parallel_workers
> > specified in the table used in SELECT.
> 
> Agreed.

Hi

I have an issue about the parameter for DML.

If we define the parameter as a tableoption.

For a partitioned table, if we set the parent table's parallel_dml=on,
and set one of its partition parallel_dml=off, it seems we can not divide the 
plan for the separate table.

For this case, should we just check the parent's parameter ?

Best regards,
houzj




RE: Determine parallel-safety of partition relations for Inserts

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Good question. I think if we choose to have a separate parameter for
> DML, it can probably a boolean to just indicate whether to enable
> parallel DML for a specified table and use the parallel_workers
> specified in the table used in SELECT.

Agreed.


Regards
Takayuki Tsunakawa



Re: Determine parallel-safety of partition relations for Inserts

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 10:27 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > We already allow users to specify the degree of parallelism for all
> > the parallel operations via guc's max_parallel_maintenance_workers,
> > max_parallel_workers_per_gather, then we have a reloption
> > parallel_workers and vacuum command has the parallel option where
> > users can specify the number of workers that can be used for
> > parallelism. The parallelism considers these as hints but decides
> > parallelism based on some other parameters like if there are that many
> > workers available, etc. Why the users would expect differently for
> > parallel DML?
>
> I agree that the user would want to specify the degree of parallelism of DML, 
> too.  My simple (probably silly) question was, in INSERT SELECT,
>
> * If the target table has 10 partitions and the source table has 100 
> partitions, how would the user want to specify parameters?
>
> * If the source and target tables have the same number of partitions, and the 
> user specified different values to parallel_workers and parallel_dml_workers, 
> how many parallel workers would run?
>

Good question. I think if we choose to have a separate parameter for
DML, it can probably a boolean to just indicate whether to enable
parallel DML for a specified table and use the parallel_workers
specified in the table used in SELECT.

-- 
With Regards,
Amit Kapila.




RE: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> We already allow users to specify the degree of parallelism for all
> the parallel operations via guc's max_parallel_maintenance_workers,
> max_parallel_workers_per_gather, then we have a reloption
> parallel_workers and vacuum command has the parallel option where
> users can specify the number of workers that can be used for
> parallelism. The parallelism considers these as hints but decides
> parallelism based on some other parameters like if there are that many
> workers available, etc. Why the users would expect differently for
> parallel DML?

I agree that the user would want to specify the degree of parallelism of DML, 
too.  My simple (probably silly) question was, in INSERT SELECT,

* If the target table has 10 partitions and the source table has 100 
partitions, how would the user want to specify parameters?

* If the source and target tables have the same number of partitions, and the 
user specified different values to parallel_workers and parallel_dml_workers, 
how many parallel workers would run?

* What would the query plan be like?  Something like below?  Can we easily 
support this sort of nested thing?

Gather
  Workers Planned: 
  Insert
Gather
  Workers Planned: 
  Parallel Seq Scan


> Which memory specific to partitions are you referring to here and does
> that apply to the patch being discussed?

Relation cache and catalog cache, which are not specific to partitions.  This 
patch's current parallel safety check opens and closes all descendant 
partitions of the target table.  That leaves those cache entries in 
CacheMemoryContext after the SQL statement ends.  But as I said, we can 
consider it's not a serious problem in this case because the parallel DML would 
be executed in limited number of concurrent sessions.  I just touched on the 
memory consumption issue for completeness in comparison with (3).


Regards
Takayuki Tsunakawa



Re: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread Amit Kapila
On Mon, Jan 18, 2021 at 6:08 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > I think it would be good if the parallelism works by default when
> > required but I guess if we want to use something on these lines then
> > we can always check if the parallel_workers option is non-zero for a
> > relation (with RelationGetParallelWorkers). So users can always say
> > Alter Table  Set (parallel_workers = 0) if they don't want
> > to enable write parallelism for tbl and if someone is bothered that
> > this might impact Selects as well because the same option is used to
> > compute the number of workers for it then we can invent a second
> > option parallel_dml_workers or something like that.
>
> Yes, if we have to require some specification to enable parallel DML, I agree 
> that parallel query and parall DML can be separately enabled.  With that 
> said, I'm not sure if the user, and PG developers, want to allow specifying 
> degree of parallelism for DML.
>

We already allow users to specify the degree of parallelism for all
the parallel operations via guc's max_parallel_maintenance_workers,
max_parallel_workers_per_gather, then we have a reloption
parallel_workers and vacuum command has the parallel option where
users can specify the number of workers that can be used for
parallelism. The parallelism considers these as hints but decides
parallelism based on some other parameters like if there are that many
workers available, etc. Why the users would expect differently for
parallel DML?

>
> > > As an aside, (1) and (2) has a potential problem with memory consumption.
> > >
> >
> > I can see the memory consumption argument for (2) because we might end
> > up generating parallel paths (partial paths) for reading the table but
> > don't see how it applies to (1)?
>
> I assumed that we would still open all partitions for parallel safety check 
> in (1) and (2).  In (1), parallel safety check is done only when parallel DML 
> is explicitly enabled by the user.  Just opening partitions keeps 
> CacheMemoryContext bloated even after they are closed.
>

Which memory specific to partitions are you referring to here and does
that apply to the patch being discussed?

-- 
With Regards,
Amit Kapila.




Re: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread Amit Kapila
On Sun, Jan 17, 2021 at 4:45 PM Amit Langote  wrote:
>
> On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila  wrote:
> > On Fri, Jan 15, 2021 at 6:45 PM Amit Langote  
> > wrote:
> > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  
> > > wrote:
> > > > We want to do this for Inserts where only Select can be parallel and
> > > > Inserts will always be done by the leader backend. This is actually
> > > > the case we first want to implement.
> > >
> > > Sorry, I haven't looked at the linked threads and the latest patches
> > > there closely enough yet, so I may be misreading this, but if the
> > > inserts will always be done by the leader backend as you say, then why
> > > does the planner need to be checking the parallel safety of the
> > > *target* table's expressions?
> > >
> >
> > The reason is that once we enter parallel-mode we can't allow
> > parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> > enter the parallel-mode at the beginning of the statement execution,
> > see ExecutePlan(). So, the Insert will be performed in parallel-mode
> > even though it happens in the leader backend. It is not possible that
> > we finish getting all the tuples from the gather node first and then
> > start inserting. Even, if we somehow find something to make this work
> > anyway the checks being discussed will be required to make inserts
> > parallel (where inserts will be performed by workers) which is
> > actually the next patch in the thread I mentioned in the previous
> > email.
> >
> > Does this answer your question?
>
> Yes, thanks for the explanation.  I kind of figured that doing the
> insert part itself in parallel using workers would be a part of the
> end goal of this work, although that didn't come across immediately.
>
> It's a bit unfortunate that the parallel safety check of the
> individual partitions cannot be deferred until it's known that a given
> partition will be affected by the command at all.  Will we need
> fundamental changes to how parallel query works to make that possible?
>  If so, have such options been considered in these projects?
>

I think it is quite fundamental to how parallel query works and we
might not be able to change it due to various reasons like (a) it will
end up generating a lot of paths in optimizer when it is not safe to
do so and in the end, we won't use it. (b) If after inserting into a
few partitions we came to know that the next partition we are going to
insert has some parallel-unsafe constraints then we need to give up
the execution and restart the statement by again trying to first plan
it by having non-parallel paths. Now, we can optimize this by
retaining both parallel and non-parallel plans such that if we fail to
execute parallel-plan we can use a non-parallel plan to execute the
statement but still that doesn't seem like an advisable approach.

The extra time spent in optimizer will pay-off well by the parallel
execution. As pointer earlier, you can see one of the results shared
on the other thread [1]. The cases where it might not get the benefit
(say when the underlying plan is non-parallel) can have some impact
but still, we have not tested that in detail. The ideas we have
discussed so far to address that are (a) postpone parallel-safety
checks for partitions till there are some underneath partial paths
(from which parallel paths can be generated) but that has some
down-side in that we will end up generating partial paths when that is
really not required, (b) have a rel option like parallel_dml_workers
or use existing option parallel_workers to allow considering parallel
insert for a relation. Any better ideas?

>  If such
> changes are not possible in the short term, like for v14, we should at
> least try to make sure that the eager checking of all partitions is
> only performed if using parallelism is possible at all.
>

As of now, we do first check if it is safe to generate a parallel plan
for underlying select (in Insert into  Select ..) and then perform
parallel-safety checks for the DML. We can postpone it further as
suggested above in (a).

> I will try to take a look at the patches themselves to see if there's
> something I know that will help.
>

Thank you. It will be really helpful if you can do that.

[1] - 
https://www.postgresql.org/message-id/b54f2e306780449093c38cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local

-- 
With Regards,
Amit Kapila.




RE: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> I think it would be good if the parallelism works by default when
> required but I guess if we want to use something on these lines then
> we can always check if the parallel_workers option is non-zero for a
> relation (with RelationGetParallelWorkers). So users can always say
> Alter Table  Set (parallel_workers = 0) if they don't want
> to enable write parallelism for tbl and if someone is bothered that
> this might impact Selects as well because the same option is used to
> compute the number of workers for it then we can invent a second
> option parallel_dml_workers or something like that.

Yes, if we have to require some specification to enable parallel DML, I agree 
that parallel query and parall DML can be separately enabled.  With that said, 
I'm not sure if the user, and PG developers, want to allow specifying degree of 
parallelism for DML.


> > As an aside, (1) and (2) has a potential problem with memory consumption.
> >
> 
> I can see the memory consumption argument for (2) because we might end
> up generating parallel paths (partial paths) for reading the table but
> don't see how it applies to (1)?

I assumed that we would still open all partitions for parallel safety check in 
(1) and (2).  In (1), parallel safety check is done only when parallel DML is 
explicitly enabled by the user.  Just opening partitions keeps 
CacheMemoryContext bloated even after they are closed.


Regards
Takayuki Tsunakawa



Re: Determine parallel-safety of partition relations for Inserts

2021-01-17 Thread Amit Langote
On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila  wrote:
> On Fri, Jan 15, 2021 at 6:45 PM Amit Langote  wrote:
> > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  wrote:
> > > We want to do this for Inserts where only Select can be parallel and
> > > Inserts will always be done by the leader backend. This is actually
> > > the case we first want to implement.
> >
> > Sorry, I haven't looked at the linked threads and the latest patches
> > there closely enough yet, so I may be misreading this, but if the
> > inserts will always be done by the leader backend as you say, then why
> > does the planner need to be checking the parallel safety of the
> > *target* table's expressions?
> >
>
> The reason is that once we enter parallel-mode we can't allow
> parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
> enter the parallel-mode at the beginning of the statement execution,
> see ExecutePlan(). So, the Insert will be performed in parallel-mode
> even though it happens in the leader backend. It is not possible that
> we finish getting all the tuples from the gather node first and then
> start inserting. Even, if we somehow find something to make this work
> anyway the checks being discussed will be required to make inserts
> parallel (where inserts will be performed by workers) which is
> actually the next patch in the thread I mentioned in the previous
> email.
>
> Does this answer your question?

Yes, thanks for the explanation.  I kind of figured that doing the
insert part itself in parallel using workers would be a part of the
end goal of this work, although that didn't come across immediately.

It's a bit unfortunate that the parallel safety check of the
individual partitions cannot be deferred until it's known that a given
partition will be affected by the command at all.  Will we need
fundamental changes to how parallel query works to make that possible?
 If so, have such options been considered in these projects?  If such
changes are not possible in the short term, like for v14, we should at
least try to make sure that the eager checking of all partitions is
only performed if using parallelism is possible at all.

I will try to take a look at the patches themselves to see if there's
something I know that will help.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 7:35 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > This will surely increase planning time but the execution is reduced
> > to an extent due to parallelism that it won't matter for either of the
> > cases if we see just total time. For example, see the latest results
> > for parallel inserts posted by Haiying Tang [3]. There might be an
> > impact when Selects can't be parallelized due to the small size of the
> > Select-table but we still have to traverse all the partitions to
> > determine parallel-safety but not sure how much it is compared to
> > overall time. I guess we need to find the same but apart from that can
> > anyone think of a better way to determine parallel-safety of
> > partitioned relation for Inserts?
>
> Three solutions(?) quickly come to my mind:
>
>
> (1) Have the user specify whether they want to parallelize DML
> Oracle [1] and SQL Server [2] take this approach.  Oracle disables parallel 
> DML execution by default.  The reason is described as "This mode is required 
> because parallel DML and serial DML have different locking, transaction, and 
> disk space requirements and parallel DML is disabled for a session by 
> default."  To enable parallel DML in a session or in a specific statement, 
> you need to run either of the following:
>
>   ALTER SESSION ENABLE PARALLEL DML;
>   INSERT /*+ ENABLE_PARALLEL_DML */ …
>
> Besides, the user has to specify a parallel hint in a DML statement, or 
> specify the parallel attribute in CREATE or ALTER TABLE.
>
> SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT 
> statement like this:
>
>   INSERT INTO Sales.SalesHistory WITH (TABLOCK)  (target columns...) SELECT 
> ...;
>

I think it would be good if the parallelism works by default when
required but I guess if we want to use something on these lines then
we can always check if the parallel_workers option is non-zero for a
relation (with RelationGetParallelWorkers). So users can always say
Alter Table  Set (parallel_workers = 0) if they don't want
to enable write parallelism for tbl and if someone is bothered that
this might impact Selects as well because the same option is used to
compute the number of workers for it then we can invent a second
option parallel_dml_workers or something like that.

>
> (2) Postpone the parallel safety check after the planner finds a worthwhile 
> parallel query plan
> I'm not sure if the current planner code allows this easily...
>

I think it is possible but it has a bit of disadvantage as well as
mentioned in response to Ashutosh's email [1].

>
> (3) Record the parallel safety in system catalog
> Add a column like relparallel in pg_class that indicates the parallel safety 
> of the relation.  planner just checks the value instead of doing heavy work 
> for every SQL statement.  That column's value is modified whenever a relation 
> alteration is made that affects the parallel safety, such as adding a domain 
> column and CHECK constraint.  In case of a partitioned relation, the parallel 
> safety attributes of all its descendant relations are merged.  For example, 
> if a partition becomes parallel-unsafe, the ascendant partitioned tables also 
> become parallel-unsafe.
>
> But... developing such code would be burdonsome and bug-prone?
>
>
> I'm inclined to propose (1).  Parallel DML would be something that a limited 
> people run in limited circumstances (data loading in data warehouse and batch 
> processing in OLTP systems by the DBA or data administrator), so I think it's 
> legitimate to require explicit specification of parallelism.
>
> As an aside, (1) and (2) has a potential problem with memory consumption.
>

I can see the memory consumption argument for (2) because we might end
up generating parallel paths (partial paths) for reading the table but
don't see how it applies to (1)?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1J80Rzn4M-A5sfkmJ8NjgTxbaC8UWVaNHK6%2B2BCYYv2Nw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 6:45 PM Amit Langote  wrote:
>
> On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  wrote:
> > We want to do this for Inserts where only Select can be parallel and
> > Inserts will always be done by the leader backend. This is actually
> > the case we first want to implement.
>
> Sorry, I haven't looked at the linked threads and the latest patches
> there closely enough yet, so I may be misreading this, but if the
> inserts will always be done by the leader backend as you say, then why
> does the planner need to be checking the parallel safety of the
> *target* table's expressions?
>

The reason is that once we enter parallel-mode we can't allow
parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
enter the parallel-mode at the beginning of the statement execution,
see ExecutePlan(). So, the Insert will be performed in parallel-mode
even though it happens in the leader backend. It is not possible that
we finish getting all the tuples from the gather node first and then
start inserting. Even, if we somehow find something to make this work
anyway the checks being discussed will be required to make inserts
parallel (where inserts will be performed by workers) which is
actually the next patch in the thread I mentioned in the previous
email.

Does this answer your question?

-- 
With Regards,
Amit Kapila.




RE: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> Sorry, I haven't looked at the linked threads and the latest patches
> there closely enough yet, so I may be misreading this, but if the
> inserts will always be done by the leader backend as you say, then why
> does the planner need to be checking the parallel safety of the
> *target* table's expressions?

Yeah, I also wanted to confirm this next - that is, whether the current patch 
allows the SELECT operation to execute in parallel but the INSERT operation 
serially.  Oracle allows it; it even allows the user to specify a hint after 
INSERT and SELECT separately.  Even if INSERT in INSERT SELECT can't be run in 
parallel, it is useful for producing summary data, such as aggregating large 
amounts of IoT sensor data in parallel and inserting the small amount of 
summary data serially.


Regards
Takayuki Tsunakawa



RE: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> This will surely increase planning time but the execution is reduced
> to an extent due to parallelism that it won't matter for either of the
> cases if we see just total time. For example, see the latest results
> for parallel inserts posted by Haiying Tang [3]. There might be an
> impact when Selects can't be parallelized due to the small size of the
> Select-table but we still have to traverse all the partitions to
> determine parallel-safety but not sure how much it is compared to
> overall time. I guess we need to find the same but apart from that can
> anyone think of a better way to determine parallel-safety of
> partitioned relation for Inserts?

Three solutions(?) quickly come to my mind:


(1) Have the user specify whether they want to parallelize DML
Oracle [1] and SQL Server [2] take this approach.  Oracle disables parallel DML 
execution by default.  The reason is described as "This mode is required 
because parallel DML and serial DML have different locking, transaction, and 
disk space requirements and parallel DML is disabled for a session by default." 
 To enable parallel DML in a session or in a specific statement, you need to 
run either of the following:

  ALTER SESSION ENABLE PARALLEL DML;
  INSERT /*+ ENABLE_PARALLEL_DML */ …

Besides, the user has to specify a parallel hint in a DML statement, or specify 
the parallel attribute in CREATE or ALTER TABLE.

SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT 
statement like this:

  INSERT INTO Sales.SalesHistory WITH (TABLOCK)  (target columns...) SELECT ...;


(2) Postpone the parallel safety check after the planner finds a worthwhile 
parallel query plan
I'm not sure if the current planner code allows this easily...


(3) Record the parallel safety in system catalog
Add a column like relparallel in pg_class that indicates the parallel safety of 
the relation.  planner just checks the value instead of doing heavy work for 
every SQL statement.  That column's value is modified whenever a relation 
alteration is made that affects the parallel safety, such as adding a domain 
column and CHECK constraint.  In case of a partitioned relation, the parallel 
safety attributes of all its descendant relations are merged.  For example, if 
a partition becomes parallel-unsafe, the ascendant partitioned tables also 
become parallel-unsafe.

But... developing such code would be burdonsome and bug-prone?


I'm inclined to propose (1).  Parallel DML would be something that a limited 
people run in limited circumstances (data loading in data warehouse and batch 
processing in OLTP systems by the DBA or data administrator), so I think it's 
legitimate to require explicit specification of parallelism.

As an aside, (1) and (2) has a potential problem with memory consumption.  
Opening relations bloat CacheMemoryContext with relcaches and catcaches, and 
closing relations does not free the (all) memory.  But I don't think it could 
really become a problem in practice, because parallel DML would be run in 
limited number of concurrent sessions.


[1]
Types of Parallelism
https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D8290A02-BE5F-436A-B814-D6FD71CEE81F

[2]
INSERT (Transact-SQL)
https://docs.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver15#best-practices


Regards
Takayuki Tsunakawa



Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Langote
On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  wrote:
> We want to do this for Inserts where only Select can be parallel and
> Inserts will always be done by the leader backend. This is actually
> the case we first want to implement.

Sorry, I haven't looked at the linked threads and the latest patches
there closely enough yet, so I may be misreading this, but if the
inserts will always be done by the leader backend as you say, then why
does the planner need to be checking the parallel safety of the
*target* table's expressions?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 5:53 PM Ashutosh Bapat
 wrote:
>
> On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila  wrote:
> >
> > While reviewing parallel insert [1] (Insert into  Select) and
> > parallel copy patches [2], it came to my notice that both the patches
> > traverse the entire partition hierarchy to determine parallel-safety
> > of partitioned relations. This is required because before considering
> > the Insert or Copy can be considered for parallelism, we need to
> > determine whether it is safe to do so. We need to check for each
> > partition because any of the partitions can have some parallel-unsafe
> > index expression, constraint, etc. We do a similar thing for Selects
> > in standard_planner.
> >
> > The plain Select case for partitioned tables was simpler because we
> > anyway loop through all the partitions in set_append_rel_size() and we
> > determine parallel-safety of each partition at that time but the same
> > is not true for Inserts.
> >
> > For Inserts, currently, we only open the partition table when we are
> > about to insert into that partition. During ExecInsert, we find out
> > the partition matching the partition-key value and then lock if it is
> > not already locked. In this patch, we need to open each partition at
> > the planning time to determine its parallel-safety.
>
> We don't want to open the partitions where no rows will be inserted.
>
> >
> > This will surely increase planning time but the execution is reduced
> > to an extent due to parallelism that it won't matter for either of the
> > cases if we see just total time. For example, see the latest results
> > for parallel inserts posted by Haiying Tang [3]. There might be an
> > impact when Selects can't be parallelized due to the small size of the
> > Select-table but we still have to traverse all the partitions to
> > determine parallel-safety but not sure how much it is compared to
> > overall time. I guess we need to find the same but apart from that can
> > anyone think of a better way to determine parallel-safety of
> > partitioned relation for Inserts?
>
> In case of SELECT we open only those partitions which surive pruning.
> So those are the ones which will definitely required to be scanned. We
> perform parallelism checks only on those partitions. The actual check
> isn't much costly.
>
> >
> > Thoughts?
> >
> > Note: I have kept a few people in Cc who are either directly involved
> > in this work or work regularly in the partitioning related work just
> > in the hope that might help in moving the discussion forward.
>
> Since you brought up comparison between SELECT and INSERT, "pruning"
> partitions based on the values being INSERTed might help. It should be
> doable in case of INSERT ... SELECT where we need to prune partitions
> based on the clauses of SELECT. Doable with some little effort in case
> of VALUEs and COPY.
>

I don't think we should try pruning in this patch as that is a
separate topic and even after pruning the same problem can happen when
we are not able to prune partitions in the table where we want to
Insert.

> Second possibility is to open partitions only when the estimated
> number of rows to be inserted goes beyond a certain value.
>

Yeah, this option has merits in the sense that we will pay the cost to
traverse partitions only when the parallel plan is possible. If you
see the 0001 patch in email [1], it tries to determine parallel-safety
(which in turn will traverse all partition tables) in standard_planner
where we determine the parallel-safety for the Query tree. Now, if we
have to postpone it for partitioned tables then we need to determine
it at the places where we create_modifytable_path if the
current_rel->pathlist has some parallel_safe paths. And which will
also mean that we need to postpone generating gather node as well till
that time because Insert can be parallel-unsafe.

This will have one disadvantage over the current approach implemented
by the patch which is that we might end up spending a lot of time in
the optimizer to create partial paths and later (while determining
parallel-safety of Insert) find that none of them is required.

If we decide to postpone the parallel-safety checking for Inserts then
why not we check parallel-safety for all sorts of Inserts at that
point. That can at least simplify the patch.

> Third idea is to use something similar to parallel append where
> individual partitions are scanned sequentially but multiple partitions
> are scanned in parallel. When a row is inserted into a non-yet-opened
> partition, allocate one/more backends to insert into partitions which
> do not allow parallelism, otherwise continue to use a common pool of
> parallel workers for insertion. This means the same thread performing
> select may not perform insert. So some complications will be involved.
>

We want to do this for Inserts where only Select can be parallel and
Inserts will always be done by the leader backend. This is actually
the case we first want t

Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Ashutosh Bapat
On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila  wrote:
>
> While reviewing parallel insert [1] (Insert into  Select) and
> parallel copy patches [2], it came to my notice that both the patches
> traverse the entire partition hierarchy to determine parallel-safety
> of partitioned relations. This is required because before considering
> the Insert or Copy can be considered for parallelism, we need to
> determine whether it is safe to do so. We need to check for each
> partition because any of the partitions can have some parallel-unsafe
> index expression, constraint, etc. We do a similar thing for Selects
> in standard_planner.
>
> The plain Select case for partitioned tables was simpler because we
> anyway loop through all the partitions in set_append_rel_size() and we
> determine parallel-safety of each partition at that time but the same
> is not true for Inserts.
>
> For Inserts, currently, we only open the partition table when we are
> about to insert into that partition. During ExecInsert, we find out
> the partition matching the partition-key value and then lock if it is
> not already locked. In this patch, we need to open each partition at
> the planning time to determine its parallel-safety.

We don't want to open the partitions where no rows will be inserted.

>
> This will surely increase planning time but the execution is reduced
> to an extent due to parallelism that it won't matter for either of the
> cases if we see just total time. For example, see the latest results
> for parallel inserts posted by Haiying Tang [3]. There might be an
> impact when Selects can't be parallelized due to the small size of the
> Select-table but we still have to traverse all the partitions to
> determine parallel-safety but not sure how much it is compared to
> overall time. I guess we need to find the same but apart from that can
> anyone think of a better way to determine parallel-safety of
> partitioned relation for Inserts?

In case of SELECT we open only those partitions which surive pruning.
So those are the ones which will definitely required to be scanned. We
perform parallelism checks only on those partitions. The actual check
isn't much costly.

>
> Thoughts?
>
> Note: I have kept a few people in Cc who are either directly involved
> in this work or work regularly in the partitioning related work just
> in the hope that might help in moving the discussion forward.

Since you brought up comparison between SELECT and INSERT, "pruning"
partitions based on the values being INSERTed might help. It should be
doable in case of INSERT ... SELECT where we need to prune partitions
based on the clauses of SELECT. Doable with some little effort in case
of VALUEs and COPY.

Second possibility is to open partitions only when the estimated
number of rows to be inserted goes beyond a certain value.

Third idea is to use something similar to parallel append where
individual partitions are scanned sequentially but multiple partitions
are scanned in parallel. When a row is inserted into a non-yet-opened
partition, allocate one/more backends to insert into partitions which
do not allow parallelism, otherwise continue to use a common pool of
parallel workers for insertion. This means the same thread performing
select may not perform insert. So some complications will be involved.

-- 
Best Wishes,
Ashutosh Bapat