RE: Determine parallel-safety of partition relations for Inserts
Hi, Attaching v7 patches with the changes: * rebase the code on the greg's latest parallel insert patch. Please consider it for further review. Best regards, houzj v7_0004-reloption-parallel_dml-test-and-doc.patch Description: v7_0004-reloption-parallel_dml-test-and-doc.patch v7_0001-guc-option-enable_parallel_dml-src.patch Description: v7_0001-guc-option-enable_parallel_dml-src.patch v7_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v7_0002-guc-option-enable_parallel_dml-doc-and-test.patch v7_0003-reloption-parallel_dml-src.patch.patch Description: v7_0003-reloption-parallel_dml-src.patch.patch
RE: Determine parallel-safety of partition relations for Inserts
Hi, Attaching v6 patches with the changes: * rebase the code on the greg's latest parallel insert patch. * fix some code comment. * add some test to cover the partitioned table. Please consider it for further review. Best regards, Houzj v6_0004-reloption-parallel_dml-test-and-doc.patch Description: v6_0004-reloption-parallel_dml-test-and-doc.patch v6_0001-guc-option-enable_parallel_dml-src.patch Description: v6_0001-guc-option-enable_parallel_dml-src.patch v6_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v6_0002-guc-option-enable_parallel_dml-doc-and-test.patch v6_0003-reloption-parallel_dml-src.patch.patch Description: v6_0003-reloption-parallel_dml-src.patch.patch
RE: Determine parallel-safety of partition relations for Inserts
Hi, I notice the comment 5) about is_parallel_possible_for_modify() seems to be inaccurate in clauses.c. * 5) the reloption parallel_dml_enabled is not set for the target table Because you have set parallel_dml_enabled to 'true' as default. { { "parallel_dml_enabled", "Enables \"parallel dml\" feature for this table", RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED, ShareUpdateExclusiveLock }, true } So even user doesn't set parallel_dml_enabled explicit, its value is 'on', Parallel is also possible for this rel(when enable_parallel_dml is on). Maybe we can just comment like this or a better one you'd like if you agree with above: * 5) the reloption parallel_dml_enabled is set to off Regards Huang > -Original Message- > From: Hou, Zhijie > Sent: Wednesday, February 3, 2021 9:01 AM > To: Greg Nancarrow > Cc: Amit Kapila ; PostgreSQL Hackers > ; vignesh C ; > Amit Langote ; David Rowley > ; Tom Lane ; Tsunakawa, > Takayuki/綱川 貴之 > Subject: RE: Determine parallel-safety of partition relations for Inserts > > Hi, > > Attaching v5 patches with the changes: > * rebase the code on the greg's latest parallel insert patch > * fix some code style. > > Please consider it for further review. > > Best regards, > Houzj > > > >
RE: Determine parallel-safety of partition relations for Inserts
> For v3_0003-reloption-parallel_dml-src.patch : > + table_close(rel, NoLock); > Since the rel would always be closed, it seems the return value from > RelationGetParallelDML() can be assigned to a variable, followed by call to > table_close(), then the return statement. Thanks for the comment. Fixed in the latest patch. Best regards, houzj
RE: Determine parallel-safety of partition relations for Inserts
Hi, Attaching v5 patches with the changes: * rebase the code on the greg's latest parallel insert patch * fix some code style. Please consider it for further review. Best regards, Houzj v5_0004-reloption-parallel_dml-test-and-doc.patch Description: v5_0004-reloption-parallel_dml-test-and-doc.patch v5_0001-guc-option-enable_parallel_dml-src.patch Description: v5_0001-guc-option-enable_parallel_dml-src.patch v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v5_0002-guc-option-enable_parallel_dml-doc-and-test.patch v5_0003-reloption-parallel_dml-src.patch.patch Description: v5_0003-reloption-parallel_dml-src.patch.patch
RE: Determine parallel-safety of partition relations for Inserts
> > IMO, It seems more readable to extract all the check that we can do > > before the safety-check and put them in the new function. > > > > Please consider it for further review. > > > > I've updated your v2 patches and altered some comments and documentation > changes (but made no code changes) - please compare against your v2 patches, > and see whether you agree with the changes to the wording. > In the documentation, you will also notice that in your V2 patch, it says > that the "parallel_dml_enabled" table option defaults to false. > As it actually defaults to true, I changed that in the documentation too. Hi greg, Thanks a lot for the document update, LGTM. Attaching v4 patches with the changes: * fix some typos in the code. * store partitioned reloption in a separate struct PartitionedOptions, making it more standard and easier to expand in the future. Please consider it for further review. Best regards, Houzj v4_0001-guc-option-enable_parallel_dml-src.patch Description: v4_0001-guc-option-enable_parallel_dml-src.patch v4_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v4_0002-guc-option-enable_parallel_dml-doc-and-test.patch v4_0003-reloption-parallel_dml-src.patch.patch Description: v4_0003-reloption-parallel_dml-src.patch.patch v4_0004-reloption-parallel_dml-test-and-doc.patch Description: v4_0004-reloption-parallel_dml-test-and-doc.patch
Re: Determine parallel-safety of partition relations for Inserts
Hi, For v3_0003-reloption-parallel_dml-src.patch : +* Check if parallel_dml_enabled is enabled for the target table, +* if not, skip the safety checks and return PARALLEL_UNSAFE. Looks like the return value is true / false. So the above comment should be adjusted. + if (!RelationGetParallelDML(rel, true)) + { + table_close(rel, NoLock); + return false; + } + + table_close(rel, NoLock); Since the rel would always be closed, it seems the return value from RelationGetParallelDML() can be assigned to a variable, followed by call to table_close(), then the return statement. Cheers On Mon, Feb 1, 2021 at 3:56 PM Greg Nancarrow wrote: > On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie > wrote: > > > > Attatching v2 patch which addressed the comments above. > > > > Some further refactor: > > > > Introducing a new function is_parallel_possible_for_modify() which > decide whether to do safety check. > > > > IMO, It seems more readable to extract all the check that we can do > before the safety-check and put them > > in the new function. > > > > Please consider it for further review. > > > > I've updated your v2 patches and altered some comments and > documentation changes (but made no code changes) - please compare > against your v2 patches, and see whether you agree with the changes to > the wording. > In the documentation, you will also notice that in your V2 patch, it > says that the "parallel_dml_enabled" table option defaults to false. > As it actually defaults to true, I changed that in the documentation > too. > > Regards, > Greg Nancarrow > Fujitsu Australia >
Re: Determine parallel-safety of partition relations for Inserts
On Mon, Feb 1, 2021 at 4:02 PM Hou, Zhijie wrote: > > Attatching v2 patch which addressed the comments above. > > Some further refactor: > > Introducing a new function is_parallel_possible_for_modify() which decide > whether to do safety check. > > IMO, It seems more readable to extract all the check that we can do before > the safety-check and put them > in the new function. > > Please consider it for further review. > I've updated your v2 patches and altered some comments and documentation changes (but made no code changes) - please compare against your v2 patches, and see whether you agree with the changes to the wording. In the documentation, you will also notice that in your V2 patch, it says that the "parallel_dml_enabled" table option defaults to false. As it actually defaults to true, I changed that in the documentation too. Regards, Greg Nancarrow Fujitsu Australia v3_0004-reloption-parallel_dml-test-and-doc.patch Description: Binary data v3_0001-guc-option-enable_parallel_dml-src.patch Description: Binary data v3_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: Binary data v3_0003-reloption-parallel_dml-src.patch Description: Binary data
RE: Determine parallel-safety of partition relations for Inserts
Hi greg, Thanks for the review ! > Personally, I think a table's "parallel_dml" option should be ON by default. > It's annoying to have to separately enable it for each and every table being > used, when I think the need to turn it selectively OFF should be fairly > rare. Yes, I agreed. Changed. > And I'm not sure that "parallel_dml" is the best name for the table option > - because it sort of implies parallel dml WILL be used - but that isn't > true, it depends on other factors too. > So I think (to be consistent with other table option naming) it would have > to be something like "parallel_dml_enabled". Agreed. Changed to parallel_dml_enabled. Attatching v2 patch which addressed the comments above. Some further refactor: Introducing a new function is_parallel_possible_for_modify() which decide whether to do safety check. IMO, It seems more readable to extract all the check that we can do before the safety-check and put them in the new function. Please consider it for further review. Best regards, houzj v2_0003-reloption-parallel_dml-src.patch Description: v2_0003-reloption-parallel_dml-src.patch v2_0004-reloption-parallel_dml-test-and-doc.patch Description: v2_0004-reloption-parallel_dml-test-and-doc.patch v2_0001-guc-option-enable_parallel_dml-src.patch Description: v2_0001-guc-option-enable_parallel_dml-src.patch v2_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v2_0002-guc-option-enable_parallel_dml-doc-and-test.patch
Re: Determine parallel-safety of partition relations for Inserts
On Fri, Jan 29, 2021 at 5:44 PM Hou, Zhijie wrote: > > > Attatching v1 patches, introducing options which let user manually control > whether to use parallel dml. > > About the patch: > 1. add a new guc option: enable_parallel_dml (boolean) > 2. add a new tableoption: parallel_dml (boolean) > >The default of both is off(false). > > User can set enable_parallel_dml in postgresql.conf or seesion to enable > parallel dml. > If user want to choose some specific to use parallel insert, they can set > table.parallel_dml to on. > > Some attention: > (1) > Currently if guc option enable_parallel_dml is set to on but table's > parallel_dml is off, > planner still do not consider parallel for dml. > > In this way, If user want to use parallel dml , they have to set > enable_parallel_dml=on and set parallel_dml = on. > If someone dislike this, I think we can also let tableoption. parallel_dml's > default value to on ,with this user only need > to set enable_parallel_dml=on > > (2) > For the parallel_dml. > If target table is partitioned, and it's parallel_dml is set to on, planner > will ignore > The option value of it's child. > This is beacause we can not divide dml plan to separate table, so we just > check the target table itself. > > > Thoughts and comments are welcome. > Personally, I think a table's "parallel_dml" option should be ON by default. It's annoying to have to separately enable it for each and every table being used, when I think the need to turn it selectively OFF should be fairly rare. And I'm not sure that "parallel_dml" is the best name for the table option - because it sort of implies parallel dml WILL be used - but that isn't true, it depends on other factors too. So I think (to be consistent with other table option naming) it would have to be something like "parallel_dml_enabled". I'm still looking at the patches, but have so far noticed that there are some issues in the documentation updates (grammatical issues and consistency issues with current documentation), and also some of the explanations are not clear. I guess I can address these separately. Regards, Greg Nancarrow Fujitsu Australia
RE: Determine parallel-safety of partition relations for Inserts
> Hi, > > Attatching v1 patches, introducing options which let user manually control > whether to use parallel dml. > > About the patch: > 1. add a new guc option: enable_parallel_dml (boolean) 2. add a new > tableoption: parallel_dml (boolean) > >The default of both is off(false). > > User can set enable_parallel_dml in postgresql.conf or seesion to enable > parallel dml. > If user want to choose some specific to use parallel insert, they can set > table.parallel_dml to on. > > Some attention: > (1) > Currently if guc option enable_parallel_dml is set to on but table's > parallel_dml is off, planner still do not consider parallel for dml. > > In this way, If user want to use parallel dml , they have to set > enable_parallel_dml=on and set parallel_dml = on. > If someone dislike this, I think we can also let tableoption. parallel_dml's > default value to on ,with this user only need to set enable_parallel_dml=on > > (2) > For the parallel_dml. > If target table is partitioned, and it's parallel_dml is set to on, planner > will ignore The option value of it's child. > This is beacause we can not divide dml plan to separate table, so we just > check the target table itself. > > > Thoughts and comments are welcome. The patch is based on v13 patch(parallel insert select) in [1] [1] https://www.postgresql.org/message-id/CAJcOf-ejV8iU%2BYpuV4qbYEY-2vCG1QF2g3Gxn%3DZ%2BPyUH_5f84A%40mail.gmail.com Best regards, houzj
RE: Determine parallel-safety of partition relations for Inserts
Hi, Attatching v1 patches, introducing options which let user manually control whether to use parallel dml. About the patch: 1. add a new guc option: enable_parallel_dml (boolean) 2. add a new tableoption: parallel_dml (boolean) The default of both is off(false). User can set enable_parallel_dml in postgresql.conf or seesion to enable parallel dml. If user want to choose some specific to use parallel insert, they can set table.parallel_dml to on. Some attention: (1) Currently if guc option enable_parallel_dml is set to on but table's parallel_dml is off, planner still do not consider parallel for dml. In this way, If user want to use parallel dml , they have to set enable_parallel_dml=on and set parallel_dml = on. If someone dislike this, I think we can also let tableoption. parallel_dml's default value to on ,with this user only need to set enable_parallel_dml=on (2) For the parallel_dml. If target table is partitioned, and it's parallel_dml is set to on, planner will ignore The option value of it's child. This is beacause we can not divide dml plan to separate table, so we just check the target table itself. Thoughts and comments are welcome. Best regards, houzj v1_0004-reloption-parallel_dml-test-and-doc.patch Description: v1_0004-reloption-parallel_dml-test-and-doc.patch v1_0001-guc-option-enable_parallel_dml-src.patch Description: v1_0001-guc-option-enable_parallel_dml-src.patch v1_0002-guc-option-enable_parallel_dml-doc-and-test.patch Description: v1_0002-guc-option-enable_parallel_dml-doc-and-test.patch v1_0003-reloption-parallel_dml-src.patch Description: v1_0003-reloption-parallel_dml-src.patch
RE: Determine parallel-safety of partition relations for Inserts
> > Hi > > > > I have an issue about the parameter for DML. > > > > If we define the parameter as a tableoption. > > > > For a partitioned table, if we set the parent table's parallel_dml=on, > > and set one of its partition parallel_dml=off, it seems we can not divide > the plan for the separate table. > > > > For this case, should we just check the parent's parameter ? > > > > I think so. IIUC, this means the Inserts where target table is parent table, > those will just check the option of the parent table and then ignore the > option value for child tables whereas we will consider childrel's option > for Inserts where target table is one of the child table, right? > Yes, I think we can just check the target table itself. Best regards, houzj
Re: Determine parallel-safety of partition relations for Inserts
On Thu, Jan 28, 2021 at 5:00 PM Hou, Zhijie wrote: > > > From: Amit Kapila > > > Good question. I think if we choose to have a separate parameter for > > > DML, it can probably a boolean to just indicate whether to enable > > > parallel DML for a specified table and use the parallel_workers > > > specified in the table used in SELECT. > > > > Agreed. > > Hi > > I have an issue about the parameter for DML. > > If we define the parameter as a tableoption. > > For a partitioned table, if we set the parent table's parallel_dml=on, > and set one of its partition parallel_dml=off, it seems we can not divide the > plan for the separate table. > > For this case, should we just check the parent's parameter ? > I think so. IIUC, this means the Inserts where target table is parent table, those will just check the option of the parent table and then ignore the option value for child tables whereas we will consider childrel's option for Inserts where target table is one of the child table, right? -- With Regards, Amit Kapila.
RE: Determine parallel-safety of partition relations for Inserts
> From: Amit Kapila > > Good question. I think if we choose to have a separate parameter for > > DML, it can probably a boolean to just indicate whether to enable > > parallel DML for a specified table and use the parallel_workers > > specified in the table used in SELECT. > > Agreed. Hi I have an issue about the parameter for DML. If we define the parameter as a tableoption. For a partitioned table, if we set the parent table's parallel_dml=on, and set one of its partition parallel_dml=off, it seems we can not divide the plan for the separate table. For this case, should we just check the parent's parameter ? Best regards, houzj
RE: Determine parallel-safety of partition relations for Inserts
From: Amit Kapila > Good question. I think if we choose to have a separate parameter for > DML, it can probably a boolean to just indicate whether to enable > parallel DML for a specified table and use the parallel_workers > specified in the table used in SELECT. Agreed. Regards Takayuki Tsunakawa
Re: Determine parallel-safety of partition relations for Inserts
On Mon, Jan 18, 2021 at 10:27 AM tsunakawa.ta...@fujitsu.com wrote: > > From: Amit Kapila > > We already allow users to specify the degree of parallelism for all > > the parallel operations via guc's max_parallel_maintenance_workers, > > max_parallel_workers_per_gather, then we have a reloption > > parallel_workers and vacuum command has the parallel option where > > users can specify the number of workers that can be used for > > parallelism. The parallelism considers these as hints but decides > > parallelism based on some other parameters like if there are that many > > workers available, etc. Why the users would expect differently for > > parallel DML? > > I agree that the user would want to specify the degree of parallelism of DML, > too. My simple (probably silly) question was, in INSERT SELECT, > > * If the target table has 10 partitions and the source table has 100 > partitions, how would the user want to specify parameters? > > * If the source and target tables have the same number of partitions, and the > user specified different values to parallel_workers and parallel_dml_workers, > how many parallel workers would run? > Good question. I think if we choose to have a separate parameter for DML, it can probably a boolean to just indicate whether to enable parallel DML for a specified table and use the parallel_workers specified in the table used in SELECT. -- With Regards, Amit Kapila.
RE: Determine parallel-safety of partition relations for Inserts
From: Amit Kapila > We already allow users to specify the degree of parallelism for all > the parallel operations via guc's max_parallel_maintenance_workers, > max_parallel_workers_per_gather, then we have a reloption > parallel_workers and vacuum command has the parallel option where > users can specify the number of workers that can be used for > parallelism. The parallelism considers these as hints but decides > parallelism based on some other parameters like if there are that many > workers available, etc. Why the users would expect differently for > parallel DML? I agree that the user would want to specify the degree of parallelism of DML, too. My simple (probably silly) question was, in INSERT SELECT, * If the target table has 10 partitions and the source table has 100 partitions, how would the user want to specify parameters? * If the source and target tables have the same number of partitions, and the user specified different values to parallel_workers and parallel_dml_workers, how many parallel workers would run? * What would the query plan be like? Something like below? Can we easily support this sort of nested thing? Gather Workers Planned: Insert Gather Workers Planned: Parallel Seq Scan > Which memory specific to partitions are you referring to here and does > that apply to the patch being discussed? Relation cache and catalog cache, which are not specific to partitions. This patch's current parallel safety check opens and closes all descendant partitions of the target table. That leaves those cache entries in CacheMemoryContext after the SQL statement ends. But as I said, we can consider it's not a serious problem in this case because the parallel DML would be executed in limited number of concurrent sessions. I just touched on the memory consumption issue for completeness in comparison with (3). Regards Takayuki Tsunakawa
Re: Determine parallel-safety of partition relations for Inserts
On Mon, Jan 18, 2021 at 6:08 AM tsunakawa.ta...@fujitsu.com wrote: > > From: Amit Kapila > > I think it would be good if the parallelism works by default when > > required but I guess if we want to use something on these lines then > > we can always check if the parallel_workers option is non-zero for a > > relation (with RelationGetParallelWorkers). So users can always say > > Alter Table Set (parallel_workers = 0) if they don't want > > to enable write parallelism for tbl and if someone is bothered that > > this might impact Selects as well because the same option is used to > > compute the number of workers for it then we can invent a second > > option parallel_dml_workers or something like that. > > Yes, if we have to require some specification to enable parallel DML, I agree > that parallel query and parall DML can be separately enabled. With that > said, I'm not sure if the user, and PG developers, want to allow specifying > degree of parallelism for DML. > We already allow users to specify the degree of parallelism for all the parallel operations via guc's max_parallel_maintenance_workers, max_parallel_workers_per_gather, then we have a reloption parallel_workers and vacuum command has the parallel option where users can specify the number of workers that can be used for parallelism. The parallelism considers these as hints but decides parallelism based on some other parameters like if there are that many workers available, etc. Why the users would expect differently for parallel DML? > > > > As an aside, (1) and (2) has a potential problem with memory consumption. > > > > > > > I can see the memory consumption argument for (2) because we might end > > up generating parallel paths (partial paths) for reading the table but > > don't see how it applies to (1)? > > I assumed that we would still open all partitions for parallel safety check > in (1) and (2). In (1), parallel safety check is done only when parallel DML > is explicitly enabled by the user. Just opening partitions keeps > CacheMemoryContext bloated even after they are closed. > Which memory specific to partitions are you referring to here and does that apply to the patch being discussed? -- With Regards, Amit Kapila.
Re: Determine parallel-safety of partition relations for Inserts
On Sun, Jan 17, 2021 at 4:45 PM Amit Langote wrote: > > On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila wrote: > > On Fri, Jan 15, 2021 at 6:45 PM Amit Langote > > wrote: > > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila > > > wrote: > > > > We want to do this for Inserts where only Select can be parallel and > > > > Inserts will always be done by the leader backend. This is actually > > > > the case we first want to implement. > > > > > > Sorry, I haven't looked at the linked threads and the latest patches > > > there closely enough yet, so I may be misreading this, but if the > > > inserts will always be done by the leader backend as you say, then why > > > does the planner need to be checking the parallel safety of the > > > *target* table's expressions? > > > > > > > The reason is that once we enter parallel-mode we can't allow > > parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We > > enter the parallel-mode at the beginning of the statement execution, > > see ExecutePlan(). So, the Insert will be performed in parallel-mode > > even though it happens in the leader backend. It is not possible that > > we finish getting all the tuples from the gather node first and then > > start inserting. Even, if we somehow find something to make this work > > anyway the checks being discussed will be required to make inserts > > parallel (where inserts will be performed by workers) which is > > actually the next patch in the thread I mentioned in the previous > > email. > > > > Does this answer your question? > > Yes, thanks for the explanation. I kind of figured that doing the > insert part itself in parallel using workers would be a part of the > end goal of this work, although that didn't come across immediately. > > It's a bit unfortunate that the parallel safety check of the > individual partitions cannot be deferred until it's known that a given > partition will be affected by the command at all. Will we need > fundamental changes to how parallel query works to make that possible? > If so, have such options been considered in these projects? > I think it is quite fundamental to how parallel query works and we might not be able to change it due to various reasons like (a) it will end up generating a lot of paths in optimizer when it is not safe to do so and in the end, we won't use it. (b) If after inserting into a few partitions we came to know that the next partition we are going to insert has some parallel-unsafe constraints then we need to give up the execution and restart the statement by again trying to first plan it by having non-parallel paths. Now, we can optimize this by retaining both parallel and non-parallel plans such that if we fail to execute parallel-plan we can use a non-parallel plan to execute the statement but still that doesn't seem like an advisable approach. The extra time spent in optimizer will pay-off well by the parallel execution. As pointer earlier, you can see one of the results shared on the other thread [1]. The cases where it might not get the benefit (say when the underlying plan is non-parallel) can have some impact but still, we have not tested that in detail. The ideas we have discussed so far to address that are (a) postpone parallel-safety checks for partitions till there are some underneath partial paths (from which parallel paths can be generated) but that has some down-side in that we will end up generating partial paths when that is really not required, (b) have a rel option like parallel_dml_workers or use existing option parallel_workers to allow considering parallel insert for a relation. Any better ideas? > If such > changes are not possible in the short term, like for v14, we should at > least try to make sure that the eager checking of all partitions is > only performed if using parallelism is possible at all. > As of now, we do first check if it is safe to generate a parallel plan for underlying select (in Insert into Select ..) and then perform parallel-safety checks for the DML. We can postpone it further as suggested above in (a). > I will try to take a look at the patches themselves to see if there's > something I know that will help. > Thank you. It will be really helpful if you can do that. [1] - https://www.postgresql.org/message-id/b54f2e306780449093c38cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local -- With Regards, Amit Kapila.
RE: Determine parallel-safety of partition relations for Inserts
From: Amit Kapila > I think it would be good if the parallelism works by default when > required but I guess if we want to use something on these lines then > we can always check if the parallel_workers option is non-zero for a > relation (with RelationGetParallelWorkers). So users can always say > Alter Table Set (parallel_workers = 0) if they don't want > to enable write parallelism for tbl and if someone is bothered that > this might impact Selects as well because the same option is used to > compute the number of workers for it then we can invent a second > option parallel_dml_workers or something like that. Yes, if we have to require some specification to enable parallel DML, I agree that parallel query and parall DML can be separately enabled. With that said, I'm not sure if the user, and PG developers, want to allow specifying degree of parallelism for DML. > > As an aside, (1) and (2) has a potential problem with memory consumption. > > > > I can see the memory consumption argument for (2) because we might end > up generating parallel paths (partial paths) for reading the table but > don't see how it applies to (1)? I assumed that we would still open all partitions for parallel safety check in (1) and (2). In (1), parallel safety check is done only when parallel DML is explicitly enabled by the user. Just opening partitions keeps CacheMemoryContext bloated even after they are closed. Regards Takayuki Tsunakawa
Re: Determine parallel-safety of partition relations for Inserts
On Sat, Jan 16, 2021 at 2:02 PM Amit Kapila wrote: > On Fri, Jan 15, 2021 at 6:45 PM Amit Langote wrote: > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila wrote: > > > We want to do this for Inserts where only Select can be parallel and > > > Inserts will always be done by the leader backend. This is actually > > > the case we first want to implement. > > > > Sorry, I haven't looked at the linked threads and the latest patches > > there closely enough yet, so I may be misreading this, but if the > > inserts will always be done by the leader backend as you say, then why > > does the planner need to be checking the parallel safety of the > > *target* table's expressions? > > > > The reason is that once we enter parallel-mode we can't allow > parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We > enter the parallel-mode at the beginning of the statement execution, > see ExecutePlan(). So, the Insert will be performed in parallel-mode > even though it happens in the leader backend. It is not possible that > we finish getting all the tuples from the gather node first and then > start inserting. Even, if we somehow find something to make this work > anyway the checks being discussed will be required to make inserts > parallel (where inserts will be performed by workers) which is > actually the next patch in the thread I mentioned in the previous > email. > > Does this answer your question? Yes, thanks for the explanation. I kind of figured that doing the insert part itself in parallel using workers would be a part of the end goal of this work, although that didn't come across immediately. It's a bit unfortunate that the parallel safety check of the individual partitions cannot be deferred until it's known that a given partition will be affected by the command at all. Will we need fundamental changes to how parallel query works to make that possible? If so, have such options been considered in these projects? If such changes are not possible in the short term, like for v14, we should at least try to make sure that the eager checking of all partitions is only performed if using parallelism is possible at all. I will try to take a look at the patches themselves to see if there's something I know that will help. -- Amit Langote EDB: http://www.enterprisedb.com
Re: Determine parallel-safety of partition relations for Inserts
On Fri, Jan 15, 2021 at 7:35 PM tsunakawa.ta...@fujitsu.com wrote: > > From: Amit Kapila > > This will surely increase planning time but the execution is reduced > > to an extent due to parallelism that it won't matter for either of the > > cases if we see just total time. For example, see the latest results > > for parallel inserts posted by Haiying Tang [3]. There might be an > > impact when Selects can't be parallelized due to the small size of the > > Select-table but we still have to traverse all the partitions to > > determine parallel-safety but not sure how much it is compared to > > overall time. I guess we need to find the same but apart from that can > > anyone think of a better way to determine parallel-safety of > > partitioned relation for Inserts? > > Three solutions(?) quickly come to my mind: > > > (1) Have the user specify whether they want to parallelize DML > Oracle [1] and SQL Server [2] take this approach. Oracle disables parallel > DML execution by default. The reason is described as "This mode is required > because parallel DML and serial DML have different locking, transaction, and > disk space requirements and parallel DML is disabled for a session by > default." To enable parallel DML in a session or in a specific statement, > you need to run either of the following: > > ALTER SESSION ENABLE PARALLEL DML; > INSERT /*+ ENABLE_PARALLEL_DML */ … > > Besides, the user has to specify a parallel hint in a DML statement, or > specify the parallel attribute in CREATE or ALTER TABLE. > > SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT > statement like this: > > INSERT INTO Sales.SalesHistory WITH (TABLOCK) (target columns...) SELECT > ...; > I think it would be good if the parallelism works by default when required but I guess if we want to use something on these lines then we can always check if the parallel_workers option is non-zero for a relation (with RelationGetParallelWorkers). So users can always say Alter Table Set (parallel_workers = 0) if they don't want to enable write parallelism for tbl and if someone is bothered that this might impact Selects as well because the same option is used to compute the number of workers for it then we can invent a second option parallel_dml_workers or something like that. > > (2) Postpone the parallel safety check after the planner finds a worthwhile > parallel query plan > I'm not sure if the current planner code allows this easily... > I think it is possible but it has a bit of disadvantage as well as mentioned in response to Ashutosh's email [1]. > > (3) Record the parallel safety in system catalog > Add a column like relparallel in pg_class that indicates the parallel safety > of the relation. planner just checks the value instead of doing heavy work > for every SQL statement. That column's value is modified whenever a relation > alteration is made that affects the parallel safety, such as adding a domain > column and CHECK constraint. In case of a partitioned relation, the parallel > safety attributes of all its descendant relations are merged. For example, > if a partition becomes parallel-unsafe, the ascendant partitioned tables also > become parallel-unsafe. > > But... developing such code would be burdonsome and bug-prone? > > > I'm inclined to propose (1). Parallel DML would be something that a limited > people run in limited circumstances (data loading in data warehouse and batch > processing in OLTP systems by the DBA or data administrator), so I think it's > legitimate to require explicit specification of parallelism. > > As an aside, (1) and (2) has a potential problem with memory consumption. > I can see the memory consumption argument for (2) because we might end up generating parallel paths (partial paths) for reading the table but don't see how it applies to (1)? [1] - https://www.postgresql.org/message-id/CAA4eK1J80Rzn4M-A5sfkmJ8NjgTxbaC8UWVaNHK6%2B2BCYYv2Nw%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Determine parallel-safety of partition relations for Inserts
On Fri, Jan 15, 2021 at 6:45 PM Amit Langote wrote: > > On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila wrote: > > We want to do this for Inserts where only Select can be parallel and > > Inserts will always be done by the leader backend. This is actually > > the case we first want to implement. > > Sorry, I haven't looked at the linked threads and the latest patches > there closely enough yet, so I may be misreading this, but if the > inserts will always be done by the leader backend as you say, then why > does the planner need to be checking the parallel safety of the > *target* table's expressions? > The reason is that once we enter parallel-mode we can't allow parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We enter the parallel-mode at the beginning of the statement execution, see ExecutePlan(). So, the Insert will be performed in parallel-mode even though it happens in the leader backend. It is not possible that we finish getting all the tuples from the gather node first and then start inserting. Even, if we somehow find something to make this work anyway the checks being discussed will be required to make inserts parallel (where inserts will be performed by workers) which is actually the next patch in the thread I mentioned in the previous email. Does this answer your question? -- With Regards, Amit Kapila.
RE: Determine parallel-safety of partition relations for Inserts
From: Amit Langote > Sorry, I haven't looked at the linked threads and the latest patches > there closely enough yet, so I may be misreading this, but if the > inserts will always be done by the leader backend as you say, then why > does the planner need to be checking the parallel safety of the > *target* table's expressions? Yeah, I also wanted to confirm this next - that is, whether the current patch allows the SELECT operation to execute in parallel but the INSERT operation serially. Oracle allows it; it even allows the user to specify a hint after INSERT and SELECT separately. Even if INSERT in INSERT SELECT can't be run in parallel, it is useful for producing summary data, such as aggregating large amounts of IoT sensor data in parallel and inserting the small amount of summary data serially. Regards Takayuki Tsunakawa
RE: Determine parallel-safety of partition relations for Inserts
From: Amit Kapila > This will surely increase planning time but the execution is reduced > to an extent due to parallelism that it won't matter for either of the > cases if we see just total time. For example, see the latest results > for parallel inserts posted by Haiying Tang [3]. There might be an > impact when Selects can't be parallelized due to the small size of the > Select-table but we still have to traverse all the partitions to > determine parallel-safety but not sure how much it is compared to > overall time. I guess we need to find the same but apart from that can > anyone think of a better way to determine parallel-safety of > partitioned relation for Inserts? Three solutions(?) quickly come to my mind: (1) Have the user specify whether they want to parallelize DML Oracle [1] and SQL Server [2] take this approach. Oracle disables parallel DML execution by default. The reason is described as "This mode is required because parallel DML and serial DML have different locking, transaction, and disk space requirements and parallel DML is disabled for a session by default." To enable parallel DML in a session or in a specific statement, you need to run either of the following: ALTER SESSION ENABLE PARALLEL DML; INSERT /*+ ENABLE_PARALLEL_DML */ … Besides, the user has to specify a parallel hint in a DML statement, or specify the parallel attribute in CREATE or ALTER TABLE. SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT statement like this: INSERT INTO Sales.SalesHistory WITH (TABLOCK) (target columns...) SELECT ...; (2) Postpone the parallel safety check after the planner finds a worthwhile parallel query plan I'm not sure if the current planner code allows this easily... (3) Record the parallel safety in system catalog Add a column like relparallel in pg_class that indicates the parallel safety of the relation. planner just checks the value instead of doing heavy work for every SQL statement. That column's value is modified whenever a relation alteration is made that affects the parallel safety, such as adding a domain column and CHECK constraint. In case of a partitioned relation, the parallel safety attributes of all its descendant relations are merged. For example, if a partition becomes parallel-unsafe, the ascendant partitioned tables also become parallel-unsafe. But... developing such code would be burdonsome and bug-prone? I'm inclined to propose (1). Parallel DML would be something that a limited people run in limited circumstances (data loading in data warehouse and batch processing in OLTP systems by the DBA or data administrator), so I think it's legitimate to require explicit specification of parallelism. As an aside, (1) and (2) has a potential problem with memory consumption. Opening relations bloat CacheMemoryContext with relcaches and catcaches, and closing relations does not free the (all) memory. But I don't think it could really become a problem in practice, because parallel DML would be run in limited number of concurrent sessions. [1] Types of Parallelism https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D8290A02-BE5F-436A-B814-D6FD71CEE81F [2] INSERT (Transact-SQL) https://docs.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver15#best-practices Regards Takayuki Tsunakawa
Re: Determine parallel-safety of partition relations for Inserts
On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila wrote: > We want to do this for Inserts where only Select can be parallel and > Inserts will always be done by the leader backend. This is actually > the case we first want to implement. Sorry, I haven't looked at the linked threads and the latest patches there closely enough yet, so I may be misreading this, but if the inserts will always be done by the leader backend as you say, then why does the planner need to be checking the parallel safety of the *target* table's expressions? -- Amit Langote EDB: http://www.enterprisedb.com
Re: Determine parallel-safety of partition relations for Inserts
On Fri, Jan 15, 2021 at 5:53 PM Ashutosh Bapat wrote: > > On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila wrote: > > > > While reviewing parallel insert [1] (Insert into Select) and > > parallel copy patches [2], it came to my notice that both the patches > > traverse the entire partition hierarchy to determine parallel-safety > > of partitioned relations. This is required because before considering > > the Insert or Copy can be considered for parallelism, we need to > > determine whether it is safe to do so. We need to check for each > > partition because any of the partitions can have some parallel-unsafe > > index expression, constraint, etc. We do a similar thing for Selects > > in standard_planner. > > > > The plain Select case for partitioned tables was simpler because we > > anyway loop through all the partitions in set_append_rel_size() and we > > determine parallel-safety of each partition at that time but the same > > is not true for Inserts. > > > > For Inserts, currently, we only open the partition table when we are > > about to insert into that partition. During ExecInsert, we find out > > the partition matching the partition-key value and then lock if it is > > not already locked. In this patch, we need to open each partition at > > the planning time to determine its parallel-safety. > > We don't want to open the partitions where no rows will be inserted. > > > > > This will surely increase planning time but the execution is reduced > > to an extent due to parallelism that it won't matter for either of the > > cases if we see just total time. For example, see the latest results > > for parallel inserts posted by Haiying Tang [3]. There might be an > > impact when Selects can't be parallelized due to the small size of the > > Select-table but we still have to traverse all the partitions to > > determine parallel-safety but not sure how much it is compared to > > overall time. I guess we need to find the same but apart from that can > > anyone think of a better way to determine parallel-safety of > > partitioned relation for Inserts? > > In case of SELECT we open only those partitions which surive pruning. > So those are the ones which will definitely required to be scanned. We > perform parallelism checks only on those partitions. The actual check > isn't much costly. > > > > > Thoughts? > > > > Note: I have kept a few people in Cc who are either directly involved > > in this work or work regularly in the partitioning related work just > > in the hope that might help in moving the discussion forward. > > Since you brought up comparison between SELECT and INSERT, "pruning" > partitions based on the values being INSERTed might help. It should be > doable in case of INSERT ... SELECT where we need to prune partitions > based on the clauses of SELECT. Doable with some little effort in case > of VALUEs and COPY. > I don't think we should try pruning in this patch as that is a separate topic and even after pruning the same problem can happen when we are not able to prune partitions in the table where we want to Insert. > Second possibility is to open partitions only when the estimated > number of rows to be inserted goes beyond a certain value. > Yeah, this option has merits in the sense that we will pay the cost to traverse partitions only when the parallel plan is possible. If you see the 0001 patch in email [1], it tries to determine parallel-safety (which in turn will traverse all partition tables) in standard_planner where we determine the parallel-safety for the Query tree. Now, if we have to postpone it for partitioned tables then we need to determine it at the places where we create_modifytable_path if the current_rel->pathlist has some parallel_safe paths. And which will also mean that we need to postpone generating gather node as well till that time because Insert can be parallel-unsafe. This will have one disadvantage over the current approach implemented by the patch which is that we might end up spending a lot of time in the optimizer to create partial paths and later (while determining parallel-safety of Insert) find that none of them is required. If we decide to postpone the parallel-safety checking for Inserts then why not we check parallel-safety for all sorts of Inserts at that point. That can at least simplify the patch. > Third idea is to use something similar to parallel append where > individual partitions are scanned sequentially but multiple partitions > are scanned in parallel. When a row is inserted into a non-yet-opened > partition, allocate one/more backends to insert into partitions which > do not allow parallelism, otherwise continue to use a common pool of > parallel workers for insertion. This means the same thread performing > select may not perform insert. So some complications will be involved. > We want to do this for Inserts where only Select can be parallel and Inserts will always be done by the leader backend. This is actually the case we first want t
Re: Determine parallel-safety of partition relations for Inserts
On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila wrote: > > While reviewing parallel insert [1] (Insert into Select) and > parallel copy patches [2], it came to my notice that both the patches > traverse the entire partition hierarchy to determine parallel-safety > of partitioned relations. This is required because before considering > the Insert or Copy can be considered for parallelism, we need to > determine whether it is safe to do so. We need to check for each > partition because any of the partitions can have some parallel-unsafe > index expression, constraint, etc. We do a similar thing for Selects > in standard_planner. > > The plain Select case for partitioned tables was simpler because we > anyway loop through all the partitions in set_append_rel_size() and we > determine parallel-safety of each partition at that time but the same > is not true for Inserts. > > For Inserts, currently, we only open the partition table when we are > about to insert into that partition. During ExecInsert, we find out > the partition matching the partition-key value and then lock if it is > not already locked. In this patch, we need to open each partition at > the planning time to determine its parallel-safety. We don't want to open the partitions where no rows will be inserted. > > This will surely increase planning time but the execution is reduced > to an extent due to parallelism that it won't matter for either of the > cases if we see just total time. For example, see the latest results > for parallel inserts posted by Haiying Tang [3]. There might be an > impact when Selects can't be parallelized due to the small size of the > Select-table but we still have to traverse all the partitions to > determine parallel-safety but not sure how much it is compared to > overall time. I guess we need to find the same but apart from that can > anyone think of a better way to determine parallel-safety of > partitioned relation for Inserts? In case of SELECT we open only those partitions which surive pruning. So those are the ones which will definitely required to be scanned. We perform parallelism checks only on those partitions. The actual check isn't much costly. > > Thoughts? > > Note: I have kept a few people in Cc who are either directly involved > in this work or work regularly in the partitioning related work just > in the hope that might help in moving the discussion forward. Since you brought up comparison between SELECT and INSERT, "pruning" partitions based on the values being INSERTed might help. It should be doable in case of INSERT ... SELECT where we need to prune partitions based on the clauses of SELECT. Doable with some little effort in case of VALUEs and COPY. Second possibility is to open partitions only when the estimated number of rows to be inserted goes beyond a certain value. Third idea is to use something similar to parallel append where individual partitions are scanned sequentially but multiple partitions are scanned in parallel. When a row is inserted into a non-yet-opened partition, allocate one/more backends to insert into partitions which do not allow parallelism, otherwise continue to use a common pool of parallel workers for insertion. This means the same thread performing select may not perform insert. So some complications will be involved. -- Best Wishes, Ashutosh Bapat