RE: Parallel INSERT (INTO ... SELECT ...)

tsunakawa.ta...@fujitsu.com Thu, 18 Feb 2021 20:43:25 -0800

From: Greg Nancarrow <gregn4...@gmail.com> 
--------------------------------------------------
On Mon, Jan 25, 2021 at 10:23 AM tsunakawa.ta...@fujitsu.com 
<tsunakawa.ta...@fujitsu.com> wrote:
> (8)
> +               /*
> +                * If the trigger type is RI_TRIGGER_FK, this indicates a FK 
> exists in
> +                * the relation, and this would result in creation of new 
> CommandIds
> +                * on insert/update/delete and this isn't supported in a 
> parallel
> +                * worker (but is safe in the parallel leader).
> +                */
> +               trigtype = RI_FKey_trigger_type(trigger->tgfoid);
> +               if (trigtype == RI_TRIGGER_FK)
> +               {
> +                       if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, 
> context))
> +                               return true;
> +               }
>
> Here, RI_TRIGGER_FK should instead be RI_TRIGGER_PK, because RI_TRIGGER_FK 
> triggers do not generate command IDs.  See RI_FKey_check() which is called in 
> RI_TRIGGER_FK case.  In there, ri_PerformCheck() is called with the 
> detectNewRows argument set to false, which causes CommandCounterIncrement() 
> to not be called.
>


Hmmm, I'm not sure that you have read and interpreted the patch code correctly.
The existence of a RI_TRIGGER_FK trigger indicates the table has a foreign key, 
and an insert into such a table will generate a new commandId (so we must avoid 
that, as we don't currently have the technology to support sharing of new 
command IDs across the participants in the parallel operation). This is what 
the code comment says, It does not say that such a trigger generates a new 
command ID.

See Amit's updated comment here: 
https://github.com/postgres/postgres/commit/0d32511eca5aec205cb6b609638ea67129ef6665

In addition, the 2nd patch has an explicit test case for this (testing insert 
into a table that has a FK).
--------------------------------------------------


First of all, I anticipate this parallel INSERT SELECT feature will typically 
shine, and expected to work, in the ETL or ELT into a data warehouse or an ODS 
for analytics.  Bearing that in mind, let me list some issues or questions 
below.  But the current state of the patch would be of course attractive in 
some workloads, so I don't think these are not necessarily blockers.


(1)
According to the classic book "The Data Warehouse Toolkit" and the website [1] 
by its author, the fact table (large transaction history) in the data warehouse 
has foreign keys referencing to the dimension tables (small or medium-sized 
master or reference data).  So, parallel insert will be effective if it works 
when loading data into the fact table with foreign keys.

To answer the above question, I'm assuming:

CREATE TABLE some_dimension (key_col int PRIMARY KEY);
CREATE TABLE some_fact (some_key int REFERENCES some_dimension);
INSERT INTO some_fact SELECT ...;


My naive question is, "why should new command IDs be generated to check foreign 
key constraints in this INSERT case?  The check just reads the parent 
(some_dimension table here)..."

Looking a bit deeper into the code, although ri_PerformCheck() itself tries to 
avoid generating command IDs, it calls _SPI_execute_snapshot() with the 
read_only argument always set to false.  It in turn calls _SPI_execute_plan() 
-> CommandCounterIncrement() as follows:

[_SPI_execute_plan()]
            /*
             * If not read-only mode, advance the command counter before each
             * command and update the snapshot.
             */
            if (!read_only && !plan->no_snapshots)
            {
                CommandCounterIncrement();
                UpdateActiveSnapshotCommandId();
            }


Can't we pass true to read_only from ri_PerformCheck() in some cases?


(2)
Likewise, dimension tables have surrogate keys that are typically implemented 
as a sequence or an identity column.  It is suggested that even fact tables 
sometimes (or often?) have surrogate keys.  But the current patch does not 
parallelize the statement when the target table has a sequence or an identity 
column.

I was looking at the sequence code, and my naive (again) idea is that the 
parallel leader and workers allocates numbers from the sequence independently, 
and sets the largest number of them as the session's currval at the end of 
parallel operation.  We have to note in the documentation that gaps in the 
sequence numbers will arise and not used in parallel DML.


(3)
As Hou-san demonstrated, the current patch causes the resulting table and index 
to become larger when inserted in parallel than in inserted serially.  This 
could be a problem for analytics use cases where the table is just inserted and 
read only afterwards.  We could advise the user to run REINDEX CONCURRENTLY 
after loading data, but what about tables?

BTW, I don't know if Oracle and SQL Server have similar issues.  They may have 
some reason about this why they take an exclusive lock on the target table.


(4)
When the target table is partitioned, is the INSERT parallelized among its 
partitions?  Some plan like:

Parallel Append on parent_table
  -> Insert on partiton1
  -> Insert on partiton2



[1]
Fact Table Surrogate Key | Kimball Dimensional Modeling Techniques
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/fact-surrogate-key/


Regards
Takayuki Tsunakawa

RE: Parallel INSERT (INTO ... SELECT ...)

Reply via email to