On Wed, 11 Nov 2020, tsunakawa.ta...@fujitsu.com wrote:

From: Tomas Vondra <tomas.von...@enterprisedb.com>
I see the patch builds the "bulk" query in execute_foreign_modify. IMO
that's something we should do earlier, when we're building the simple
query (for 1-row inserts). I'd understand if you were concerned about
overhead in case of 1-row inserts, trying to not plan the bulk query
until necessary, but I'm not sure this actually helps.

Or was the goal to build a query for every possible number of slots? I
don't think that's really useful, considering it requires deallocating
the old plan, preparing a new one, etc. IMO it should be sufficient to
have two queries - one for 1-row inserts, one for the full batch. The
last incomplete batch can be inserted using a loop of 1-row queries.

That's what my patch was doing, but I'm not insisting on that - it just
seems like a better approach to me. So feel free to argue why this is
better.

Don't be concerned: the processing is unchanged for 1-row inserts. The INSERT 
query string is built in PlanForeignModify(), and the remote statement is 
prepared in execute_foreign_modify() during the first call to 
ExecForeignInsert(), then reused for subsequent ExecForeignInsert() calls.

The re-creation of the INSERT query string and its corresponding PREPARE happen 
when the number of tuples to be inserted differs from the previous call to 
ExecForeignInsert()/ExecForeignBulkInsert().  That's because we don't know how 
many tuples will be inserted at planning time (PlanForeignModify) or even 
during execution (for INSERT ... SELECT, not until the scan ends).  For 
example, if we insert 10,030 rows with a bulk size of 100, the flow is:

 PlanForeignModify():
   build the INSERT query string for 1 row
 ExecForeignBulkInsert(100):
   drop the INSERT query string and prepared statement for 1 row
   build the query string and prepare the statement for a 100-row INSERT
   execute it
 ExecForeignBulkInsert(100):
   reuse the prepared statement for the 100-row INSERT and execute it
...
 ExecForeignBulkInsert(30):
   drop the INSERT query string and prepared statement for 100 rows
   build the query string and prepare the statement for a 30-row INSERT
   execute it
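
In other words, the executor-side flow above amounts to something like the 
sketch below.  This is only my shorthand: PgFdwModifyState, p_name, and 
prepare_foreign_modify() exist in postgres_fdw today, but num_slots, 
rebuild_insert_query(), and deallocate_query() are assumed names for 
illustration, not necessarily the identifiers used in the patch.

/*
 * Sketch only: re-prepare when the batch size changes.
 */
if (fmstate->p_name == NULL || fmstate->num_slots != numSlots)
{
	/* Drop the statement prepared for the previous batch size, if any */
	if (fmstate->p_name != NULL)
		deallocate_query(fmstate);

	/* Rebuild the INSERT with numSlots rows in its VALUES clause */
	rebuild_insert_query(fmstate, numSlots);

	/* PREPARE the rebuilt statement on the remote server */
	prepare_foreign_modify(fmstate);
	fmstate->num_slots = numSlots;
}

/* Bind the parameters of all numSlots tuples and execute the statement */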


I think it'd be good to have such a GUC, even if only for testing and
development. We should probably have a way to disable the batching,
which the GUC could also do, I think. So +1 to have the GUC.

OK, I'll add it.  The name would be max_bulk_insert_tuples, because a) it might 
cover bulk inserts for local relations in the future, and b) "tuple" is used in 
cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" and "record" appear 
in no GUC name (except row_security).

The valid range would be between 1 and 1,000.  (I said 10,000 previously, but I 
think that was an overreaction, and I'm a bit worried about unforeseen trouble 
that too many tuples might cause.)  1 disables the bulk processing and uses the 
traditional ExecForeignInsert().  The default value is 100 (or would 1 be a 
more sensible default, to avoid surprising users with increased memory usage?).
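
To make that concrete, the corresponding ConfigureNamesInt[] entry in guc.c 
might look roughly like the following.  The name, default (100), and range 
(1 to 1,000) follow the discussion above; the category and descriptions are my 
assumptions, not text from a patch.

{
	{"max_bulk_insert_tuples", PGC_USERSET, RESOURCES_MEM,
		gettext_noop("Sets the maximum number of tuples accumulated for one bulk insert."),
		gettext_noop("Setting this to 1 disables bulk processing.")
	},
	&max_bulk_insert_tuples,
	100, 1, 1000,
	NULL, NULL, NULL
},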


2. Should we accumulate records per relation in ResultRelInfo
instead? That is, when inserting into a partitioned table that has
foreign partitions, delay insertion until a certain number of input
records accumulate, and then insert accumulated records per relation
(e.g., 50 records to relation A, 30 records to relation B, and 20
records to relation C.)  If we do that,


I think there's a chunk of text missing here? If we do that, then what?

Sorry, the two bullet items below it are what follows.  Perhaps I should have 
written ":" instead of ",".


Anyway, I don't see why accumulating the records in ResultRelInfo would
be better than what the patch does now. It seems to me fairly
specific to FDWs, so keeping it in the FDW state seems appropriate. What
would be the advantage of stashing it in ResultRelInfo?

I thought of distributing input records to their corresponding partitions' 
ResultRelInfos.  For example, when an input record for partition 1 comes, store 
it in the ResultRelInfo for partition 1; when an input record for partition 2 
comes, store it in the ResultRelInfo for partition 2.  When a ResultRelInfo 
accumulates some number of rows, insert the accumulated rows therein into the 
partition.  When the input ends, perform bulk inserts for the ResultRelInfos 
that still hold accumulated rows.  (A rough sketch follows.)
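
In rough pseudocode (ri_PendingSlots, ri_NumPending, and 
flush_pending_inserts() are invented names for illustration; 
ExecFindPartition() and ExecCopySlot() are existing executor facilities):

/* Route the incoming tuple to its partition's ResultRelInfo */
rri = ExecFindPartition(mtstate, rootResultRelInfo, proute, slot, estate);

/* Stash a copy in the partition's buffer instead of inserting it now */
rri->ri_PendingSlots[rri->ri_NumPending++] = ExecCopySlot(bufslot, slot);

/* Flush this partition once it has accumulated enough rows */
if (rri->ri_NumPending == max_bulk_insert_tuples)
	flush_pending_inserts(rri);		/* one bulk INSERT for this partition */

/* ... and when the input ends, flush every ResultRelInfo that still
 * holds pending rows */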



I think that's OK for most use cases, and if it's not (e.g. when there's
something requiring the exact order of writes) then it's not possible to
use batching. That's one of the reasons why I think we should have a GUC
to disable the batching.

Agreed.


* Should the maximum count of accumulated records be applied per
relation or per query? When many foreign partitions belong to a
partitioned table, if the former is chosen, it may use a lot of memory
in total.  If the latter is chosen, the records per relation could be
few, and thus the benefit of bulk insert could be small.


I think it needs to be applied per relation, because that's the level at
which we can do it easily and consistently. The whole point is to send
data in sufficiently large chunks to minimize the communication overhead
(latency etc.), but if you enforce it "per query" that seems hard.

Imagine you're inserting data into a table with many partitions - how do
you pick the number of rows to accumulate? The table may have 10 or 1000
partitions, we may be inserting into all partitions or just a small
subset, not all partitions may be foreign, etc. It seems pretty
difficult to pick and enforce a reliable limit at the query level. But
maybe I'm missing something and it's easier than I think?

Of course, you're entirely correct that enforcing this at the partition
level may require a lot of memory. Sadly, I don't see a way around that,
except for (a) disabling batching or (b) ordering the data so that we
insert into one partition at a time.

OK, I think I'll try doing it that way, after waiting some days for other 
opinions.


Two more comments regarding this:

1) If we want to be more strict about the memory consumption, we should
probably set the limit in terms of memory, not number of rows. Currently,
100 rows may be 10 kB or 10 MB; there's no way to know. Of course,
this is not the only place with this issue.

2) I wonder what the COPY FROM patch [1] does in this regard. I don't
have time to check right now, but I suggest we try to do the same thing,
if only to be consistent.

[1]
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

That COPY FROM patch uses the tuple accumulation mechanism for local tables 
as-is.  That is, it accumulates at most 1,000 tuples per partition.

/*
 * No more than this many tuples per CopyMultiInsertBuffer
 *
 * Caution: Don't make this too big, as we could end up with this many
 * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
 * multiInsertBuffers list.  Increasing this can cause quadratic growth in
 * memory requirements during copies into partitioned tables with a large
 * number of partitions.
 */
#define MAX_BUFFERED_TUPLES		1000
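
Worth noting, and relevant to your point 1) above: the same file also caps the 
buffered bytes, not just the tuple count.  From memory, the check is roughly as 
below (the exact constant and names should be verified against copy.c):

#define MAX_BUFFERED_BYTES		65535

static inline bool
CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
{
	if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES ||
		miinfo->bufferedBytes >= MAX_BUFFERED_BYTES)
		return true;
	return false;
}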


Regards
Takayuki Tsunakawa



Does this patch affect trigger semantics on the base table?

At the moment, when I insert 1000 rows into a postgres_fdw table using a
single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar), I
naively expect a "statement level" trigger on the base table to fire
once. But this is not the case: postgres_fdw implements this
operation as 1000 separate insert statements on the base table, so the
trigger fires 1000 times instead of once. Hence there is no
distinction between using a statement-level and a row-level trigger on
the base table in this context.

So would this patch change the behaviour so that only 10 separate insert
statements (each of 100 rows) are made against the base table?
If so, that's useful, as it means improving performance using
statement-level triggers becomes possible. But it would also make the
semantics more obscure, and might break user processes that depend on the
existing behaviour once the patch is applied.

BTW, is this subtlety documented? I haven't found anything, but I'm happy
to be proved wrong.

Tim



