Re: Parallel copy
On Mon, Dec 28, 2020 at 3:14 PM vignesh C wrote:
>
> Attached is a patch that was used for the same. The patch is written
> on top of the parallel copy patch.
> The design that Amit, Andres and myself voted for, i.e. the leader
> identifying the line boundaries and sharing them in shared memory, is
> performing better.

Hi Hackers,

I see the following as the remaining open problems with the parallel copy feature:

1) Leader identifying the line/tuple boundaries from the file and letting the workers pick and insert them in parallel, vs. the leader reading the file and letting the workers identify line/tuple boundaries and insert.
2) Determining parallel safety of partitioned tables.
3) Bulk extension of the relation while inserting, i.e. adding more than one extra block to the relation in RelationAddExtraBlocks.

Please let me know if I'm missing anything.

For (1) - Vignesh's experiments above show that the "leader identifying the line/tuple boundaries from the file, letting the workers pick and insert in parallel" design fares better.

For (2) - while it's being discussed in another thread (I'm not sure what the status of that thread is), how about we take this feature without support for partitioned tables, i.e. parallel copy is disabled for partitioned tables? Once the other discussion reaches a logical end, we can come back and enable parallel copy for partitioned tables.

For (3) - we need a way to extend or add new blocks quickly - fallocate might help here; I'm not sure who's working on it, others can comment better here.

Can we take the "parallel copy" feature forward, of course with some restrictions in place?

Thoughts?

Regards,
Bharath Rupireddy.
Re: Parallel copy
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block.
> And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
> Okay, but this first-phase (finding the line boundaries) can anyway be
> not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process or
> multiple processes.
>

I wrote a patch to compare the performance of the current implementation (the leader identifying the line boundaries) vs the workers identifying the line boundaries. The results are given below, as triples of (parallel copy time in seconds with the leader identifying the line boundaries, parallel copy time in seconds with the workers identifying the line boundaries, number of workers):

Use case 1 - 10 million rows, 5.2GB data, 3 indexes on integer columns:
(211.206, 632.583, 1), (165.402, 360.152, 2), (137.608, 219.623, 4),
(128.003, 206.851, 8), (114.518, 177.790, 16), (109.257, 170.058, 20),
(102.050, 158.376, 30)

Use case 2 - 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, csv file:
(1212.356, 1602.118, 1), (707.191, 849.105, 2), (369.620, 441.068, 4),
(221.359, 252.775, 8), (167.152, 180.207, 16), (168.804, 181.986, 20),
(172.320, 194.875, 30)

Use case 3 - 10 million rows, 5.2GB data without index:
(96.317, 437.453, 1), (70.730, 240.517, 2), (64.436, 197.604, 4),
(67.186, 175.630, 8), (76.561, 156.015, 16), (81.025, 150.687, 20),
(86.578, 148.481, 30)

Use case 4 - 1 records, 9.6GB, toast data:
(147.076, 276.323, 1), (101.610, 141.893, 2), (100.703, 134.096, 4),
(112.583, 134.765, 8), (101.898, 135.789, 16), (109.258, 135.625, 20),
(109.219, 136.144, 30)

Attached is a patch that was used for the same. The patch is written on top of the parallel copy patch.
The design that Amit, Andres and myself voted for, i.e. the leader identifying the line boundaries and sharing them in shared memory, is performing better.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

From dd9b6be19573b5391d01373b53e64a5c1dc305fd Mon Sep 17 00:00:00 2001
From: Vignesh C
Date: Mon, 28 Dec 2020 15:00:48 +0530
Subject: [PATCH v12 7/7] Parallel copy based on workers identifying line
 boundary.

Parallel copy based on workers identifying line boundary.
---
 src/backend/commands/copyfromparse.c     |  93 +++
 src/backend/commands/copyparallel.c      | 441 +--
 src/include/commands/copyfrom_internal.h |  38 ++-
 src/test/regress/expected/copy2.out      |   2 +-
 4 files changed, 318 insertions(+), 256 deletions(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index a767bae..1d79da9 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -82,13 +82,15 @@
 if (1) \
 { \
 	if (raw_buf_ptr > cstate->raw_buf_index) \
 	{ \
Re: Parallel copy
On Wed, Dec 23, 2020 at 3:05 PM Hou, Zhijie wrote:
>
> Hi
>
> > Yes this optimization can be done, I will handle this in the next patch
> > set.
>
> I have a suggestion for the parallel safety-check.
>
> As designed, the leader does not participate in the insertion of data.
> If the user uses (PARALLEL 1), there is only one worker process which will
> do the insertion.
>
> IMO, we can skip some of the safety-checks in this case, because the
> safety-checks are there to limit parallel insert.
> (except temporary table or ...)
>
> So, how about checking (PARALLEL 1) separately?
> Although it looks a bit complicated, (PARALLEL 1) does give a good
> performance improvement.
>

Thanks for the comments Hou Zhijie, I will run a few tests with 1 worker and try to include this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
RE: Parallel copy
Hi

> Yes this optimization can be done, I will handle this in the next patch
> set.

I have a suggestion for the parallel safety-check.

As designed, the leader does not participate in the insertion of data. If the user uses (PARALLEL 1), there is only one worker process which will do the insertion.

IMO, we can skip some of the safety-checks in this case, because the safety-checks are there to limit parallel insert. (except temporary table or ...)

So, how about checking (PARALLEL 1) separately? Although it looks a bit complicated, (PARALLEL 1) does give a good performance improvement.

Best regards,
houzj
Re: Parallel copy
On Mon, Dec 7, 2020 at 3:00 PM Hou, Zhijie wrote:
>
> > Attached v11 patch has the fix for this, it also includes the changes to
> > rebase on top of head.
>
> Thanks for the explanation.
>
> I think there are still chances we can know the size.
>
> +	 * line_size will be set. Read the line_size again to be sure if it is
> +	 * completed or partial block.
> +	 */
> +	dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +	if (dataSize != -1)
> +	{
>
> If I am not wrong, this seems the branch that is processing the populated block.
> I think we can check the copiedSize here: if copiedSize == 0, that means
> dataSize is the size of the whole line, and in this case we can do the
> enlarge.
>

Yes this optimization can be done, I will handle this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
RE: Parallel copy
> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It uses appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX will double the str memory when there is not
> > enough space.
> >
> > How about calling enlargeStringInfo in advance, if we already know the
> > whole line size?
> > It can avoid some memory waste and may improve performance a little.
> >
> Here we will not know the size beforehand; in some cases we will start
> processing the data when the current block is populated and keep processing
> block by block, and we will come to know the size only at the end. We cannot
> use enlargeStringInfo because of this.
>
> Attached v11 patch has the fix for this, it also includes the changes to
> rebase on top of head.

Thanks for the explanation.

I think there are still chances we can know the size.

+	 * line_size will be set. Read the line_size again to be sure if it is
+	 * completed or partial block.
+	 */
+	dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+	if (dataSize != -1)
+	{

If I am not wrong, this seems the branch that is processing the populated block. I think we can check the copiedSize here: if copiedSize == 0, that means dataSize is the size of the whole line, and in this case we can do the enlarge.

Best regards,
houzj
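To illustrate the suggestion above: when copiedSize == 0, the atomically-read line_size is the whole line's length, so the destination buffer can be grown once up front instead of being doubled repeatedly while appending. A minimal standalone sketch (this is not the patch's code - the Buf type and function names are simplified stand-ins for PostgreSQL's StringInfo/enlargeStringInfo):

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for StringInfo; not PostgreSQL's actual struct. */
typedef struct
{
	char	   *data;
	int			len;
	int			maxlen;
} Buf;

static void
buf_init(Buf *b)
{
	b->maxlen = 64;
	b->len = 0;
	b->data = malloc(b->maxlen);
}

/* Grow capacity to at least 'needed' bytes in a single realloc. */
static void
buf_enlarge(Buf *b, int needed)
{
	if (b->maxlen >= needed)
		return;
	while (b->maxlen < needed)
		b->maxlen *= 2;
	b->data = realloc(b->data, b->maxlen);
}

static void
buf_append(Buf *b, const char *src, int n)
{
	buf_enlarge(b, b->len + n); /* no-op if already pre-enlarged */
	memcpy(b->data + b->len, src, n);
	b->len += n;
}

/*
 * Copy one line whose pieces arrive block by block.  copiedSize == 0
 * means dataSize is the full line length, so reserve the space once
 * before appending the per-block pieces.
 */
static void
copy_line(Buf *b, int dataSize, int copiedSize,
		  const char *blocks[], const int lens[], int nblocks)
{
	if (copiedSize == 0)
		buf_enlarge(b, dataSize);	/* single allocation up front */
	for (int i = 0; i < nblocks; i++)
		buf_append(b, blocks[i], lens[i]);
}
```

The saving is only the avoided intermediate reallocations; the appended bytes are identical either way.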
RE: Parallel copy
Hi Vignesh,

I took a look at the v10 patch set. Here are some comments:

1.
+/*
+ * CheckExprParallelSafety
+ *
+ * Determine if where cluase and default expressions are parallel safe & do not
+ * have volatile expressions, return true if condition satisfies else return
+ * false.
+ */

'cluase' seems to be a typo.

2.
+	/*
+	 * Make sure that no worker has consumed this element, if this
+	 * line is spread across multiple data blocks, worker would have
+	 * started processing, no need to change the state to
+	 * LINE_LEADER_POPULATING in this case.
+	 */
+	(void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+										  &current_line_state,
+										  LINE_LEADER_POPULATED);

About the comment:

+	 * started processing, no need to change the state to
+	 * LINE_LEADER_POPULATING in this case.

Does it mean "no need to change the state to LINE_LEADER_POPULATED" here?

3.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the state to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.

In the latest patch, it will set the state to LINE_WORKER_PROCESSING if line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING, so the comment here seems wrong.

4.
A suggestion for CacheLineInfo.

It uses appendBinaryStringXXX to store the line in memory. appendBinaryStringXXX will double the str memory when there is not enough space.

How about calling enlargeStringInfo in advance, if we already know the whole line size? It can avoid some memory waste and may improve performance a little.

Best regards,
houzj
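Item 3 above concerns the atomic hand-off of a line from the leader to the workers. A standalone sketch of that protocol using C11 atomics (the patch itself uses pg_atomic_compare_exchange_u32, and its state enum has more members than shown here - this is a simplified illustration, not the patch's code): each worker tries to flip the line state from POPULATED to PROCESSING, and at most one compare-exchange can succeed.

```c
#include <stdatomic.h>

/* Simplified line states; the patch's enum has more members. */
enum
{
	LINE_LEADER_POPULATING = 0,
	LINE_LEADER_POPULATED = 1,
	LINE_WORKER_PROCESSING = 2
};

/*
 * Try to claim a populated line for this worker.  Returns 1 on success.
 * Because the compare-exchange is atomic, at most one of any number of
 * concurrent callers can move POPULATED -> PROCESSING; the losers see
 * the state already changed and move on to the next line.
 */
static int
claim_line(_Atomic unsigned int *line_state)
{
	unsigned int expected = LINE_LEADER_POPULATED;

	return atomic_compare_exchange_strong(line_state, &expected,
										  LINE_WORKER_PROCESSING);
}
```

The review point in item 3 is that the real patch also claims lines still in the POPULATING state (for lines spanning multiple blocks), which this simplified sketch does not model.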
Re: Parallel copy
On Sat, Nov 7, 2020 at 7:01 PM vignesh C wrote:
>
> On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie wrote:
> >
> > Hi
> >
> > >
> > > my $bytes = $ARGV[0];
> > > for(my $i = 0; $i < $bytes; $i+=8){
> > >     print "longdata";
> > > }
> > > print "\n";
> > >
> > >
> > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> > > with (parallel 2);
> > >
> > > This gets stuck forever (or at least I didn't have the patience to wait
> > > for it to finish). Both worker processes are consuming 100% of CPU.
> >
> > I had a look over this problem.
> >
> > The ParallelCopyDataBlock has a size limit:
> > 	uint8	skip_bytes;
> > 	char	data[DATA_BLOCK_SIZE];	/* data read from file */
> >
> > It seems the input line is so long that the leader process runs out of the
> > shared memory shared among the parallel copy workers,
> > and the leader process keeps waiting for a free block.
> >
> > The worker process waits until line_state becomes LINE_LEADER_POPULATED,
> > but the leader process won't set the line_state unless it has read the whole line.
> >
> > So it is stuck forever.
> > Maybe we should reconsider this situation.
> >
> > The stack is as follows:
> >
> > Leader stack:
> > #3  0x0075f7a1 in WaitLatch (latch=, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> > #4  0x005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> > #5  0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536,
> >     copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> > #6  0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> > #7  0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
> >
> > Worker stack:
> > #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> > #1  0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> > #2  0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> > #3  0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> > #4  0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> > #5  0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
> >
>
> Thanks for providing your thoughts. I have analyzed this issue and I'm
> working on the fix for this, I will be posting a patch for this
> shortly.
>

I have fixed and provided a patch for this at [1]

[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Sat, Oct 31, 2020 at 2:07 AM Tomas Vondra wrote:
>
> Hi,
>
> I've done a bit more testing today, and I think the parsing is busted in
> some way. Consider this:
>
> test=# create extension random;
> CREATE EXTENSION
>
> test=# create table t (a text);
> CREATE TABLE
>
> test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,1);
> INSERT 0 1
>
> test=# copy t to '/mnt/data/t.csv';
> COPY 1
>
> test=# truncate t;
> TRUNCATE TABLE
>
> test=# copy t from '/mnt/data/t.csv';
> COPY 1
>
> test=# truncate t;
> TRUNCATE TABLE
>
> test=# copy t from '/mnt/data/t.csv' with (parallel 2);
> ERROR:  invalid byte sequence for encoding "UTF8": 0x00
> CONTEXT:  COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB >F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
> parallel worker
>
> The functions come from an extension I use to generate random data, I've
> pushed it to github [1]. The random_string() generates a random string
> with ASCII characters, symbols and a couple special characters (\r\n\t).
> The intent was to try loading data where a field may span multiple 64kB
> blocks and may contain newlines etc.
>
> The non-parallel copy works fine, the parallel one fails. I haven't
> investigated the details, but I guess it gets confused about where a
> string starts/ends, or something like that.
>

Thanks for identifying this issue, this issue is fixed in v10 patch posted at [1]

[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Nov 13, 2020 at 2:25 PM Amit Kapila wrote:
>
> On Wed, Nov 11, 2020 at 10:42 PM vignesh C wrote:
> >
> > On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila wrote:
> > >
> > > On Tue, Nov 10, 2020 at 7:12 PM vignesh C wrote:
> > > >
> > > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote:
> > > > >
> > > >
> > > > I have worked to provide a patch for the parallel safety checks. It
> > > > checks if parallely copy can be performed, Parallel copy cannot be
> > > > performed for the following a) If relation is temporary table b) If
> > > > relation is foreign table c) If relation has non parallel safe index
> > > > expressions d) If relation has triggers present whose type is of non
> > > > before statement trigger type e) If relation has check constraint
> > > > which are not parallel safe f) If relation has partition and any
> > > > partition has the above type. This patch has the checks for it. This
> > > > patch will be used by parallel copy implementation.
> > > >
> > >
> > > How did you ensure that this is sufficient? For parallel-insert's
> > > patch we have enabled parallel-mode for Inserts and ran the tests with
> > > force_parallel_mode to see if we are not missing anything. Also, it
> > > seems there are many common things here w.r.t parallel-insert patch,
> > > is it possible to prepare this atop that patch or do you have any
> > > reason to keep this separate?
> > >
> >
> > I have done similar testing for copy too, I had set force_parallel
> > mode to regress, hardcoded in the code to pick parallel workers for
> > copy operation and ran make installcheck-world to verify. Many checks
> > in this patch are common between both patches, but I was not sure how
> > to handle it as both the projects are in-progress and are being
> > updated based on the reviewer's opinion. How to handle this?
> > Thoughts?
> >
>
> I have not studied the differences in detail but if it is possible to
> prepare it on top of that patch then there shouldn't be a problem.
> To avoid confusion if you want you can always either post the latest
> version of that patch with your patch or point to it.
>

I have made this as a separate patch as of now. I will work on to see if I can use Greg's changes as is, or if required I will provide a few review comments on top of Greg's patch so that it is usable for parallel copy too, and later post a separate patch with the changes on top of it. I will retain it as a separate patch till that time.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Oct 29, 2020 at 2:26 PM Daniel Westermann (DWE) wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> >> Attached v9 patches have the fixes for the above comments.
>
> >I did some testing:
>
> I did some testing as well and have a cosmetic remark:
>
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
> ERROR:  value 10 out of bounds for option "parallel"
> DETAIL:  Valid values are between "1" and "1024".
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
> ERROR:  parallel requires an integer value
> postgres=#
>
> Wouldn't it make more sense to only have one error message? The first one
> seems to be the better message.
>

I had seen similar behavior in other places too:

postgres=# vacuum (parallel 10) t1;
ERROR:  parallel vacuum degree must be between 0 and 1024
LINE 1: vacuum (parallel 10) t1;
                ^
postgres=# vacuum (parallel 1000) t1;
ERROR:  parallel requires an integer value

I'm not sure if we should fix this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Wed, Oct 28, 2020 at 5:36 PM Hou, Zhijie wrote:
>
> Hi
>
> I found some issues in v9-0002:
>
> 1.
> +
> +	elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
> +		 write_pos, lineInfo->first_block,
> +		 pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
> +		 offset, pg_atomic_read_u32(&lineInfo->line_size));
> +
>
> write_pos and the other variables printed here are of type uint32; I think
> it's better to use '%u' in the elog msg.
>

Modified it.

> 2.
> +	 * line_size will be set. Read the line_size again to be sure if it is
> +	 * completed or partial block.
> +	 */
> +	dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +	if (dataSize)
>
> It uses dataSize (type int) to receive a uint32, which seems a little
> dangerous. Is it better to define dataSize as uint32 here?
>

Modified it.

> 3.
> Since functions with 'Cstate' in the name have been changed to 'CState',
> I think we can change the function PopulateCommonCstateInfo as well.
>

Modified it.

> 4.
> +	if (pcdata->worker_line_buf_count)
>
> I think checks like the above can be written as 'if (xxx > 0)', which
> seems easier to understand.
>

Modified it.

Thanks for the comments, these issues are fixed in v10 patch posted at [1]

[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Oct 29, 2020 at 2:20 PM Heikki Linnakangas wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> > Attached v9 patches have the fixes for the above comments.
>
> I did some testing:
>
> /tmp/longdata.pl:
>
> #!/usr/bin/perl
> #
> # Generate three rows:
> # foo
> # longdatalongdatalongdata...
> # bar
> #
> # The length of the middle row is given as command line arg.
> #
>
> my $bytes = $ARGV[0];
>
> print "foo\n";
> for(my $i = 0; $i < $bytes; $i+=8){
>     print "longdata";
> }
> print "\n";
> print "bar\n";
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 1'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> for it to finish). Both worker processes are consuming 100% of CPU.
>

Thanks for identifying this issue, this issue is fixed in v10 patch posted at [1]

[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>

Hi,

I did some analysis on using a spinlock-protected worker write position, i.e. each worker acquires a spinlock on a shared write position to choose the next available chunk, vs each worker hopping to get the next available chunk position:

Use case: 10mn rows, 5.6GB data, 2 indexes on integer columns, 1 index on text column. Results are of the form (no of workers, total exec time in sec, index insertion time in sec, worker write pos get time in sec, buffer contention event count):

With spinlock:
(1,1126.443,1060.067,0.478,*0*), (2,669.343,630.769,0.306,*26*),
(4,346.297,326.950,0.161,*89*), (8,209.600,196.417,0.088,*291*),
(16,166.113,157.086,0.065,*1468*), (20,173.884,166.013,0.067,*2700*),
(30,173.087,1166.565,0.0065,*5346*)

Without spinlock:
(1,1119.695,1054.586,0.496,*0*), (2,645.733,608.313,1.5,*8*),
(4,340.620,320.344,1.6,*58*), (8,203.985,189.644,1.3,*222*),
(16,142.997,133.045,1,*813*), (20,132.621,122.527,1.1,*1215*),
(30,135.737,126.716,1.5,*2901*)

With the spinlock, each worker gets the required write position quickly and proceeds until the index insertion (which becomes a single point of contention), where we observed more buffer lock contention. The reason is that all the workers reach the index insertion point at a similar time. Without the spinlock, each worker spends some time hopping to get the write position, during which the other workers are inserting into the indexes. So basically, all the workers do not reach the index insertion point at the same time, and hence there is less buffer lock contention.

The same behaviour (explained above) is observed with different worker chunk counts (default 64, 128, 512 and 1024), i.e. the number of tuples each worker caches into its local memory before inserting into the table.

In summary: with the spinlock, it looks like we are able to avoid workers waiting to get the next chunk, which also means that we are not creating any contention point inside the parallel copy code. However this causes another choking point, i.e. index insertion if indexes are present on the table, which is out of the scope of the parallel copy code.

We think that it would be good to use a spinlock-protected worker write position, or an atomic variable for the worker write position (as it performs equal to the spinlock or a little better on some platforms).

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
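The atomic-variable alternative mentioned at the end of the message above can be sketched as follows, using standalone C11 atomics rather than PostgreSQL's pg_atomic API (names and the ring size are illustrative, not from the patch): each worker obtains its next chunk index with a single fetch-and-add on a shared counter instead of hopping through chunk states or taking a spinlock.

```c
#include <stdatomic.h>

#define RING_SIZE 10240			/* illustrative ring of chunk slots */

/*
 * Shared next-chunk counter.  One atomic fetch_add per claim hands each
 * caller a distinct position, so no two workers ever pick the same
 * chunk and no worker has to scan past already-claimed chunks.
 */
static _Atomic unsigned int worker_write_pos;

/* Returns the ring slot this worker should process next. */
static unsigned int
next_chunk(void)
{
	return atomic_fetch_add(&worker_write_pos, 1) % RING_SIZE;
}
```

On most platforms this compiles to a single lock-prefixed instruction, which is why it performs comparably to (or slightly better than) a spinlock-protected position.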
Re: Parallel copy
On Wed, Nov 11, 2020 at 10:42 PM vignesh C wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila wrote:
> >
> > On Tue, Nov 10, 2020 at 7:12 PM vignesh C wrote:
> > >
> > > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote:
> > > >
> > >
> > > I have worked to provide a patch for the parallel safety checks. It
> > > checks if parallely copy can be performed, Parallel copy cannot be
> > > performed for the following a) If relation is temporary table b) If
> > > relation is foreign table c) If relation has non parallel safe index
> > > expressions d) If relation has triggers present whose type is of non
> > > before statement trigger type e) If relation has check constraint
> > > which are not parallel safe f) If relation has partition and any
> > > partition has the above type. This patch has the checks for it. This
> > > patch will be used by parallel copy implementation.
> > >
> >
> > How did you ensure that this is sufficient? For parallel-insert's
> > patch we have enabled parallel-mode for Inserts and ran the tests with
> > force_parallel_mode to see if we are not missing anything. Also, it
> > seems there are many common things here w.r.t parallel-insert patch,
> > is it possible to prepare this atop that patch or do you have any
> > reason to keep this separate?
> >
>
> I have done similar testing for copy too, I had set force_parallel
> mode to regress, hardcoded in the code to pick parallel workers for
> copy operation and ran make installcheck-world to verify. Many checks
> in this patch are common between both patches, but I was not sure how
> to handle it as both the projects are in-progress and are being
> updated based on the reviewer's opinion. How to handle this?
> Thoughts?
>

I have not studied the differences in detail but if it is possible to prepare it on top of that patch then there shouldn't be a problem. To avoid confusion, if you want, you can always either post the latest version of that patch with your patch or point to it.

--
With Regards,
Amit Kapila.
Re: Parallel copy
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C wrote:
> >
> > On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote:
> > >
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks if parallely copy can be performed, Parallel copy cannot be
> > performed for the following a) If relation is temporary table b) If
> > relation is foreign table c) If relation has non parallel safe index
> > expressions d) If relation has triggers present whose type is of non
> > before statement trigger type e) If relation has check constraint
> > which are not parallel safe f) If relation has partition and any
> > partition has the above type. This patch has the checks for it. This
> > patch will be used by parallel copy implementation.
> >
>
> How did you ensure that this is sufficient? For parallel-insert's
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t parallel-insert patch,
> is it possible to prepare this atop that patch or do you have any
> reason to keep this separate?
>

I have done similar testing for copy too, I had set force_parallel_mode to regress, hardcoded in the code to pick parallel workers for the copy operation and ran make installcheck-world to verify. Many checks in this patch are common between both patches, but I was not sure how to handle it as both the projects are in-progress and are being updated based on the reviewers' opinions. How to handle this? Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Tue, Nov 10, 2020 at 7:12 PM vignesh C wrote:
>
> On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote:
> >
>
> I have worked to provide a patch for the parallel safety checks. It
> checks if parallely copy can be performed, Parallel copy cannot be
> performed for the following a) If relation is temporary table b) If
> relation is foreign table c) If relation has non parallel safe index
> expressions d) If relation has triggers present whose type is of non
> before statement trigger type e) If relation has check constraint
> which are not parallel safe f) If relation has partition and any
> partition has the above type. This patch has the checks for it. This
> patch will be used by parallel copy implementation.
>

How did you ensure that this is sufficient? For parallel-insert's patch we have enabled parallel-mode for Inserts and ran the tests with force_parallel_mode to see if we are not missing anything. Also, it seems there are many common things here w.r.t the parallel-insert patch; is it possible to prepare this atop that patch, or do you have any reason to keep this separate?

--
With Regards,
Amit Kapila.
Re: Parallel copy
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch the worker doesn't need to wait till the complete block is
> > > processed, it can start processing the lines already found. This will
> > > also allow workers to start much earlier to process the data as it
> > > doesn't need to wait for all the offsets corresponding to 64K block
> > > ready. However, in the design where each worker is processing the 64K
> > > block, it can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more where in most cases (200-300 bytes tuples) we
> > > receive line-by-line from client and find the line-endings by leader.
> > > If the leader doesn't find the line-endings the workers need to wait
> > > till the leader fill the entire 64K chunk, OTOH, with current approach
> > > the worker can start as soon as leader is able to populate some
> > > minimum number of line-endings
> >
> > You can use a smaller block size.
> >
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block.
And then there > could be cases where a single line could be greater than 64K. > > > However, the point of parallel copy is > > to maximize bandwidth. > > > > Okay, but this first-phase (finding the line boundaries) can anyway be > not done in parallel and we have seen in some of the initial > benchmarking that this initial phase is a small part of work > especially when the table has indexes, constraints, etc. So, I think > it won't matter much if this splitting is done in a single process or > multiple processes. > > > If the workers ever have to sit idle, it means > > that the bottleneck is in receiving data from the client, i.e. the > > backend is fast enough, and you can't make the overall COPY finish any > > faster no matter how you do it. > > > > > The other point is that the leader backend won't be used completely as > > > it is only doing a very small part (primarily reading the file) of the > > > overall work. > > > > An idle process doesn't cost anything. If you have free CPU resources, > > use more workers. > > > > > We have discussed both these approaches (a) single producer multiple > > > consumer, and (b) all workers doing the processing as you are saying > > > in the beginning and concluded that (a) is better, see some of the > > > relevant emails [1][2][3]. > > > > > > [1] - > > > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de > > > [2] - > > > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de > > > [3] - > > > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de > > > > Sorry I'm late to the party. I don't think the design I proposed was > > discussed in that threads. > > > > I think something close to that is discussed as you have noticed in > your next email but IIRC, because many people (Andres, Ants, myself > and author) favoured the current approach (single reader and multiple > consumers) we decided to go with that. 
I feel this patch is very much > in the POC stage due to which the code doesn't look good and as we > move forward we need to see what is the better way to improve it, > maybe one of the ways is to split it as you are suggesting so that it > can be easier to review. I think the other important thing which this > patch has not addressed properly is the parallel-safety checks as > pointed by me earlier. There are two things to solve there (a) the > lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c > APIs, etc.) have checks which doesn't allow any writes, we need to see > which of those we can open now (or do some additional work to prevent > from those checks) after some of the work done for parallel-writes in > PG-13[1][2], and (b) in which all cases we can parallel-writes > (parallel copy) is allowed, for example need to identify whether table > or one of its partitions has any constraint/expression which is > parallel-unsafe. > I have worked to provide a patch for the parallel safety checks. It checks if parallel copy can be performed.
Re: Parallel copy
On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie wrote: > > Hi > > > > > my $bytes = $ARGV[0]; > > for(my $i = 0; $i < $bytes; $i+=8){ > > print "longdata"; > > } > > print "\n"; > > > > > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1' > > with (parallel 2); > > > > This gets stuck forever (or at least I didn't have the patience to wait > > it finish). Both worker processes are consuming 100% of CPU. > > I had a look over this problem. > > the ParallelCopyDataBlock has size limit: > uint8 skip_bytes; > chardata[DATA_BLOCK_SIZE]; /* data read from file */ > > It seems the input line is so long that the leader process run out of the > Shared memory among parallel copy workers. > And the leader process keep waiting free block. > > For the worker process, it wait util line_state becomes LINE_LEADER_POPULATED, > But leader process won't set the line_state unless it read the whole line. > > So it stuck forever. > May be we should reconsider about this situation. > > The stack is as follows: > > Leader stack: > #3 0x0075f7a1 in WaitLatch (latch=, > wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, > wait_event_info=wait_event_info@entry=150994945) at latch.c:411 > #4 0x005a9245 in WaitGetFreeCopyBlock > (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546 > #5 0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, > line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, > raw_buf_ptr=raw_buf_ptr@entry=65536, > copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572 > #6 0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at > copy.c:4058 > #7 0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at > copy.c:3863 > > Worker stack: > #0 GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474 > #1 0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, > buff_count=buff_count@entry=0) at copyparallel.c:711 > #2 0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at 
> copyparallel.c:885 > #3 0x005a4f2e in NextCopyFromRawFields > (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, > nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615 > #4 0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, > econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at > copy.c:3696 > #5 0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at > copy.c:2985 > Thanks for providing your thoughts. I have analyzed this issue and I'm working on the fix for this, I will be posting a patch for this shortly. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
RE: Parallel copy
Hi > > my $bytes = $ARGV[0]; > for(my $i = 0; $i < $bytes; $i+=8){ > print "longdata"; > } > print "\n"; > > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 1' > with (parallel 2); > > This gets stuck forever (or at least I didn't have the patience to wait > it finish). Both worker processes are consuming 100% of CPU. I had a look at this problem. The ParallelCopyDataBlock has a size limit: uint8 skip_bytes; char data[DATA_BLOCK_SIZE]; /* data read from file */ It seems the input line is so long that the leader process runs out of the shared memory among the parallel copy workers, and the leader process keeps waiting for a free block. The worker process waits until line_state becomes LINE_LEADER_POPULATED, but the leader process won't set the line_state until it has read the whole line. So it is stuck forever. Maybe we should reconsider this situation. The stack is as follows: Leader stack: #3 0x0075f7a1 in WaitLatch (latch=, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411 #4 0x005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546 #5 0x005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572 #6 0x005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058 #7 0x005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863 Worker stack: #0 GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474 #1 0x005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711 #2 0x005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885 #3 0x005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, 
nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615 #4 0x005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696 #5 0x005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985 Best regards, houzj
Re: Parallel copy
On 03/11/2020 10:59, Amit Kapila wrote: On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas wrote: However, the point of parallel copy is to maximize bandwidth. Okay, but this first-phase (finding the line boundaries) can anyway be not done in parallel and we have seen in some of the initial benchmarking that this initial phase is a small part of work especially when the table has indexes, constraints, etc. So, I think it won't matter much if this splitting is done in a single process or multiple processes. Right, it won't matter performance-wise. That's not my point. The difference is in the complexity. If you don't store the line boundaries in shared memory, you get away with much simpler shared memory structures. I think something close to that is discussed as you have noticed in your next email but IIRC, because many people (Andres, Ants, myself and author) favoured the current approach (single reader and multiple consumers) we decided to go with that. I feel this patch is very much in the POC stage due to which the code doesn't look good and as we move forward we need to see what is the better way to improve it, maybe one of the ways is to split it as you are suggesting so that it can be easier to review. Sure. I think the roadmap here is: 1. Split copy.c [1]. Not strictly necessary, but I think it'd make this nice to review and work with. 2. Refactor CopyReadLine(), so that finding the line-endings and the rest of the line-parsing are separated into separate functions. 3. Implement parallel copy. I think the other important thing which this patch has not addressed properly is the parallel-safety checks as pointed by me earlier. There are two things to solve there (a) the lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c APIs, etc.) 
have checks which doesn't allow any writes, we need to see which of those we can open now (or do some additional work to prevent from those checks) after some of the work done for parallel-writes in PG-13[1][2], and (b) in which all cases we can parallel-writes (parallel copy) is allowed, for example need to identify whether table or one of its partitions has any constraint/expression which is parallel-unsafe. Agreed, that needs to be solved. I haven't given it any thought myself. - Heikki [1] https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi
Re: Parallel copy
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas wrote: > > On 02/11/2020 08:14, Amit Kapila wrote: > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas wrote: > >> > >> In this design, you don't need to keep line boundaries in shared memory, > >> because each worker process is responsible for finding the line > >> boundaries of its own block. > >> > >> There's a point of serialization here, in that the next block cannot be > >> processed, until the worker working on the previous block has finished > >> scanning the EOLs, and set the starting position on the next block, > >> putting it in READY state. That's not very different from your patch, > >> where you had a similar point of serialization because the leader > >> scanned the EOLs, > > > > But in the design (single producer multiple consumer) used by the > > patch the worker doesn't need to wait till the complete block is > > processed, it can start processing the lines already found. This will > > also allow workers to start much earlier to process the data as it > > doesn't need to wait for all the offsets corresponding to 64K block > > ready. However, in the design where each worker is processing the 64K > > block, it can lead to much longer waits. I think this will impact the > > Copy STDIN case more where in most cases (200-300 bytes tuples) we > > receive line-by-line from client and find the line-endings by leader. > > If the leader doesn't find the line-endings the workers need to wait > > till the leader fill the entire 64K chunk, OTOH, with current approach > > the worker can start as soon as leader is able to populate some > > minimum number of line-endings > > You can use a smaller block size. > Sure, but the same problem can happen if the last line in that block is too long and we need to peek into the next block. And then there could be cases where a single line could be greater than 64K. > However, the point of parallel copy is > to maximize bandwidth. 
> Okay, but this first phase (finding the line boundaries) anyway cannot be done in parallel, and we have seen in some of the initial benchmarking that this initial phase is a small part of the work, especially when the table has indexes, constraints, etc. So, I think it won't matter much whether this splitting is done in a single process or in multiple processes. > If the workers ever have to sit idle, it means > that the bottleneck is in receiving data from the client, i.e. the > backend is fast enough, and you can't make the overall COPY finish any > faster no matter how you do it. > > > The other point is that the leader backend won't be used completely as > > it is only doing a very small part (primarily reading the file) of the > > overall work. > > An idle process doesn't cost anything. If you have free CPU resources, > use more workers. > > > We have discussed both these approaches (a) single producer multiple > > consumer, and (b) all workers doing the processing as you are saying > > in the beginning and concluded that (a) is better, see some of the > > relevant emails [1][2][3]. > > > > [1] - > > https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de > > [2] - > > https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de > > [3] - > > https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de > > Sorry I'm late to the party. I don't think the design I proposed was > discussed in that threads. > I think something close to that is discussed as you have noticed in your next email, but IIRC, because many people (Andres, Ants, myself, and the author) favoured the current approach (single reader and multiple consumers), we decided to go with that. 
I feel this patch is very much in the POC stage, due to which the code doesn't look good, and as we move forward we need to see what is the better way to improve it; maybe one of the ways is to split it as you are suggesting so that it can be easier to review. I think the other important thing which this patch has not addressed properly is the parallel-safety checks, as pointed out by me earlier. There are two things to solve there: (a) the lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c APIs, etc.) has checks which don't allow any writes; we need to see which of those we can open up now (or do some additional work to get past those checks) after some of the work done for parallel writes in PG-13 [1][2], and (b) in which cases parallel writes (parallel copy) are allowed; for example, we need to identify whether the table or one of its partitions has any constraint/expression which is parallel-unsafe. [1] 85f6b49 Allow relation extension lock to conflict among parallel group members [2] 3ba59cc Allow page lock to conflict among parallel group members > > I want to throw out one more idea. It's an interim step, not the final > solution we want, but a useful step in getting there: > > Have the leader process scan the input for
Re: Parallel copy
On 02/11/2020 09:10, Heikki Linnakangas wrote: On 02/11/2020 08:14, Amit Kapila wrote: We have discussed both these approaches (a) single producer multiple consumer, and (b) all workers doing the processing as you are saying in the beginning and concluded that (a) is better, see some of the relevant emails [1][2][3]. [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de Sorry I'm late to the party. I don't think the design I proposed was discussed in that threads. The alternative that's discussed in that thread seems to be something much more fine-grained, where processes claim individual lines. I'm not sure though, I didn't fully understand the alternative designs. I read the thread more carefully, and I think Robert had basically the right idea here (https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com): I really think we don't want a single worker in charge of finding tuple boundaries for everybody. That adds a lot of unnecessary inter-process communication and synchronization. Each process should just get the next tuple starting after where the last one ended, and then advance the end pointer so that the next process can do the same thing. [...] And here (https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com): On Thu, Apr 9, 2020 at 2:55 PM Andres Freund wrote: I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention. 
That's a fair point. I think the solution ought to be that once any process starts finding line endings, it continues until it's grabbed at least a certain amount of data for itself. Then it stops and lets some other process grab a chunk of data. Yes! That's pretty close to the design I sketched. I imagined that the leader would divide the input into 64 kB blocks, and each block would have few metadata fields, notably the starting position of the first line in the block. I think Robert envisioned having a single "next starting position" field in shared memory. That works too, and is even simpler, so +1 for that. For some reason, the discussion took a different turn from there, to discuss how the line-endings (called "chunks" in the discussion) should be represented in shared memory. But none of that is necessary with Robert's design. - Heikki
Re: Parallel copy
On 02/11/2020 08:14, Amit Kapila wrote: On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas wrote: Leader process: The leader process is simple. It picks the next FREE buffer, fills it with raw data from the file, and marks it as FILLED. If no buffers are FREE, wait. Worker process: 1. Claim next READY block from queue, by changing its state to PROCESSING. If the next block is not READY yet, wait until it is. 2. Start scanning the block from 'startpos', finding end-of-line markers. (in CSV mode, need to track when we're in-quotes). 3. When you reach the end of the block, if the last line continues to next block, wait for the next block to become FILLED. Peek into the next block, and copy the remaining part of the split line to a local buffer, and set the 'startpos' on the next block to point to the end of the split line. Mark the next block as READY. 4. Process all the lines in the block, call input functions, insert rows. 5. Mark the block as DONE. In this design, you don't need to keep line boundaries in shared memory, because each worker process is responsible for finding the line boundaries of its own block. There's a point of serialization here, in that the next block cannot be processed, until the worker working on the previous block has finished scanning the EOLs, and set the starting position on the next block, putting it in READY state. That's not very different from your patch, where you had a similar point of serialization because the leader scanned the EOLs, But in the design (single producer multiple consumer) used by the patch the worker doesn't need to wait till the complete block is processed, it can start processing the lines already found. This will also allow workers to start much earlier to process the data as it doesn't need to wait for all the offsets corresponding to 64K block ready. However, in the design where each worker is processing the 64K block, it can lead to much longer waits. 
I think this will impact the Copy STDIN case more where in most cases (200-300 bytes tuples) we receive line-by-line from client and find the line-endings by leader. If the leader doesn't find the line-endings the workers need to wait till the leader fill the entire 64K chunk, OTOH, with current approach the worker can start as soon as leader is able to populate some minimum number of line-endings You can use a smaller block size. However, the point of parallel copy is to maximize bandwidth. If the workers ever have to sit idle, it means that the bottleneck is in receiving data from the client, i.e. the backend is fast enough, and you can't make the overall COPY finish any faster no matter how you do it. The other point is that the leader backend won't be used completely as it is only doing a very small part (primarily reading the file) of the overall work. An idle process doesn't cost anything. If you have free CPU resources, use more workers. We have discussed both these approaches (a) single producer multiple consumer, and (b) all workers doing the processing as you are saying in the beginning and concluded that (a) is better, see some of the relevant emails [1][2][3]. [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de Sorry I'm late to the party. I don't think the design I proposed was discussed in that threads. The alternative that's discussed in that thread seems to be something much more fine-grained, where processes claim individual lines. I'm not sure though, I didn't fully understand the alternative designs. I want to throw out one more idea. It's an interim step, not the final solution we want, but a useful step in getting there: Have the leader process scan the input for line-endings. 
Split the input data into blocks of slightly under 64 kB in size, so that a line never crosses a block. Put the blocks in shared memory. A worker process claims a block from shared memory, processes it from beginning to end. It *also* has to parse the input to split it into lines. In this design, the line-splitting is done twice. That's clearly not optimal, and we want to avoid that in the final patch, but I think it would be a useful milestone. After that patch is done, write another patch to either a) implement the design I sketched, where blocks are fixed-size and a worker notifies the next worker on where the first line in next block begins, or b) have the leader process report the line-ending positions in shared memory, so that workers don't need to scan them again. Even if we apply the patches together, I think splitting them like that would make for easier review. - Heikki
Re: Parallel copy
On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas wrote: > > Leader process: > > The leader process is simple. It picks the next FREE buffer, fills it > with raw data from the file, and marks it as FILLED. If no buffers are > FREE, wait. > > Worker process: > > 1. Claim next READY block from queue, by changing its state to > PROCESSING. If the next block is not READY yet, wait until it is. > > 2. Start scanning the block from 'startpos', finding end-of-line > markers. (in CSV mode, need to track when we're in-quotes). > > 3. When you reach the end of the block, if the last line continues to > next block, wait for the next block to become FILLED. Peek into the > next block, and copy the remaining part of the split line to a local > buffer, and set the 'startpos' on the next block to point to the end > of the split line. Mark the next block as READY. > > 4. Process all the lines in the block, call input functions, insert > rows. > > 5. Mark the block as DONE. > > In this design, you don't need to keep line boundaries in shared memory, > because each worker process is responsible for finding the line > boundaries of its own block. > > There's a point of serialization here, in that the next block cannot be > processed, until the worker working on the previous block has finished > scanning the EOLs, and set the starting position on the next block, > putting it in READY state. That's not very different from your patch, > where you had a similar point of serialization because the leader > scanned the EOLs, > But in the design (single producer multiple consumer) used by the patch the worker doesn't need to wait till the complete block is processed, it can start processing the lines already found. This will also allow workers to start much earlier to process the data as it doesn't need to wait for all the offsets corresponding to 64K block ready. However, in the design where each worker is processing the 64K block, it can lead to much longer waits. 
I think this will impact the Copy STDIN case more where in most cases (200-300 bytes tuples) we receive line-by-line from client and find the line-endings by leader. If the leader doesn't find the line-endings the workers need to wait till the leader fill the entire 64K chunk, OTOH, with current approach the worker can start as soon as leader is able to populate some minimum number of line-endings The other point is that the leader backend won't be used completely as it is only doing a very small part (primarily reading the file) of the overall work. We have discussed both these approaches (a) single producer multiple consumer, and (b) all workers doing the processing as you are saying in the beginning and concluded that (a) is better, see some of the relevant emails [1][2][3]. [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de -- With Regards, Amit Kapila.
Re: Parallel copy
On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote: On 30/10/2020 22:56, Tomas Vondra wrote: I agree this design looks simpler. I'm a bit worried about serializing the parsing like this, though. It's true the current approach (where the first phase of parsing happens in the leader) has a similar issue, but I think it would be easier to improve that in that design. My plan was to parallelize the parsing roughly like this: 1) split the input buffer into smaller chunks 2) let workers scan the buffers and record positions of interesting characters (delimiters, quotes, ...) and pass it back to the leader 3) use the information to actually parse the input data (we only need to look at the interesting characters, skipping large parts of data) 4) pass the parsed chunks to workers, just like in the current patch But maybe something like that would be possible even with the approach you propose - we could have a special parse phase for processing each buffer, where any worker could look for the special characters, record the positions in a bitmap next to the buffer. So the whole sequence of states would look something like this: EMPTY FILLED PARSED READY PROCESSING I think it's even simpler than that. You don't need to communicate the "interesting positions" between processes, if the same worker takes care of the chunk through all states from FILLED to DONE. You can build the bitmap of interesting positions immediately in FILLED state, independently of all previous blocks. Once you've built the bitmap, you need to wait for the information on where the first line starts, but presumably finding the interesting positions is the expensive part. I don't think it's that simple. For example, the previous block may contain a very long value (say, 1MB), so a bunch of blocks have to be processed by the same worker. That probably makes the state transitions a bit more complicated, and it also means the bitmap would need to be passed to the worker that actually processes the block. 
Or we might just ignore this, on the grounds that it's not a very common situation. Of course, the question is whether parsing really is sufficiently expensive for this to be worth it. Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I think we have many years to come before that becomes a bottleneck. But if it turns out I'm wrong and we need to implement that, the path is pretty straightforward. OK. I agree the parsing is relatively cheap, and I don't recall seeing CSV parsing as a bottleneck in production. I suspect that might be simply because we're hitting other bottlenecks first, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel copy
On 30/10/2020 22:56, Tomas Vondra wrote: I agree this design looks simpler. I'm a bit worried about serializing the parsing like this, though. It's true the current approach (where the first phase of parsing happens in the leader) has a similar issue, but I think it would be easier to improve that in that design. My plan was to parallelize the parsing roughly like this: 1) split the input buffer into smaller chunks 2) let workers scan the buffers and record positions of interesting characters (delimiters, quotes, ...) and pass it back to the leader 3) use the information to actually parse the input data (we only need to look at the interesting characters, skipping large parts of data) 4) pass the parsed chunks to workers, just like in the current patch But maybe something like that would be possible even with the approach you propose - we could have a special parse phase for processing each buffer, where any worker could look for the special characters, record the positions in a bitmap next to the buffer. So the whole sequence of states would look something like this: EMPTY FILLED PARSED READY PROCESSING I think it's even simpler than that. You don't need to communicate the "interesting positions" between processes, if the same worker takes care of the chunk through all states from FILLED to DONE. You can build the bitmap of interesting positions immediately in FILLED state, independently of all previous blocks. Once you've built the bitmap, you need to wait for the information on where the first line starts, but presumably finding the interesting positions is the expensive part. Of course, the question is whether parsing really is sufficiently expensive for this to be worth it. Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I think we have many years to come before that becomes a bottleneck. But if it turns out I'm wrong and we need to implement that, the path is pretty straightforward. - Heikki
Re: Parallel copy
On Fri, Oct 30, 2020 at 06:41:41PM +0200, Heikki Linnakangas wrote: On 30/10/2020 18:36, Heikki Linnakangas wrote: I find this design to be very complicated. Why does the line-boundary information need to be in shared memory? I think this would be much simpler if each worker grabbed a fixed-size block of raw data, and processed that. In your patch, the leader process scans the input to find out where one line ends and another begins, and because of that decision, the leader needs to make the line boundaries available in shared memory, for the worker processes. If we moved that responsibility to the worker processes, you wouldn't need to keep the line boundaries in shared memory. A worker would only need to pass enough state to the next worker to tell it where to start scanning the next block. Here's a high-level sketch of how I'm imagining this to work: The shared memory structure consists of a queue of blocks, arranged as a ring buffer. Each block is of fixed size, and contains 64 kB of data, and a few fields for coordination: typedef struct { /* Current state of the block */ pg_atomic_uint32 state; /* starting offset of first line within the block */ int startpos; char data[64 kB]; } ParallelCopyDataBlock; Where state is one of: enum { FREE, /* buffer is empty */ FILLED, /* leader has filled the buffer with raw data */ READY, /* start pos has been filled in, but no worker process has claimed the block yet */ PROCESSING, /* worker has claimed the block, and is processing it */ } State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY progresses, the ring of blocks will always look something like this: blk 0 startpos 0: PROCESSING [worker 1] blk 1 startpos 12: PROCESSING [worker 2] blk 2 startpos 10: READY blk 3 startpos -: FILLED blk 4 startpos -: FILLED blk 5 startpos -: FILLED blk 6 startpos -: FREE blk 7 startpos -: FREE Typically, each worker process is busy processing a block. 
After the blocks being processed, there is one block in READY state, and after that, blocks in FILLED state. Leader process: The leader process is simple. It picks the next FREE buffer, fills it with raw data from the file, and marks it as FILLED. If no buffers are FREE, wait. Worker process: 1. Claim next READY block from queue, by changing its state to PROCESSING. If the next block is not READY yet, wait until it is. 2. Start scanning the block from 'startpos', finding end-of-line markers. (in CSV mode, need to track when we're in-quotes). 3. When you reach the end of the block, if the last line continues to next block, wait for the next block to become FILLED. Peek into the next block, and copy the remaining part of the split line to a local buffer, and set the 'startpos' on the next block to point to the end of the split line. Mark the next block as READY. 4. Process all the lines in the block, call input functions, insert rows. 5. Mark the block as DONE. In this design, you don't need to keep line boundaries in shared memory, because each worker process is responsible for finding the line boundaries of its own block. There's a point of serialization here, in that the next block cannot be processed, until the worker working on the previous block has finished scanning the EOLs, and set the starting position on the next block, putting it in READY state. That's not very different from your patch, where you had a similar point of serialization because the leader scanned the EOLs, but I think the coordination between processes is simpler here. I agree this design looks simpler. I'm a bit worried about serializing the parsing like this, though. It's true the current approach (where the first phase of parsing happens in the leader) has a similar issue, but I think it would be easier to improve that in that design. 
My plan was to parallelize the parsing roughly like this:

1) split the input buffer into smaller chunks

2) let workers scan the buffers and record positions of interesting
characters (delimiters, quotes, ...) and pass it back to the leader

3) use the information to actually parse the input data (we only need to
look at the interesting characters, skipping large parts of data)

4) pass the parsed chunks to workers, just like in the current patch

But maybe something like that would be possible even with the approach
you propose - we could have a special parse phase for processing each
buffer, where any worker could look for the special characters, record
the positions in a bitmap next to the buffer. So the whole sequence of
states would look something like this:

EMPTY
FILLED
PARSED
READY
PROCESSING

Of course, the question is whether parsing really is sufficiently
expensive for this to be worth it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel copy
Hi,

I've done a bit more testing today, and I think the parsing is busted in
some way. Consider this:

test=# create extension random;
CREATE EXTENSION
test=# create table t (a text);
CREATE TABLE
test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,1);
INSERT 0 1
test=# copy t to '/mnt/data/t.csv';
COPY 1
test=# truncate t;
TRUNCATE TABLE
test=# copy t from '/mnt/data/t.csv';
COPY 1
test=# truncate t;
TRUNCATE TABLE
test=# copy t from '/mnt/data/t.csv' with (parallel 2);
ERROR:  invalid byte sequence for encoding "UTF8": 0x00
CONTEXT:  COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
parallel worker

The functions come from an extension I use to generate random data; I've
pushed it to github [1]. The random_string() function generates a random
string with ASCII characters, symbols and a couple of special characters
(\r\n\t). The intent was to try loading data where a field may span
multiple 64kB blocks and may contain newlines etc.

The non-parallel copy works fine, the parallel one fails. I haven't
investigated the details, but I guess it gets confused about where a
string starts/ends, or something like that.

[1] https://github.com/tvondra/random

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel copy
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> Whether the leader process finds the EOLs or the worker processes, it's
> pretty clear that it needs to be done ASAP, for a chunk at a time,
> because that cannot be done in parallel. I think some refactoring in
> CopyReadLine() and friends would be in order. It probably would be
> faster, or at least not slower, to find all the EOLs in a block in one
> tight loop, even when parallel copy is not used.

Something like the attached. It passes the regression tests, but it's
quite incomplete. It's missing handling of "\." as end-of-file marker,
and I haven't tested encoding conversions at all, for starters.

Quick testing suggests that this is a little bit faster than the current
code, but the difference is small; I had to use a "WHERE false" option
to really see the difference.

The crucial thing here is that there's a new function, ParseLinesText(),
to find all end-of-line characters in a buffer in one go. In this patch,
it's used against 'raw_buf', but with parallel copy, you could point it
at a block in shared memory instead.

- Heikki

>From af3be3bd4e77b66f4605393617da0d15ec21e15b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Fri, 30 Oct 2020 18:51:10 +0200
Subject: [PATCH 1/1] WIP: Find all line-endings in COPY in chunks.

Refactor CopyReadLines and friends to find all the line-endings in the
buffer in one go, before splitting the lines further.
---
 src/backend/commands/copy.c | 972
 1 file changed, 536 insertions(+), 436 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 36ddcdccdb8..fbf11cb2550 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -95,6 +95,18 @@ typedef enum CopyInsertMethod
 	CIM_MULTI_CONDITIONAL	/* use table_multi_insert only if valid */
 } CopyInsertMethod;
 
+
+/*
+ * Represents the heap insert method to be used during COPY FROM.
+ */
+typedef enum ParseLinesState
+{
+	PLSTATE_NORMAL,
+	PLSTATE_ESCAPE,
+	PLSTATE_IN_QUOTE,
+	PLSTATE_ESCAPE_IN_QUOTE
+} ParseLinesState;
+
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -110,6 +122,24 @@ typedef enum CopyInsertMethod
  * it's faster to make useless comparisons to trailing bytes than it is to
  * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
  * when we have to do it the hard way.
+ *
+ * COPY FROM buffers:
+ *
+ * In COPY FROM processing, there are three levels of buffers:
+ *
+ * raw_buf - contains raw data read from file/client
+ * converted_buf - contains the data in 'raw_buf', but converted to server encoding
+ * line_buf - contains "current" line of data, without the end-of-line char
+ *
+ *
+ * In simple cases, no encoding conversion are needed, and converted_buf always
+ * points to raw_buf. If the encoding_embeds_ascii==true, encoding conversion is
+ * performed on the raw buffer, before splitting it to lines. converted_buf contains
+ * the converted version in that case.
+ *
+ * Usually, line_buf pointer points in the middle of converted_buf, but when a line
+ * is split by a raw-buffer boundary, the incomplete line is reassembled
+ * in a separate buffer (split_line_buf), and line_buf points to that.
 */
 typedef struct CopyStateData
 {
@@ -205,16 +235,34 @@ typedef struct CopyStateData
 	char	  **raw_fields;
 
 	/*
-	 * Similarly, line_buf holds the whole input line being processed. The
+	 * These variables are used to track state of parsing raw data into
+	 * lines in COPY FROM.
+	 */
+	bool		last_was_cr;
+	ParseLinesState parse_lines_state;
+
+	int			last_line_no;	/* last line in 'endlines', -1 if EOF not
+								 * reached yet */
+
+	int			nextline;
+	int		   *endlines;		/* line ending positions within raw_buf */
+	int			numlines;
+
+	/* split_line_buf holds partial line carried over from previous buf */
+	StringInfoData split_line_buf;
+
+	/*
+	 * Similarly, line_buf holds the current input line being processed. The
 	 * input cycle is first to read the whole line into line_buf, convert it
 	 * to server encoding there, and then extract the individual attribute
 	 * fields into attribute_buf. line_buf is preserved unmodified so that we
 	 * can display it in error messages if appropriate. (In binary mode,
 	 * line_buf is not used.)
 	 */
-	StringInfoData line_buf;
+	char	   *line_buf;
+	int			line_len;
 	bool		line_buf_converted; /* converted to server encoding? */
 	bool		line_buf_valid; /* contains the row being processed? */
+	bool		line_buf_alloced;
 
 	/*
 	 * Finally, raw_buf holds raw data read from the data source (file or
@@ -230,6 +278,9 @@ typedef struct CopyStateData
 	int			raw_buf_len;	/* total # of bytes stored */
 	/* Shorthand for number of unconsumed bytes available in raw_buf */
 #define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
+
+	char	   *converted_buf;
+	int			converted
Re: Parallel copy
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> I find this design to be very complicated. Why does the line-boundary
> information need to be in shared memory? I think this would be much
> simpler if each worker grabbed a fixed-size block of raw data, and
> processed that.
>
> In your patch, the leader process scans the input to find out where one
> line ends and another begins, and because of that decision, the leader
> needs to make the line boundaries available in shared memory, for the
> worker processes. If we moved that responsibility to the worker
> processes, you wouldn't need to keep the line boundaries in shared
> memory. A worker would only need to pass enough state to the next
> worker to tell it where to start scanning the next block.

Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as a
ring buffer. Each block is of fixed size, and contains 64 kB of data,
and a few fields for coordination:

typedef struct
{
    /* Current state of the block */
    pg_atomic_uint32 state;

    /* starting offset of first line within the block */
    int     startpos;

    char    data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
    FREE,       /* buffer is empty */
    FILLED,     /* leader has filled the buffer with raw data */
    READY,      /* startpos has been filled in, but no worker process
                 * has claimed the block yet */
    PROCESSING, /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY
progresses, the ring of blocks will always look something like this:

blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the
blocks being processed, there is one block in READY state, and after
that, blocks in FILLED state.

Leader process:

The leader process is simple. It picks the next FREE buffer, fills it
with raw data from the file, and marks it as FILLED. If no buffers are
FREE, wait.

Worker process:

1. Claim next READY block from queue, by changing its state to
PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line
markers. (In CSV mode, need to track when we're in-quotes.)

3. When you reach the end of the block, if the last line continues to
the next block, wait for the next block to become FILLED. Peek into the
next block, and copy the remaining part of the split line to a local
buffer, and set the 'startpos' on the next block to point to the end of
the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert
rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory,
because each worker process is responsible for finding the line
boundaries of its own block.

There's a point of serialization here, in that the next block cannot be
processed until the worker working on the previous block has finished
scanning the EOLs, and set the starting position on the next block,
putting it in READY state. That's not very different from your patch,
where you had a similar point of serialization because the leader
scanned the EOLs, but I think the coordination between processes is
simpler here.

- Heikki
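[As an editorial illustration of the "claim next READY block" step in the sketch above — a minimal, hypothetical C11 version using stdatomic rather than PostgreSQL's pg_atomic API; the names and sizes are invented for the example:]

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BLOCK_DATA_SIZE 64		/* 64 kB in the proposal; tiny here */

typedef enum
{
	BLK_FREE,					/* buffer is empty */
	BLK_FILLED,					/* leader has filled the buffer */
	BLK_READY,					/* startpos set, not yet claimed */
	BLK_PROCESSING				/* a worker has claimed the block */
} BlockState;

typedef struct
{
	_Atomic int state;			/* a BlockState; atomic so workers can CAS */
	int			startpos;		/* offset of first line within the block */
	char		data[BLOCK_DATA_SIZE];
} DataBlock;

/*
 * Worker side: try to claim a READY block by atomically moving it to
 * PROCESSING.  The compare-and-exchange guarantees that exactly one
 * worker wins, with no lock held.
 */
static bool
claim_block(DataBlock *blk)
{
	int			expected = BLK_READY;

	return atomic_compare_exchange_strong(&blk->state, &expected,
										  BLK_PROCESSING);
}
```

The single atomic compare-and-exchange is what makes the FREE -> FILLED -> READY -> PROCESSING hand-off race-free without any shared line-boundary table.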
Re: Parallel copy
On 27/10/2020 15:36, vignesh C wrote:
> Attached v9 patches have the fixes for the above comments.

I find this design to be very complicated. Why does the line-boundary
information need to be in shared memory? I think this would be much
simpler if each worker grabbed a fixed-size block of raw data, and
processed that.

In your patch, the leader process scans the input to find out where one
line ends and another begins, and because of that decision, the leader
needs to make the line boundaries available in shared memory, for the
worker processes. If we moved that responsibility to the worker
processes, you wouldn't need to keep the line boundaries in shared
memory. A worker would only need to pass enough state to the next worker
to tell it where to start scanning the next block.

Whether the leader process finds the EOLs or the worker processes, it's
pretty clear that it needs to be done ASAP, for a chunk at a time,
because that cannot be done in parallel. I think some refactoring in
CopyReadLine() and friends would be in order. It probably would be
faster, or at least not slower, to find all the EOLs in a block in one
tight loop, even when parallel copy is not used.

- Heikki
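[To make the "find all the EOLs in a block in one tight loop" idea concrete — an editorial sketch only, not the ParseLinesText() from the attached patch; it tracks CSV quoting but deliberately ignores escapes and multi-byte encodings:]

```c
#include <stdbool.h>

/*
 * Scan 'buf' once, recording the offset of every line-ending newline
 * in 'endlines'.  Newlines inside CSV quotes do not end a line.
 * Returns the number of line endings found.
 */
static int
find_line_endings(const char *buf, int len, char quote, int *endlines)
{
	bool		in_quote = false;
	int			n = 0;

	for (int i = 0; i < len; i++)
	{
		if (buf[i] == quote)
			in_quote = !in_quote;
		else if (buf[i] == '\n' && !in_quote)
			endlines[n++] = i;
	}
	return n;
}
```

A real implementation would also need the escape states the WIP patch models (PLSTATE_ESCAPE, PLSTATE_ESCAPE_IN_QUOTE), but even this form shows why one pass over the block is cheap and cache-friendly compared to re-entering CopyReadLine() per line.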
Re: Parallel copy
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C wrote:
> >
> > [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong.
>

One more point: I have noticed that some time back [1] I gave a
suggestion related to the way workers process the set of lines (aka
chunk). I think you can try increasing the chunk size to say 100, 500,
1000 and use some shared counter to remember the number of chunks
processed.

[1] - https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
Re: Parallel copy
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.
> I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
ERROR:  value 10 out of bounds for option "parallel"
DETAIL:  Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
ERROR:  parallel requires an integer value
postgres=#

Wouldn't it make more sense to only have one error message? The first
one seems to be the better message.

Regards
Daniel
Re: Parallel copy
On 27/10/2020 15:36, vignesh C wrote:
> Attached v9 patches have the fixes for the above comments.

I did some testing:

/tmp/longdata.pl:

#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for(my $i = 0; $i < $bytes; $i+=8){
    print "longdata";
}
print "\n";
print "bar\n";

postgres=# copy longdata from program 'perl /tmp/longdata.pl 1' with (parallel 2);

This gets stuck forever (or at least I didn't have the patience to wait
for it to finish). Both worker processes are consuming 100% of CPU.

- Heikki
Re: Parallel copy
On Tue, Oct 27, 2020 at 7:06 PM vignesh C wrote:
>
> [latest version]

I think the parallel-safety checks in this patch
(v9-0002-Allow-copy-from-command-to-process-data-from-file) are
incomplete and wrong. See the comments below.

1.
+static pg_attribute_always_inline bool
+CheckExprParallelSafety(CopyState cstate)
+{
+	if (contain_volatile_functions(cstate->whereClause))
+	{
+		if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
+			return false;
+	}

I don't understand the above check. Why do we only need to check the
where clause for parallel-safety when it contains volatile functions? It
should be checked otherwise as well, no? A similar comment applies to
the other checks in this function. Also, I don't think there is a need
to make this function inline.

2.
+/*
+ * IsParallelCopyAllowed
+ *
+ * Check if parallel copy can be allowed.
+ */
+bool
+IsParallelCopyAllowed(CopyState cstate)
{
..
+	 * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
+	 * not allow parallelism in such cases because such triggers might query
+	 * the table we are inserting into and act differently if the tuples that
+	 * have already been processed and prepared for insertion are not there.
+	 * Now, if we allow parallelism with such triggers the behaviour would
+	 * depend on if the parallel worker has already inserted or not that
+	 * particular tuples.
+	 */
+	if (cstate->rel->trigdesc != NULL &&
+		(cstate->rel->trigdesc->trig_insert_after_statement ||
+		 cstate->rel->trigdesc->trig_insert_new_table ||
+		 cstate->rel->trigdesc->trig_insert_before_row ||
+		 cstate->rel->trigdesc->trig_insert_after_row ||
+		 cstate->rel->trigdesc->trig_insert_instead_row))
+		return false;
..

Why do we need to disable parallelism for before/after row triggers
unless they have parallel-unsafe functions? I see that a few lines down
in this function you are checking the parallel-safety of trigger
functions; what is the use of that check if you are already disabling
parallelism with the above check?

3.
What if an index on the table has expressions that are parallel-unsafe?
What is your strategy to check parallel-safety for partitioned tables?

I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
think you will find that most of those checks are required here as well,
and see how we can use that patch (at least what is common). I feel the
first patch should be just the parallel-safety checks, which we can test
by trying to enable Copy with force_parallel_mode. We can build the rest
of the patch atop of it; in other words, let's move all the
parallel-safety work into a separate patch.

Few assorted comments:

1.
+/*
+ * ESTIMATE_NODE_SIZE - Estimate the size required for node type in shared
+ * memory.
+ */
+#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
+{ \
+	uint32 estsize = sizeof(uint32); \
+	if ((List *)list != NIL) \
+	{ \
+		listStr = nodeToString(list); \
+		estsize += strlen(listStr) + 1; \
+	} \
+	\
+	strsize = add_size(strsize, estsize); \
+}

This can probably be a function instead of a macro.

2.
+/*
+ * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in
+ * shared memory.
+ */
+#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
+{ \
+	strsize = add_size(strsize, sizeof(uint8)); \
+	strsize = add_size(strsize, (src) ? 1 : 0); \
+}

This could be an inline function.

3.
+/*
+ * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
+ */
+#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
+{ \
+	uint8 len = (src) ? 1 : 0; \
+	memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
+	copiedsize += sizeof(uint8); \
+	if (src) \
+		dest[copiedsize++] = src[0]; \
+}

Similarly, this could be a function. I think keeping such things as
macros in between the code makes it difficult to read. Please see if you
can make these and similar macros into functions, unless they only
perform a few memory instructions. Having functions also makes the code
easier to debug.
[1] - https://www.postgresql.org/message-id/CAJcOf-cgfjj0NfYPrNFGmQJxsnNW102LTXbzqxQJuziar1EKfQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
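[As an editorial illustration of the macro-to-function suggestion above — a hypothetical inline-function form of ESTIMATE_1BYTE_STR_SIZE; the stand-in add_size() here is plain addition, whereas PostgreSQL's real add_size() is overflow-checked:]

```c
#include <stddef.h>
#include <stdint.h>

typedef size_t Size;			/* stand-in for PostgreSQL's Size */

/* Simplified stand-in for PostgreSQL's overflow-checked add_size(). */
static Size
add_size(Size a, Size b)
{
	return a + b;
}

/*
 * Inline-function form of the ESTIMATE_1BYTE_STR_SIZE macro: account
 * for a one-byte length field, plus one data byte when 'src' is
 * non-NULL.  Unlike the macro, the function returns the new size
 * instead of mutating its argument, which makes the data flow explicit
 * and the call sites debuggable.
 */
static inline Size
estimate_1byte_str_size(const char *src, Size strsize)
{
	strsize = add_size(strsize, sizeof(uint8_t));	/* length byte */
	strsize = add_size(strsize, src ? 1 : 0);		/* optional data byte */
	return strsize;
}
```

The same pattern (take the running size in, return the updated size) applies to the ESTIMATE_NODE_SIZE and SERIALIZE_1BYTE_STR macros as well.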
RE: Parallel copy
Hi,

I found some issues in v9-0002.

1.
+
+	elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
+		 write_pos, lineInfo->first_block,
+		 pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
+		 offset, pg_atomic_read_u32(&lineInfo->line_size));
+

write_pos and the other variables printed here are of type uint32; I
think it's better to use '%u' in the elog message.

2.
+	 * line_size will be set. Read the line_size again to be sure if it is
+	 * completed or partial block.
+	 */
+	dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+	if (dataSize)

This uses dataSize (type int) to receive a uint32, which seems a little
dangerous. Would it be better to declare dataSize as uint32 here?

3.
Since functions with 'Cstate' in the name have been renamed to 'CState',
I think we can rename PopulateCommonCstateInfo as well.

4.
+	if (pcdata->worker_line_buf_count)

I think checks like the above can be written as 'if (xxx > 0)', which
seems easier to understand.

Best regards,
houzj
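[A small editorial demonstration of point 1 above: printing a uint32 with %d reinterprets the bits as signed, so values above INT32_MAX come out negative on common two's-complement platforms. The helper name is invented for the example:]

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format the same uint32 value twice: once with the wrong %d
 * specifier (after a cast to int, as the compiler would effectively
 * do), and once with the correct %u specifier.
 */
static void
fmt_both(uint32_t v, char *d_out, char *u_out, size_t n)
{
	/* %d treats the bit pattern as signed: large values print negative */
	snprintf(d_out, n, "%d", (int) v);
	/* %u prints the intended unsigned value */
	snprintf(u_out, n, "%u", v);
}
```

With a value like 3000000000 (a plausible line position in a large COPY), the difference is immediately visible in the DEBUG1 output.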
Re: Parallel copy
On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma wrote:
>
> > I think, if possible, all these if-else checks in CopyFrom() can be
> > moved to a single function which can probably be named as
> > IdentifyCopyInsertMethod() and this function can be called in
> > IsParallelCopyAllowed(). This will ensure that in case of Parallel
> > Copy when the leader has performed all these checks, the worker won't
> > do it again. I also feel that it will make the code look a bit
> > cleaner.
>
> Just rewriting the above comment to make it a bit more clear:
>
> I think, if possible, all these if-else checks in CopyFrom() should be
> moved to a separate function which can probably be named as
> IdentifyCopyInsertMethod() and this function called from the
> IsParallelCopyAllowed() and CopyFrom() functions. It will only be
> called from CopyFrom() when IsParallelCopy() returns false. This will
> ensure that in case of Parallel Copy, if the leader has performed all
> these checks, the worker won't do it again. I also feel that having a
> separate function containing all these checks will make the code look
> a bit cleaner.
>

In the recent patch posted we have changed it to simplify the check for
parallel copy; it is not an exact match. I feel this comment is not
applicable to the latest patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
Thanks Ashutosh for reviewing and providing your comments.

On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma wrote:
>
> Hi Vignesh,
>
> Thanks for the updated patches. Here are some more comments that I can
> find after reviewing your latest patches:
>
> +/*
> + * This structure helps in storing the common data from CopyStateData that are
> + * required by the workers. This information will then be allocated and stored
> + * into the DSM for the worker to retrieve and copy it to CopyStateData.
> + */
> +typedef struct SerializedParallelCopyState
> +{
> +	/* low-level state data */
> +	CopyDest	copy_dest;		/* type of copy source/destination */
> +	int			file_encoding;	/* file or remote side's character encoding */
> +	bool		need_transcoding;	/* file encoding diff from server? */
> +	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
> +
> ...
> ...
> +
> +	/* Working state for COPY FROM */
> +	AttrNumber	num_defaults;
> +	Oid			relid;
> +} SerializedParallelCopyState;
>
> Can the above structure not be part of the CopyStateData structure? I
> am just asking this question because all the fields present in the
> above structure are also present in the CopyStateData structure. So,
> including it in the CopyStateData structure will reduce the code
> duplication and will also make CopyStateData a bit shorter.
>

I have removed the common members from the structure; now there are no
common members between CopyStateData & the new structure. I'm using
CopyStateData to copy to/from directly in the new patch.

> --
>
> +	pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
> +							 relid);
>
> Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
> when we are already passing the cstate structure, using which both of
> these pieces of information can be retrieved?
>

nworkers need not be passed as you have suggested, but relid needs to be
passed as we will be setting it in pcdata; modified nworkers as
suggested.

> --
>
> +/* DSM keys for parallel copy.
>  */
> +#define PARALLEL_COPY_KEY_SHARED_INFO	1
> +#define PARALLEL_COPY_KEY_CSTATE		2
> +#define PARALLEL_COPY_WAL_USAGE			3
> +#define PARALLEL_COPY_BUFFER_USAGE		4
>
> DSM key names do not appear to be consistent. For shared info and
> cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
> but for WalUsage and BufferUsage structures, it is prefixed with
> "PARALLEL_COPY". I think it would be better to make them consistent.
>

Modified as suggested.

> --
>
> if (resultRelInfo->ri_TrigDesc != NULL &&
>     (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>      resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
> {
>     /*
>      * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
>      * triggers on the table. Such triggers might query the table we're
>      * inserting into and act differently if the tuples that have already
>      * been processed and prepared for insertion are not there.
>      */
>     insertMethod = CIM_SINGLE;
> }
> else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
>          resultRelInfo->ri_TrigDesc->trig_insert_new_table)
> {
>     /*
>      * For partitioned tables we can't support multi-inserts when there
>      * are any statement level insert triggers. It might be possible to
>      * allow partitioned tables with such triggers in the future, but for
>      * now, CopyMultiInsertInfoFlush expects that any before row insert
>      * and statement level insert triggers are on the same relation.
>      */
>     insertMethod = CIM_SINGLE;
> }
> else if (resultRelInfo->ri_FdwRoutine != NULL ||
>          cstate->volatile_defexprs)
> {
> ...
> ...
>
> I think, if possible, all these if-else checks in CopyFrom() can be
> moved to a single function which can probably be named as
> IdentifyCopyInsertMethod() and this function can be called in
> IsParallelCopyAllowed(). This will ensure that in case of Parallel
> Copy when the leader has performed all these checks, the worker won't
> do it again. I also feel that it will make the code look a bit
> cleaner.
>

In the recent patch posted we have changed it to simplify the check for
parallel copy; it is not an exact match. I feel this comment is not
applicable to the latest patch.

> --
>
> +void
> +ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
> +{
> ...
> ...
> +	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> +						  &walusage[ParallelWorkerNumber]);
> +
> +	MemoryContextSwitchTo(oldcontext);
> +	pfree(cstate);
> +	return;
> +}
>
> It seems like you also need to delete the memory context
> (cstate->copycontext) here.
>

Added it.

> --
>
> +void
> +ExecBeforeStmtTrigger(CopyState cstate)
> +{
> +	ESta
Re: Parallel copy
On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila wrote:
>
> On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy wrote:
> >
> > 9. Instead of calling CopyStringToSharedMemory() for each string
> > variable, can't we just create a linked list of all the strings that
> > need to be copied into shm and call CopyStringToSharedMemory() only
> > once? We could avoid 5 function calls?
> >
>
> If we want to avoid different function calls then can't we just store
> all these strings in a local structure and use it? That might improve
> the other parts of code as well where we are using these as individual
> parameters.
>

I have made one structure, SerializedListToStrCState, to store all the
variables. The rest of the common variables are directly copied from &
into cstate.

> > 10. Similar to above comment: can we fill all the required
> > cstate->variables inside the function CopyNodeFromSharedMemory() and
> > call it only once? In each worker we could save the overhead of 5
> > function calls.
> >
>
> Yeah, that makes sense.
>

I feel keeping it this way makes the code more readable, and also this
is not in a performance-intensive tight loop. I'm retaining the change
as is unless we feel this will make an impact.

This is addressed in the v9 patch shared at [1].

[1] - https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
Thanks Heikki for reviewing and providing your comments. Please find my
thoughts below.

On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas wrote:
>
> I had a brief look at this patch. Important work! A couple of first
> impressions:
>
> 1. The split between patches
> 0002-Framework-for-leader-worker-in-parallel-copy.patch and
> 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite
> artificial. All the stuff introduced in the first is unused until the
> second patch is applied. The first patch introduces a forward
> declaration for ParallelCopyData(), but the function only comes in the
> second patch. The comments in the first patch talk about
> LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only
> comes in the second patch. I think these have to be merged into one. If
> you want to split it somehow, I'd suggest having a separate patch just
> to move CopyStateData from copy.c to copy.h. The subsequent patch would
> then be easier to read, as you could see more easily what's being added
> to CopyStateData. Actually, I think it would be better to have a new
> header file, copy_internal.h, to hold CopyStateData and the other
> structs, and keep copy.h as it is.
>

I have merged the 0002 & 0003 patches. I have moved a few things, like
the creation of copy_internal.h and moving CopyStateData from copy.c
into copy_internal.h, into the 0001 patch.

> 2. This desperately needs some kind of a high-level overview of how it
> works. What is a leader, what is a worker? Which process does each step
> of COPY processing, like reading from the file/socket, splitting the
> input into lines, handling escapes, calling input functions, and
> updating the heap and indexes? What data structures are used for the
> communication? How is the work synchronized between the processes?
> There are comments on those individual aspects scattered in the patch,
> but if you're not already familiar with it, you don't know where to
> start.
> There's some of that in the commit message, but it needs to be
> somewhere in the source code, maybe in a long comment at the top of
> copyparallel.c.
>

Added it in copyparallel.c.

> 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for
> every input line. Doesn't that incur a lot of synchronization overhead?
> I haven't done any testing, this is just my gut feeling, but I assumed
> you'd work in batches of, say, 100 or 1000 lines each.
>

Data read from the file will be stored in the DSM, which is of size 64k
* 1024. The leader will parse and identify the line boundaries: which
data block a line starts in, the starting offset within the data block,
and the line size; this information is kept in ParallelCopyLineBoundary.
Like you said, each worker processes WORKER_CHUNK_COUNT (64) lines at a
time. Performance test results for parallel copy are available at [1].

This is addressed in the v9 patch shared at [2].

[1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy wrote:
>
> On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy wrote:
> >
> > 17. Remove extra lines after #define IsHeaderLine()
> > (cstate->header_line && cstate->cur_lineno == 1) in copy.h
> >
>
> I missed one comment:
>
> 18. I think we need to treat the number of parallel workers as an
> integer, similar to the parallel option in vacuum.
>
> postgres=# copy t1 from stdin with(parallel '1');  <- we should not
> allow this.
> Enter data to be copied followed by a newline.
>
> postgres=# vacuum (parallel '1') t1;
> ERROR:  parallel requires an integer value
>

I have made the behavior the same as vacuum.

This is addressed in the v9 patch shared at [1].

[1] - https://www.postgresql.org/message-id/caldanm1caonkfdn6k72dsirpgqngvwxql7tjeihz58opnp9...@mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma wrote: > > Hi Vignesh, > > Thanks for the updated patches. Here are some more comments that I can > find after reviewing your latest patches: > > +/* > + * This structure helps in storing the common data from CopyStateData that > are > + * required by the workers. This information will then be allocated and > stored > + * into the DSM for the worker to retrieve and copy it to CopyStateData. > + */ > +typedef struct SerializedParallelCopyState > +{ > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > ... > ... > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +} SerializedParallelCopyState; > > Can the above structure not be part of the CopyStateData structure? I > am just asking this question because all the fields present in the > above structure are also present in the CopyStateData structure. So, > including it in the CopyStateData structure will reduce the code > duplication and will also make CopyStateData a bit shorter. > > -- > > + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, > + relid); > > Do we need to pass cstate->nworkers and relid to BeginParallelCopy() > function when we are already passing the cstate structure, using which > both of these pieces of information can be retrieved? > > -- > > +/* DSM keys for parallel copy. */ > +#define PARALLEL_COPY_KEY_SHARED_INFO 1 > +#define PARALLEL_COPY_KEY_CSTATE 2 > +#define PARALLEL_COPY_WAL_USAGE 3 > +#define PARALLEL_COPY_BUFFER_USAGE 4 > > DSM key names do not appear to be consistent. For shared info and > cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", > but for WalUsage and BufferUsage structures, it is prefixed with > "PARALLEL_COPY". 
I think it would be better to make them consistent. > > -- > > if (resultRelInfo->ri_TrigDesc != NULL && > (resultRelInfo->ri_TrigDesc->trig_insert_before_row || > resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) > { > /* > * Can't support multi-inserts when there are any BEFORE/INSTEAD OF > * triggers on the table. Such triggers might query the table we're > * inserting into and act differently if the tuples that have already > * been processed and prepared for insertion are not there. > */ > insertMethod = CIM_SINGLE; > } > else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && > resultRelInfo->ri_TrigDesc->trig_insert_new_table) > { > /* > * For partitioned tables we can't support multi-inserts when there > * are any statement level insert triggers. It might be possible to > * allow partitioned tables with such triggers in the future, but for > * now, CopyMultiInsertInfoFlush expects that any before row insert > * and statement level insert triggers are on the same relation. > */ > insertMethod = CIM_SINGLE; > } > else if (resultRelInfo->ri_FdwRoutine != NULL || > cstate->volatile_defexprs) > { > ... > ... > > I think, if possible, all these if-else checks in CopyFrom() can be > moved to a single function which can probably be named as > IdentifyCopyInsertMethod() and this function can be called in > IsParallelCopyAllowed(). This will ensure that in case of Parallel > Copy when the leader has performed all these checks, the worker won't > do it again. I also feel that it will make the code look a bit > cleaner. > Just rewriting above comment to make it a bit more clear: I think, if possible, all these if-else checks in CopyFrom() should be moved to a separate function which can probably be named as IdentifyCopyInsertMethod() and this function called from IsParallelCopyAllowed() and CopyFrom() functions. It will only be called from CopyFrom() when IsParallelCopy() returns false. 
This will ensure that in case of Parallel Copy if the leader has performed all these checks, the worker won't do it again. I also feel that having a separate function containing all these checks will make the code look a bit cleaner. > -- > > +void > +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) > +{ > ... > ... > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > + &walusage[ParallelWorkerNumber]); > + > + MemoryContextSwitchTo(oldcontext); > + pfree(cstate); > + return; > +} > > It seems like you also need to delete the memory context > (cstate->copycontext) here. > > -- > > +void > +ExecBeforeStmtTrigger(CopyState cstate) > +{ > + EState
Re: Parallel copy
Hi Vignesh, Thanks for the updated patches. Here are some more comments that I can find after reviewing your latest patches: +/* + * This structure helps in storing the common data from CopyStateData that are + * required by the workers. This information will then be allocated and stored + * into the DSM for the worker to retrieve and copy it to CopyStateData. + */ +typedef struct SerializedParallelCopyState +{ + /* low-level state data */ + CopyDest copy_dest; /* type of copy source/destination */ + int file_encoding; /* file or remote side's character encoding */ + bool need_transcoding; /* file encoding diff from server? */ + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ + ... ... + + /* Working state for COPY FROM */ + AttrNumber num_defaults; + Oid relid; +} SerializedParallelCopyState; Can the above structure not be part of the CopyStateData structure? I am just asking this question because all the fields present in the above structure are also present in the CopyStateData structure. So, including it in the CopyStateData structure will reduce the code duplication and will also make CopyStateData a bit shorter. -- + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, + relid); Do we need to pass cstate->nworkers and relid to BeginParallelCopy() function when we are already passing the cstate structure, using which both of these pieces of information can be retrieved? -- +/* DSM keys for parallel copy. */ +#define PARALLEL_COPY_KEY_SHARED_INFO 1 +#define PARALLEL_COPY_KEY_CSTATE 2 +#define PARALLEL_COPY_WAL_USAGE 3 +#define PARALLEL_COPY_BUFFER_USAGE 4 DSM key names do not appear to be consistent. For shared info and cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", but for WalUsage and BufferUsage structures, it is prefixed with "PARALLEL_COPY". I think it would be better to make them consistent. 
-- if (resultRelInfo->ri_TrigDesc != NULL && (resultRelInfo->ri_TrigDesc->trig_insert_before_row || resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) { /* * Can't support multi-inserts when there are any BEFORE/INSTEAD OF * triggers on the table. Such triggers might query the table we're * inserting into and act differently if the tuples that have already * been processed and prepared for insertion are not there. */ insertMethod = CIM_SINGLE; } else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && resultRelInfo->ri_TrigDesc->trig_insert_new_table) { /* * For partitioned tables we can't support multi-inserts when there * are any statement level insert triggers. It might be possible to * allow partitioned tables with such triggers in the future, but for * now, CopyMultiInsertInfoFlush expects that any before row insert * and statement level insert triggers are on the same relation. */ insertMethod = CIM_SINGLE; } else if (resultRelInfo->ri_FdwRoutine != NULL || cstate->volatile_defexprs) { ... ... I think, if possible, all these if-else checks in CopyFrom() can be moved to a single function which can probably be named as IdentifyCopyInsertMethod() and this function can be called in IsParallelCopyAllowed(). This will ensure that in case of Parallel Copy when the leader has performed all these checks, the worker won't do it again. I also feel that it will make the code look a bit cleaner. -- +void +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) +{ ... ... + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], + &walusage[ParallelWorkerNumber]); + + MemoryContextSwitchTo(oldcontext); + pfree(cstate); + return; +} It seems like you also need to delete the memory context (cstate->copycontext) here. 
-- +void +ExecBeforeStmtTrigger(CopyState cstate) +{ + EState *estate = CreateExecutorState(); + ResultRelInfo *resultRelInfo; This function has a lot of comments which have been copied as is from the CopyFrom function. I think it would be good to remove those comments from here and instead mention that the code changes done in this function have been taken from the CopyFrom function; for any queries, people may refer to the CopyFrom function. This will again avoid unnecessary code in the patch. -- As Heikki rightly pointed out in his previous email, we need some high-level description of how Parallel Copy works somewhere in the copyparallel.c file. For reference, please see how a brief description of parallel vacuum has been added in the vacuumlazy.c file. * Lazy vacuum supports parallel execution with parallel worker processes. In * a parallel vacuum, we perform both index vacuum and index cleanup with * parallel worker processes.
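The suggested extraction of the insert-method decision could look roughly like the sketch below. Everything here is illustrative rather than the patch's actual code: the IdentifyCopyInsertMethod() signature is hypothetical, the CIM_* conditions mirror the CopyFrom() excerpt quoted above, and the relation/trigger state is flattened into plain booleans so the sketch stays self-contained.

```c
#include <stdbool.h>

typedef enum
{
    CIM_SINGLE,             /* insert one tuple at a time            */
    CIM_MULTI,              /* use multi-inserts everywhere          */
    CIM_MULTI_CONDITIONAL   /* multi-inserts, decided per partition  */
} CopyInsertMethod;

/* Flattened stand-ins for the state checked in CopyFrom() */
typedef struct
{
    bool has_before_or_instead_row_trigger;
    bool is_partitioned;                      /* proute != NULL            */
    bool has_insert_new_table_trigger;        /* transition-table trigger  */
    bool has_fdw_routine_or_volatile_defaults;
} CopyTargetInfo;

/*
 * Hypothetical IdentifyCopyInsertMethod(): the if/else chain from
 * CopyFrom() extracted into one function, so the leader can decide once
 * and workers need not repeat the checks.
 */
CopyInsertMethod
IdentifyCopyInsertMethod(const CopyTargetInfo *t)
{
    /* BEFORE/INSTEAD OF row triggers rule out multi-inserts entirely */
    if (t->has_before_or_instead_row_trigger)
        return CIM_SINGLE;
    /* partitioned tables with statement-level transition triggers too */
    if (t->is_partitioned && t->has_insert_new_table_trigger)
        return CIM_SINGLE;
    if (t->has_fdw_routine_or_volatile_defaults)
        return CIM_SINGLE;
    /* partitioned targets decide per partition, otherwise multi-insert */
    return t->is_partitioned ? CIM_MULTI_CONDITIONAL : CIM_MULTI;
}
```

With such a helper, IsParallelCopyAllowed() and CopyFrom() could share one decision point instead of duplicating the chain.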
Re: Parallel copy
I had a brief look at this patch. Important work! A couple of first impressions: 1. The split between patches 0002-Framework-for-leader-worker-in-parallel-copy.patch and 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite artificial. All the stuff introduced in the first is unused until the second patch is applied. The first patch introduces a forward declaration for ParallelCopyData(), but the function only comes in the second patch. The comments in the first patch talk about LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only comes in the second patch. I think these have to be merged into one. If you want to split it somehow, I'd suggest having a separate patch just to move CopyStateData from copy.c to copy.h. The subsequent patch would then be easier to read as you could see more easily what's being added to CopyStateData. Actually I think it would be better to have a new header file, copy_internal.h, to hold CopyStateData and the other structs, and keep copy.h as it is. 2. This desperately needs some kind of a high-level overview of how it works. What is a leader, what is a worker? Which process does each step of COPY processing, like reading from the file/socket, splitting the input into lines, handling escapes, calling input functions, and updating the heap and indexes? What data structures are used for the communication? How is the work synchronized between the processes? There are comments on those individual aspects scattered in the patch, but if you're not already familiar with it, you don't know where to start. There's some of that in the commit message, but it needs to be somewhere in the source code, maybe in a long comment at the top of copyparallel.c. 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for every input line. Doesn't that incur a lot of synchronization overhead? 
I haven't done any testing, this is just my gut feeling, but I assumed you'd work in batches of, say, 100 or 1000 lines each. - Heikki
Re: Parallel copy
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy wrote: > > 17. Remove extra lines after #define IsHeaderLine() > (cstate->header_line && cstate->cur_lineno == 1) in copy.h > I missed one comment: 18. I think we need to treat the number of parallel workers as an integer, similar to the parallel option in vacuum. postgres=# copy t1 from stdin with(parallel '1'); <-- we should not allow this. Enter data to be copied followed by a newline. postgres=# vacuum (parallel '1') t1; ERROR: parallel requires an integer value With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy wrote: > > > 9. Instead of calling CopyStringToSharedMemory() for each string > variable, can't we just create a linked list of all the strings that > need to be copied into shm and call CopyStringToSharedMemory() only > once? We could avoid 5 function calls? > If we want to avoid different function calls then can't we just store all these strings in a local structure and use it? That might improve the other parts of code as well where we are using these as individual parameters. > 10. Similar to above comment: can we fill all the required > cstate->variables inside the function CopyNodeFromSharedMemory() and > call it only once? In each worker we could save overhead of 5 function > calls. > Yeah, that makes sense. -- With Regards, Amit Kapila.
Re: Parallel copy
Hi Vignesh, I took a look at the v8 patch set. Here are some comments: 1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo() or PopulateCopyStateInfo()? And also EstimateCstateSize() -- EstimateCStateSize(), PopulateCstateCatalogInfo() -- PopulateCStateCatalogInfo()? 2. Instead of mentioning numbers like 1024, 64K, 10240 in the comments, can we represent them in terms of macros? /* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */ #define MAX_BLOCKS_COUNT 1024 /* * It can hold up to 10240 record information for worker to process. RINGSIZE 3. How about " Each worker at once will pick the WORKER_CHUNK_COUNT records from the DSM data blocks and store them in its local memory. This is to make workers not contend much while getting record information from the DSM. Read RINGSIZE comments before changing this value. " instead of /* * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data * block to process to avoid lock contention. Read RINGSIZE comments before * changing this value. */ 4. How about one line gap before and after for comments: "Leader should operate in the following order:" and "Worker should operate in the following order:" 5. Can we move the RAW_BUF_BYTES macro definition to the beginning of copy.h where all the macros are defined? 6. I don't think we need the change in toast_internals.c with the temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed)); in GetCurrentCommandId() 7. I think /* Can't perform copy in parallel */ if (parallel_workers <= 0) return NULL; can be /* Can't perform copy in parallel */ if (parallel_workers == 0) return NULL; as parallel_workers can never be < 0 since we enter BeginParallelCopy only if cstate->nworkers > 0 and also we are not allowed to have negative values for max_worker_processes. 8. Do we want to pfree(cstate->pcdata) in case we failed to start any parallel workers? We would have allocated a good else { /* * Reset nworkers to -1 here. 
This is useful in cases where the user * specifies parallel workers, but no worker is picked up, so go * back to the non-parallel mode value of nworkers. */ cstate->nworkers = -1; *processed = CopyFrom(cstate);/* copy from file to database */ } 9. Instead of calling CopyStringToSharedMemory() for each string variable, can't we just create a linked list of all the strings that need to be copied into shm and call CopyStringToSharedMemory() only once? We could avoid 5 function calls? 10. Similar to the above comment: can we fill all the required cstate->variables inside the function CopyNodeFromSharedMemory() and call it only once? In each worker we could save the overhead of 5 function calls. 11. Looks like CopyStringFromSharedMemory() and CopyNodeFromSharedMemory() do almost the same things except stringToNode() and pfree(destptr);. Can we have a generic function CopyFromSharedMemory() or something else and handle the two use cases with a flag "bool isnode"? 12. Can we move the below check to the end in IsParallelCopyAllowed()? /* Check parallel safety of the trigger functions. */ if (cstate->rel->trigdesc != NULL && !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc)) return false; 13. CacheLineInfo(): Instead of goto empty_data_line_update; how about having this directly inside the if block as it's being used only once? 14. GetWorkerLine(): How about avoiding the goto statements and replacing the common code with a static inline function or a macro? 15. UpdateSharedLineInfo(): The below line is misaligned. lineInfo->first_block = blk_pos; lineInfo->start_offset = offset; 16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the start of for (;;)? 17. Remove extra lines after #define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1) in copy.h With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Oct 8, 2020 at 11:15 AM vignesh C wrote: > > I'm summarizing the pending open points so that I don't miss anything: > 1) Performance test on latest patch set. It is tested and results are shared by bharath at [1] > 2) Testing points suggested. Tests are added as suggested and details shared by bharath at [1] > 3) Support of parallel copy for COPY_OLD_FE. It is handled as part of v8 patch shared at [2] > 4) Worker has to hop through all the processed chunks before getting > the chunk which it can process. Open > 5) Handling of Tomas's comments. I have fixed and updated the fix details as part of [3] > 6) Handling of Greg's comments. I have fixed and updated the fix details as part of [4] Except for "4) Worker has to hop through all the processed chunks before getting the chunk which it can process", all open tasks are handled. I will work on this and provide an update shortly. [1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/caldanm2ucmcmozcbkl8b7az9oyd9hz+fndczhssiiqj4v-x...@mail.gmail.com [3] https://www.postgresql.org/message-id/CALDaNm0_zUa9%2BS%3DpwCz3Yp43SY3r9bnO4v-9ucXUujEE%3D0Sd7g%40mail.gmail.com [4] https://www.postgresql.org/message-id/CALDaNm31pGG%2BL9N4HbM0mO4iuceih4mJ5s87jEwOPaFLpmDKyQ%40mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila wrote: > > > > 2. Do we have tests for toast tables? I think if you implement the > > previous point some existing tests might cover it but I feel we should > > have at least one or two tests for the same. > > > Toast table use case 1: 1 tuples, 9.6GB data, 3 indexes (2 on integer columns, 1 on a text column, not the toast column), csv file, each row is > 1320KB: > (222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X) > > Toast table use case 2: 10 tuples, 96GB data, 3 indexes (2 on integer columns, 1 on a text column, not the toast column), csv file, each row is > 1320KB: > (2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X) > > Toast table use case 3: 1 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB: > (136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X) > > In the case of a toast table, we could achieve up to 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon. > I analyzed the above point of getting only up to 2.5X performance improvement for csv files with a toast table with 3 indexes - 2 on integer columns and 1 on a text column (not the toast column). The reason is that the workers are fast enough to do the work and they are waiting for the leader to fill in the data blocks, and in this case the leader is able to serve the workers at its maximum possible speed. Hence most of the time the workers are waiting, not doing any beneficial work. Having observed the above point, I tried to make the workers perform more work to avoid the waiting time. 
For this, I added a gist index on the toasted text column. The use case and results are as follows. Toast table use case 4: 1 tuples, 9.6GB, 4 indexes - 2 on integer columns, 1 on the non-toasted text column and 1 gist index on the toasted text column, csv file, each row is ~ 12.2KB: (1322.839, 0, 1X), (1261.176, 1, 1.05X), (632.296, 2, 2.09X), (321.941, 4, 4.11X), (181.796, 8, 7.27X), *(105.750, 16, 12.51X)*, (107.099, 20, 12.35X), (123.262, 30, 10.73X) With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie wrote: > > Hi Vignesh, > > After having a look over the patch, > I have some suggestions for > 0003-Allow-copy-from-command-to-process-data-from-file.patch. > > 1. > > +static uint32 > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List > *attnamelist, > + char **whereClauseStr, char > **rangeTableStr, > + char **attnameListStr, char > **notnullListStr, > + char **nullListStr, char **convertListStr) > +{ > + uint32 strsize = > MAXALIGN(sizeof(SerializedParallelCopyState)); > + > + strsize += EstimateStringSize(cstate->null_print); > + strsize += EstimateStringSize(cstate->delim); > + strsize += EstimateStringSize(cstate->quote); > + strsize += EstimateStringSize(cstate->escape); > > > It uses the function EstimateStringSize to get the strlen of null_print, delim, > quote and escape. > But the length of null_print seems to have been stored in null_print_len. > And delim/quote/escape must be 1 byte, so calling strlen again seems > unnecessary. > > How about " strsize += sizeof(uint32) + cstate->null_print_len + 1" > +1. This seems like a good suggestion, but add comments for delim/quote/escape to indicate that we are considering one byte for each. I think this will obviate the need for the function EstimateStringSize. Another thing in this regard is that we normally use the add_size function to compute the size, but I don't see that being used in this and nearby computations. That helps us detect overflow of the addition, if any. EstimateCstateSize() { .. + + strsize++; .. } Why do we need this additional one-byte increment? Does it make sense to add a small comment for the same? > 2. > + strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr); > > + copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr, > + > shmptr + copiedsize); > > Some string lengths are counted twice. > 'whereClauseStr' has strlen called on it once in EstimateNodeSize and again > in CopyStringToSharedMemory. 
> I don't know whether it's worth refactoring the code to avoid the duplicate > strlen. What do you think? > It doesn't seem worth it to me. We probably need to use additional variables to save those lengths. I think it will add more code/complexity than we will save. See EstimateParamListSpace and SerializeParamList where we get the typeLen each time; that way the code looks neat to me, and we are not going to save much by not following a similar thing here. -- With Regards, Amit Kapila.
RE: Parallel copy
Hi Vignesh, After having a look over the patch, I have some suggestions for 0003-Allow-copy-from-command-to-process-data-from-file.patch. 1. +static uint32 +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist, + char **whereClauseStr, char **rangeTableStr, + char **attnameListStr, char **notnullListStr, + char **nullListStr, char **convertListStr) +{ + uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState)); + + strsize += EstimateStringSize(cstate->null_print); + strsize += EstimateStringSize(cstate->delim); + strsize += EstimateStringSize(cstate->quote); + strsize += EstimateStringSize(cstate->escape); It uses the function EstimateStringSize to get the strlen of null_print, delim, quote and escape. But the length of null_print seems to have been stored in null_print_len. And delim/quote/escape must be 1 byte, so calling strlen again seems unnecessary. How about " strsize += sizeof(uint32) + cstate->null_print_len + 1" 2. + strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr); + copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr, + shmptr + copiedsize); Some string lengths are counted twice: 'whereClauseStr' has strlen called on it once in EstimateNodeSize and again in CopyStringToSharedMemory. I don't know whether it's worth refactoring the code to avoid the duplicate strlen. What do you think? Best regards, houzj
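The suggestion above, a fixed uint32 length prefix plus an already-known string length instead of re-running strlen(), can be sketched in isolation. The function names below are illustrative, not the patch's actual code; the key invariant is that the estimate and the serializer agree byte-for-byte:

```c
#include <stdint.h>
#include <string.h>

/*
 * Bytes needed to serialize a (possibly NULL) string: a uint32 length
 * prefix followed by the bytes including the terminating NUL.
 */
uint32_t
estimate_string_size(const char *s)
{
    return (uint32_t) (sizeof(uint32_t) + (s ? strlen(s) + 1 : 0));
}

/*
 * Serialize the string into dest; returns the number of bytes written,
 * which is always exactly what estimate_string_size() predicted.
 */
uint32_t
serialize_string(char *dest, const char *s)
{
    uint32_t len = s ? (uint32_t) strlen(s) + 1 : 0;

    memcpy(dest, &len, sizeof(uint32_t));
    if (len)
        memcpy(dest + sizeof(uint32_t), s, len);
    return (uint32_t) sizeof(uint32_t) + len;
}

/*
 * When the length is already tracked (null_print_len in the patch) or
 * known to be a single byte (delim/quote/escape), the estimate needs no
 * strlen() at all, e.g.:
 *     strsize += sizeof(uint32_t) + null_print_len + 1;
 *     strsize += sizeof(uint32_t) + 1 + 1;    // one-byte delimiter
 */
```

Keeping the estimate and serializer as a matched pair is what makes it safe to drop the redundant strlen() calls.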
Re: Parallel copy
On Wed, Oct 14, 2020 at 6:51 PM vignesh C wrote: > > On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila wrote: > > > > I am not able to properly parse the data, but if I understand correctly the wal > > data for the non-parallel (1116 | 0 | 3587203) and parallel (1119 > > | 6 | 3624405) cases doesn't seem to be the same. Is that > > right? If so, why? Please ensure that no checkpoint happens for both > > cases. > > > > I have disabled checkpointing; the results with the checkpoint disabled > are given below:
>                          | wal_records | wal_fpi | wal_bytes
> Sequential Copy          | 1116        | 0       | 3587669
> Parallel Copy(1 worker)  | 1116        | 0       | 3587669
> Parallel Copy(4 worker)  | 1121        | 0       | 3587668
> I noticed that for 1 worker wal_records & wal_bytes are the same as for > sequential copy, but with a different worker count I noticed that > there is a difference in wal_records & wal_bytes. I think the difference > should be ok because with more than 1 worker the order of records > processed will be different based on which worker picks which records > to process from the input file. In the case of sequential copy/1 worker > the order in which the records will be processed is always the same, > hence wal_bytes are the same. > Are all records of the same size in your test? If so, then why should the order matter? Also, even though the number of wal_records has increased, wal_bytes have not increased; rather it is one byte less. Can we identify what is going on here? I don't intend to say that it is a problem, but we should know the reason clearly. -- With Regards, Amit Kapila.
Re: Parallel copy
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow wrote: > > On Thu, Oct 8, 2020 at 5:44 AM vignesh C wrote: > > > Attached v6 patch with the fixes. > > > > Hi Vignesh, > > I noticed a couple of issues when scanning the code in the following patch: > > v6-0003-Allow-copy-from-command-to-process-data-from-file.patch > > In the following code, it will put a junk uint16 value into *destptr > (and thus may well cause a crash) on a Big Endian architecture > (Solaris Sparc, s390x, etc.): > You're storing a (uint16) string length in a uint32 and then pulling > out the lower two bytes of the uint32 and copying them into the > location pointed to by destptr. > > > static void > +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr, > + uint32 *copiedsize) > +{ > + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0; > + > + memcpy(destptr, (uint16 *) &len, sizeof(uint16)); > + *copiedsize += sizeof(uint16); > + if (len) > + { > + memcpy(destptr + sizeof(uint16), srcPtr, len); > + *copiedsize += len; > + } > +} > > I suggest you change the code to: > > uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0; > memcpy(destptr, &len, sizeof(uint16)); > > [I assume string length here can't ever exceed (65535 - 1), right?] > > Looking a bit deeper into this, I'm wondering if in fact your > EstimateStringSize() and EstimateNodeSize() functions should be using > BUFFERALIGN() for EACH stored string/node (rather than just calling > shm_toc_estimate_chunk() once at the end, after the length of packed > strings and nodes has been estimated), to ensure alignment of start of > each string/node. Other Postgres code appears to be aligning each > stored chunk using shm_toc_estimate_chunk(). See the definition of > that macro and its current usages. > I'm not handling this, this is similar to how it is handled in other places. > Then you could safely use: > > uint16 len = srcPtr ? 
(uint16)strlen(srcPtr) + 1 : 0; > *(uint16 *)destptr = len; > *copiedsize += sizeof(uint16); > if (len) > { > memcpy(destptr + sizeof(uint16), srcPtr, len); > *copiedsize += len; > } > > and in the CopyStringFromSharedMemory() function, then could safely use: > > len = *(uint16 *)srcPtr; > > The compiler may be smart enough to optimize-away the memcpy() in this > case anyway, but there are issues in doing this for architectures that > take a performance hit for unaligned access, or don't support > unaligned access. Changed it to uint32, so that there are no issues in case the length exceeds 65535 and also to avoid problems on Big Endian architectures. > Also, in CopyFromSharedMemory() functions, you should use palloc() > instead of palloc0(), as you're filling the entire palloc'd buffer > anyway, so no need to ask for additional MemSet() of all buffer bytes > to 0 prior to memcpy(). > I have changed palloc0 to palloc. Thanks Greg for reviewing & providing your comments. These changes are fixed in one of my earlier mails [1]. [1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
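Greg's point can be demonstrated in isolation. Copying sizeof(uint16) bytes out of a uint32 picks up the low-order bytes only on little-endian machines; on big-endian the first two bytes are the high-order (zero) bytes, so the stored length is garbage. The fix adopted, using uint32 consistently on both the store and load sides, round-trips on any byte order. This is a standalone sketch, not the patch's code:

```c
#include <stdint.h>
#include <string.h>

/*
 * Broken on big-endian: copies the first two bytes of a uint32's
 * representation, which are the high-order (zero) bytes there, so the
 * value read back equals len only on little-endian machines.
 */
uint16_t
store_len_broken(char *dest, uint32_t len)
{
    uint16_t out;

    memcpy(dest, (uint16_t *) &len, sizeof(uint16_t));
    memcpy(&out, dest, sizeof(uint16_t));
    return out;                 /* == len only on little-endian */
}

/*
 * Fixed: the same uint32 type is used for both store and load, so the
 * bytes written are exactly the bytes read back on any architecture,
 * and lengths above 65535 are representable too.
 */
uint32_t
store_len_fixed(char *dest, uint32_t len)
{
    uint32_t out;

    memcpy(dest, &len, sizeof(uint32_t));
    memcpy(&out, dest, sizeof(uint32_t));
    return out;                 /* == len everywhere */
}
```

Using memcpy() rather than a direct `*(uint32 *)ptr` store also sidesteps the unaligned-access concern Greg raised.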
Re: Parallel copy
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra wrote: > > Hello Vignesh, > > I've done some basic benchmarking on the v4 version of the patches (but > AFAIK the v5 should perform about the same), and some initial review. > > For the benchmarking, I used the lineitem table from TPC-H - for a 75GB > data set, this largest table is about 64GB once loaded, with another > 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and > NVMe storage. > > The COPY duration with varying number of workers (specified using the > parallel COPY option) looks like this:
> workers    duration
> -------------------
> 0              1366
> 1              1255
> 2               704
> 3               526
> 4               434
> 5               385
> 6               347
> 7               322
> 8               327
> So this seems to work pretty well - initially we get almost linear > speedup, then it slows down (likely due to contention for locks, I/O > etc.). Not bad. Thanks for testing with different workers & posting the results. > I've only done a quick review, but overall the patch looks in fairly > good shape. > > 1) I don't quite understand why we need INCREMENTPROCESSED and > RETURNPROCESSED, considering it just does ++ or return. It just > obfuscates the code, I think. > I have removed the macros. > 2) I find it somewhat strange that BeginParallelCopy can just decide not > to do parallel copy after all. Why not do this decision in the > caller? Or maybe it's fine this way, not sure. > I have moved the check IsParallelCopyAllowed to the caller. > 3) AFAIK we don't modify typedefs.list in patches, so these changes > should be removed. > I had seen that in many of the commits typedefs.list is getting changed; it also helps in running pgindent. So I'm retaining this change. > 4) IsTriggerFunctionParallelSafe actually checks all triggers, not just > one, so the comment needs minor rewording. > Modified the comments. Thanks for the comments & sharing the test results, Tomas. These changes are fixed in one of my earlier mails [1] that I sent. 
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila wrote: > > On Thu, Oct 8, 2020 at 12:14 AM vignesh C wrote: > > > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila > > wrote: > > > > > + */ > > > > > +typedef struct ParallelCopyLineBoundary > > > > > > > > > > Are we doing all this state management to avoid using locks while > > > > > processing lines? If so, I think we can use either spinlock or LWLock > > > > > to keep the main patch simple and then provide a later patch to make > > > > > it lock-less. This will allow us to first focus on the main design of > > > > > the patch rather than trying to make this datastructure processing > > > > > lock-less in the best possible way. > > > > > > > > > > > > > The steps will be more or less same if we use spinlock too. step 1, > > > > step 3 & step 4 will be common we have to use lock & unlock instead of > > > > step 2 & step 5. I feel we can retain the current implementation. > > > > > > > > > > I'll study this in detail and let you know my opinion on the same but > > > in the meantime, I don't follow one part of this comment: "If they > > > don't follow this order the worker might process wrong line_size and > > > leader might populate the information which worker has not yet > > > processed or in the process of processing." > > > > > > Do you want to say that leader might overwrite some information which > > > worker hasn't read yet? If so, it is not clear from the comment. > > > Another minor point about this comment: > > > > > > > Here leader and worker must follow these steps to avoid any corruption > > or hang issue. Changed it to: > > * The leader & worker process access the shared line information by > > following > > * the below steps to avoid any data corruption or hang: > > > > Actually, I wanted more on the lines why such corruption or hang can > happen? It might help reviewers to understand why you have followed > such a sequence. 
There are 3 variables that the leader & worker operate on: line_size, line_state & data. The leader first updates line_state, then populates data, and finally updates line_size & line_state. Workers wait for line_state to be updated; once it is updated, the worker reads the data based on line_size. If the two are not synchronized in this order, a wrong line_size might be read and the worker would consume the wrong amount of data, after which anything can happen. This is the usual concurrency case with readers/writers. I felt that much detail need not be mentioned. > > > > > > How did you ensure that this is fixed? Have you tested it, if so > > > please share the test? I see a basic problem with your fix. > > > > > > + /* Report WAL/buffer usage during parallel execution */ > > > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false); > > > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false); > > > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > > > + &walusage[ParallelWorkerNumber]); > > > > > > You need to call InstrStartParallelQuery() before the actual operation > > > starts, without that stats won't be accurate? Also, after calling > > > WaitForParallelWorkersToFinish(), you need to accumulate the stats > > > collected from workers which neither you have done nor is possible > > > with the current code in your patch because you haven't made any > > > provision to capture them in BeginParallelCopy. > > > > > > I suggest you look into lazy_parallel_vacuum_indexes() and > > > begin_parallel_vacuum() to understand how the buffer/wal usage stats > > > are accumulated. 
Also, please test this functionality using > > > pg_stat_statements. > > > > > > Made changes accordingly. > > I have verified it using: > > postgres=# select * from pg_stat_statements where query like '%copy%'; > > [wide pg_stat_statements output truncated in the archive; the full listing appears elsewhere in this thread]
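The publish/consume ordering Vignesh describes (leader fills data and line_size before flipping line_state; worker checks line_state before trusting line_size) can be sketched with C11 atomics. This is an illustrative stand-in, not the patch's code: the struct and function names are hypothetical, and the actual ParallelCopyLineBoundary uses its own state machine over shared memory rather than stdatomic.h.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define LINE_PENDING 0
#define LINE_READY   1

/* Hypothetical mirror of ParallelCopyLineBoundary, simplified. */
typedef struct LineBoundary
{
    _Atomic uint32_t line_state; /* LINE_PENDING or LINE_READY */
    uint32_t line_size;          /* valid only once line_state is LINE_READY */
    char     data[64];           /* line contents */
} LineBoundary;

/* Leader: populate data and line_size first, then publish via line_state. */
static void
leader_publish(LineBoundary *lb, const char *line, uint32_t len)
{
    memcpy(lb->data, line, len);
    lb->line_size = len;
    /* Release store: data/line_size become visible before the state flips. */
    atomic_store_explicit(&lb->line_state, LINE_READY, memory_order_release);
}

/* Worker: check line_state first; only then is line_size trustworthy. */
static int
worker_try_read(LineBoundary *lb, char *out, uint32_t *len)
{
    if (atomic_load_explicit(&lb->line_state, memory_order_acquire) != LINE_READY)
        return 0;                /* leader has not published this line yet */
    *len = lb->line_size;
    memcpy(out, lb->data, *len);
    return 1;
}
```

If either side skips the acquire/release pairing, the worker can observe LINE_READY while still seeing a stale line_size — exactly the "wrong line_size ... anything can happen" hazard described above.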
Re: Parallel copy
On Fri, Oct 9, 2020 at 10:42 AM Amit Kapila wrote: > > On Thu, Oct 8, 2020 at 12:14 AM vignesh C wrote: > > > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > > > > > > I am convinced by the reason given by Kyotaro-San in that another > > > thread [1] and performance data shown by Peter that this can't be an > > > independent improvement and rather in some cases it can do harm. Now, > > > if you need it for a parallel-copy path then we can change it > > > specifically to the parallel-copy code path but I don't understand > > > your reason completely. > > > > > > > Whenever we need data to be populated, we will get a new data block & > > pass it to CopyGetData to populate the data. In case of file copy, the > > server will completely fill the data block. We expect the data to be > > filled completely. If data is available it will completely load the > > complete data block in case of file copy. There is no scenario where > > even if data is present a partial data block will be returned except > > for EOF or no data available. But in case of STDIN data copy, even > > though there is 8K data available in data block & 8K data available in > > STDIN, CopyGetData will return as soon as libpq buffer data is more > > than the minread. We will pass new data block every time to load data. > > Every time we pass an 8K data block but CopyGetData loads a few bytes > > in the new data block & returns. I wanted to keep the same data > > population logic for both file copy & STDIN copy i.e copy full 8K data > > blocks & then the populated data can be required. There is an > > alternative solution I can have some special handling in case of STDIN > > wherein the existing data block can be passed with the index from > > where the data should be copied. Thoughts? > > > > What you are proposing as an alternative solution, isn't that what we > are doing without the patch? 
IIUC, you require this because of your > corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that > right? If so, what is the difficulty in making it behave similar to > the non-parallel case? > The alternate solution is similar to how existing copy handles STDIN copies, I have made changes in the v7 patch attached in [1] to have parallel copy handle STDIN data similar to non parallel copy, so the original comment on why this change is required has been removed from 001 patch: > > + if (cstate->copy_dest == COPY_NEW_FE) > > + minread = RAW_BUF_SIZE - nbytes; > > + > > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes, > > - 1, RAW_BUF_SIZE - nbytes); > > + minread, RAW_BUF_SIZE - nbytes); > > > > No comment to explain why this change is done? [1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
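To illustrate the behaviour under discussion: for COPY FROM STDIN, CopyGetData may return as soon as at least minread bytes are available, so a caller handing over a fresh 8K block routinely gets it back only partially filled. The following is a simplified stand-in with hypothetical names, not the patch's or PostgreSQL's actual code:

```c
#include <stddef.h>
#include <string.h>

/* Simulates a frontend connection with some bytes already buffered. */
typedef struct FakeFrontend
{
    const char *buf;    /* pending buffered data (libpq-style) */
    size_t      len;    /* bytes currently available */
} FakeFrontend;

/*
 * Like CopyGetData's contract for COPY_NEW_FE: copy until at least
 * 'minread' bytes are obtained (or the source runs dry), even though the
 * destination has room for 'maxread' bytes.
 */
static size_t
fake_copy_get_data(FakeFrontend *fe, char *dest, size_t minread, size_t maxread)
{
    size_t copied = 0;

    while (copied < minread && fe->len > 0)
    {
        /* Hand over whatever is buffered, capped by remaining dest space. */
        size_t chunk = fe->len < maxread - copied ? fe->len : maxread - copied;

        memcpy(dest + copied, fe->buf, chunk);
        fe->buf += chunk;
        fe->len -= chunk;
        copied += chunk;
    }
    return copied;   /* can be far below maxread when minread is small */
}
```

With a small minread (the non-parallel STDIN path), each call returns a few hundred bytes of a 200-300 byte tuple stream; passing a brand-new 8K block on every call, as the earlier parallel-copy version did, therefore leaves most of each block empty.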
Re: Parallel copy
I did performance testing on the v7 patch set[1] with custom postgresql.conf[2]. The results are triplets of the form (exec time in sec, number of workers, gain).

Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, binary file:
(1104.898, 0, 1X), (1112.221, 1, 1X), (640.236, 2, 1.72X), (335.090, 4, 3.3X), (200.492, 8, 5.51X), (131.448, 16, 8.4X), (121.832, 20, 9.1X), (124.287, 30, 8.9X)

Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, copy from stdin, csv format:
(1203.282, 0, 1X), (1135.517, 1, 1.06X), (655.140, 2, 1.84X), (343.688, 4, 3.5X), (203.742, 8, 5.9X), (144.793, 16, 8.31X), (133.339, 20, 9.02X), (136.672, 30, 8.8X)

Use case 3: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, text file:
(1165.991, 0, 1X), (1128.599, 1, 1.03X), (644.793, 2, 1.81X), (342.813, 4, 3.4X), (204.279, 8, 5.71X), (139.986, 16, 8.33X), (128.259, 20, 9.1X), (132.764, 30, 8.78X)

Above results are similar to the results with earlier versions of the patch set. On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila wrote: > > Sure, you need to change the code such that when force_parallel_mode = 'regress' is specified then it always uses one worker. This is primarily for testing purposes and will help during the development of this patch as it will make all existing Copy tests use quite a good portion of the parallel infrastructure. > I performed force_parallel_mode = regress testing and found 2 issues; the fixes for the same are available in the v7 patch set[1]. > > > Overall, we have below test cases to cover the code and for performance > > measurements. We plan to run these tests whenever a new set of patches is > > posted. > > > > 1. csv > > 2. binary > > Don't we need the tests for plain text files as well? > I added a text use case and the above mentioned are perf results on the v7 patch set[1]. > > > 3. force parallel mode = regress > > 4. toast data csv and binary > > 5. 
foreign key check, before row, after row, before statement, after > > statement, instead of triggers > > 6. partition case > > 7. foreign partitions and partitions having trigger cases > > 8. where clause having parallel unsafe and safe expression, default > > parallel unsafe and safe expression > > 9. temp, global, local, unlogged, inherited tables cases, foreign tables > > > > Sounds like good coverage. So, are you doing all this testing > manually? How are you maintaining these tests? > All test cases listed above, except for the cases that are meant to measure perf gain with huge data, are present in v7-0005 patch in v7 patch set[1]. [1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com [2] shared_buffers = 40GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = off checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy wrote: > > On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila wrote: > > > > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy > > wrote: > > > > > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila > > > wrote: > > > > > > > > From the testing perspective, > > > > 1. Test by having something force_parallel_mode = regress which means > > > > that all existing Copy tests in the regression will be executed via > > > > new worker code. You can have this as a test-only patch for now and > > > > make sure all existing tests passed with this. > > > > > > > > > > I don't think all the existing copy test cases(except the new test cases > > > added in the parallel copy patch set) would run inside the parallel > > > worker if force_parallel_mode is on. This is because, the parallelism > > > will be picked up for parallel copy only if parallel option is specified > > > unlike parallelism for select queries. > > > > > > > Sure, you need to change the code such that when force_parallel_mode = > > 'regress' is specified then it always uses one worker. This is > > primarily for testing purposes and will help during the development of > > this patch as it will make all exiting Copy tests to use quite a good > > portion of the parallel infrastructure. > > > > IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS > as default value in guc.c, > No need to set this as the default value. You can change it in postgresql.conf before running tests. > and then adjust the parallelism related > code in copy.c such that it always picks 1 worker and spawns it. This > way, all the existing copy test cases would be run in parallel worker. > Please let me know if this is okay. > Yeah, this sounds fine. > If yes, I will do this and update > here. > Okay, thanks, but ensure the difference in test execution before and after your change. After your change, all the 'copy' tests should invoke the worker to perform a copy. 
> > > > > All the above tests are performed on the latest v6 patch set (attached > > > here in this thread) with custom postgresql.conf[1]. The results are of > > > the triplet form (exec time in sec, number of workers, gain) > > > > > > > Okay, so I am assuming the performance is the same as we have seen > > with the earlier versions of patches. > > > > Yes. Most recent run on v5 patch set [1] > Okay, good to know that. -- With Regards, Amit Kapila.
Re: Parallel copy
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila wrote: > > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy > wrote: > > > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila wrote: > > > > > > From the testing perspective, > > > 1. Test by having something force_parallel_mode = regress which means > > > that all existing Copy tests in the regression will be executed via > > > new worker code. You can have this as a test-only patch for now and > > > make sure all existing tests passed with this. > > > > > > > I don't think all the existing copy test cases(except the new test cases > > added in the parallel copy patch set) would run inside the parallel worker > > if force_parallel_mode is on. This is because, the parallelism will be > > picked up for parallel copy only if parallel option is specified unlike > > parallelism for select queries. > > > > Sure, you need to change the code such that when force_parallel_mode = > 'regress' is specified then it always uses one worker. This is > primarily for testing purposes and will help during the development of > this patch as it will make all exiting Copy tests to use quite a good > portion of the parallel infrastructure. > IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS as default value in guc.c, and then adjust the parallelism related code in copy.c such that it always picks 1 worker and spawns it. This way, all the existing copy test cases would be run in parallel worker. Please let me know if this is okay. If yes, I will do this and update here. > > > All the above tests are performed on the latest v6 patch set (attached here > > in this thread) with custom postgresql.conf[1]. The results are of the > > triplet form (exec time in sec, number of workers, gain) > > > > Okay, so I am assuming the performance is the same as we have seen > with the earlier versions of patches. > Yes. 
Most recent run on v5 patch set [1] > > > Overall, we have below test cases to cover the code and for performance > > measurements. We plan to run these tests whenever a new set of patches is > > posted. > > > > 1. csv > > 2. binary > > Don't we need the tests for plain text files as well? > Will add one. > > > 3. force parallel mode = regress > > 4. toast data csv and binary > > 5. foreign key check, before row, after row, before statement, after > > statement, instead of triggers > > 6. partition case > > 7. foreign partitions and partitions having trigger cases > > 8. where clause having parallel unsafe and safe expression, default > > parallel unsafe and safe expression > > 9. temp, global, local, unlogged, inherited tables cases, foreign tables > > > > Sounds like good coverage. So, are you doing all this testing > manually? How are you maintaining these tests? > Yes, running them manually. Few of the tests(1,2,4) require huge datasets for performance measurements and other test cases are to ensure we don't choose parallelism. We will try to add test cases that are not meant for performance, to the patch test. [1] - https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy wrote: > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila wrote: > > > > From the testing perspective, > > 1. Test by having something force_parallel_mode = regress which means > > that all existing Copy tests in the regression will be executed via > > new worker code. You can have this as a test-only patch for now and > > make sure all existing tests passed with this. > > > > I don't think all the existing copy test cases(except the new test cases > added in the parallel copy patch set) would run inside the parallel worker if > force_parallel_mode is on. This is because, the parallelism will be picked up > for parallel copy only if parallel option is specified unlike parallelism for > select queries. > Sure, you need to change the code such that when force_parallel_mode = 'regress' is specified then it always uses one worker. This is primarily for testing purposes and will help during the development of this patch as it will make all exiting Copy tests to use quite a good portion of the parallel infrastructure. > > All the above tests are performed on the latest v6 patch set (attached here > in this thread) with custom postgresql.conf[1]. The results are of the > triplet form (exec time in sec, number of workers, gain) > Okay, so I am assuming the performance is the same as we have seen with the earlier versions of patches. > Overall, we have below test cases to cover the code and for performance > measurements. We plan to run these tests whenever a new set of patches is > posted. > > 1. csv > 2. binary Don't we need the tests for plain text files as well? > 3. force parallel mode = regress > 4. toast data csv and binary > 5. foreign key check, before row, after row, before statement, after > statement, instead of triggers > 6. partition case > 7. foreign partitions and partitions having trigger cases > 8. where clause having parallel unsafe and safe expression, default parallel > unsafe and safe expression > 9. 
temp, global, local, unlogged, inherited tables cases, foreign tables > Sounds like good coverage. So, are you doing all this testing manually? How are you maintaining these tests? -- With Regards, Amit Kapila.
Re: Parallel copy
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila wrote: > > > Looking a bit deeper into this, I'm wondering if in fact your > > EstimateStringSize() and EstimateNodeSize() functions should be using > > BUFFERALIGN() for EACH stored string/node (rather than just calling > > shm_toc_estimate_chunk() once at the end, after the length of packed > > strings and nodes has been estimated), to ensure alignment of start of > > each string/node. Other Postgres code appears to be aligning each > > stored chunk using shm_toc_estimate_chunk(). See the definition of > > that macro and its current usages. > > > > I am not sure if this required for the purpose of correctness. AFAIU, > we do store/estimate multiple parameters in same way at other places, > see EstimateParamListSpace and SerializeParamList. Do you have > something else in mind? > The point I was trying to make is that potentially more efficient code can be used if the individual strings/nodes are aligned, rather than packed (as they are now), but as you point out, there are already cases (e.g. SerializeParamList) where within the separately-aligned chunks the data is not aligned, so maybe not a big deal. Oh well, without alignment, that means use of memcpy() cannot really be avoided here for serializing/de-serializing ints etc., let's hope the compiler optimizes it as best it can. Regards, Greg Nancarrow Fujitsu Australia
Re: Parallel copy
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow wrote: > > On Thu, Oct 8, 2020 at 5:44 AM vignesh C wrote: > > > Attached v6 patch with the fixes. > > > > Hi Vignesh, > > I noticed a couple of issues when scanning the code in the following patch: > > v6-0003-Allow-copy-from-command-to-process-data-from-file.patch > > In the following code, it will put a junk uint16 value into *destptr > (and thus may well cause a crash) on a Big Endian architecture > (Solaris Sparc, s390x, etc.): > You're storing a (uint16) string length in a uint32 and then pulling > out the lower two bytes of the uint32 and copying them into the > location pointed to by destptr. > > > static void > +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr, > + uint32 *copiedsize) > +{ > + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0; > + > + memcpy(destptr, (uint16 *) &len, sizeof(uint16)); > + *copiedsize += sizeof(uint16); > + if (len) > + { > + memcpy(destptr + sizeof(uint16), srcPtr, len); > + *copiedsize += len; > + } > +} > > I suggest you change the code to: > > uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0; > memcpy(destptr, &len, sizeof(uint16)); > > [I assume string length here can't ever exceed (65535 - 1), right?] > Your suggestion makes sense to me if the assumption related to string length is correct. If we can't ensure that then we need to probably use four bytes uint32 to store the length. > Looking a bit deeper into this, I'm wondering if in fact your > EstimateStringSize() and EstimateNodeSize() functions should be using > BUFFERALIGN() for EACH stored string/node (rather than just calling > shm_toc_estimate_chunk() once at the end, after the length of packed > strings and nodes has been estimated), to ensure alignment of start of > each string/node. Other Postgres code appears to be aligning each > stored chunk using shm_toc_estimate_chunk(). See the definition of > that macro and its current usages. 
> I am not sure if this required for the purpose of correctness. AFAIU, we do store/estimate multiple parameters in same way at other places, see EstimateParamListSpace and SerializeParamList. Do you have something else in mind? While looking at the latest code, I observed below issue in patch v6-0003-Allow-copy-from-command-to-process-data-from-file: + /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */ + est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState)); + shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared); + shm_toc_estimate_keys(&pcxt->estimator, 1); + + strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr, + &rangeTableStr, &attnameListStr, + ¬nullListStr, &nullListStr, + &convertListStr); Here, do we need to separately estimate the size of SerializedParallelCopyState when it is also done in EstimateCstateSize? -- With Regards, Amit Kapila.
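For illustration, the difference between the packed layout the patch uses (one shm_toc_estimate_chunk() for everything) and the per-chunk BUFFERALIGN'd layout Greg mentions can be sketched as below. TYPEALIGN/BUFFERALIGN are reproduced from PostgreSQL's c.h, assuming the usual ALIGNOF_BUFFER of 32; the two estimate functions are hypothetical stand-ins for EstimateStringSize, not the patch's code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Alignment macros as in PostgreSQL's c.h (ALIGNOF_BUFFER assumed 32). */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define BUFFERALIGN(LEN) TYPEALIGN(32, (LEN))

/* Packed: length-prefixed strings back to back, no padding between them. */
static size_t
estimate_packed(const char **strs, int n)
{
    size_t total = 0;

    for (int i = 0; i < n; i++)
        total += sizeof(uint16_t) + (strs[i] ? strlen(strs[i]) + 1 : 0);
    return total;       /* caller would shm_toc_estimate_chunk() once */
}

/* Per-string aligned: each entry starts on a buffer-aligned offset. */
static size_t
estimate_aligned(const char **strs, int n)
{
    size_t total = 0;

    for (int i = 0; i < n; i++)
        total += BUFFERALIGN(sizeof(uint16_t) +
                             (strs[i] ? strlen(strs[i]) + 1 : 0));
    return total;
}
```

The aligned variant permits direct typed loads from each entry but pads every small string up to the alignment boundary, which is why packing plus memcpy() for the scalar fields (as SerializeParamList also does) is the pragmatic choice here.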
Re: Parallel copy
On Thu, Oct 8, 2020 at 12:14 AM vignesh C wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > + */ > > > > +typedef struct ParallelCopyLineBoundary > > > > > > > > Are we doing all this state management to avoid using locks while > > > > processing lines? If so, I think we can use either spinlock or LWLock > > > > to keep the main patch simple and then provide a later patch to make > > > > it lock-less. This will allow us to first focus on the main design of > > > > the patch rather than trying to make this datastructure processing > > > > lock-less in the best possible way. > > > > > > > > > > The steps will be more or less same if we use spinlock too. step 1, step > > > 3 & step 4 will be common we have to use lock & unlock instead of step 2 > > > & step 5. I feel we can retain the current implementation. > > > > > > > I'll study this in detail and let you know my opinion on the same but > > in the meantime, I don't follow one part of this comment: "If they > > don't follow this order the worker might process wrong line_size and > > leader might populate the information which worker has not yet > > processed or in the process of processing." > > > > Do you want to say that leader might overwrite some information which > > worker hasn't read yet? If so, it is not clear from the comment. > > Another minor point about this comment: > > > > Here leader and worker must follow these steps to avoid any corruption > or hang issue. Changed it to: > * The leader & worker process access the shared line information by following > * the below steps to avoid any data corruption or hang: > Actually, I wanted more on the lines why such corruption or hang can happen? It might help reviewers to understand why you have followed such a sequence. > > > > How did you ensure that this is fixed? Have you tested it, if so > > please share the test? I see a basic problem with your fix. 
> > > > + /* Report WAL/buffer usage during parallel execution */ > > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false); > > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false); > > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > > + &walusage[ParallelWorkerNumber]); > > > > You need to call InstrStartParallelQuery() before the actual operation > > starts, without that stats won't be accurate? Also, after calling > > WaitForParallelWorkersToFinish(), you need to accumulate the stats > > collected from workers which neither you have done nor is possible > > with the current code in your patch because you haven't made any > > provision to capture them in BeginParallelCopy. > > > > I suggest you look into lazy_parallel_vacuum_indexes() and > > begin_parallel_vacuum() to understand how the buffer/wal usage stats > > are accumulated. Also, please test this functionality using > > pg_stat_statements. > > > > Made changes accordingly. > I have verified it using: > postgres=# select * from pg_stat_statements where query like '%copy%';
> userid = 10, dbid = 13743, queryid = -6947756673093447609
> query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')
> plans = 0, total_plan_time = 0, min_plan_time = 0, max_plan_time = 0, mean_plan_time = 0, stddev_plan_time = 0
> calls = 1, total_exec_time = 265.195105, min_exec_time = 265.195105, max_exec_time = 265.195105, mean_exec_time = 265.195105, stddev_exec_time = 0
> rows = 175000
> shared_blks_hit = 1916, shared_blks_read = 0, shared_blks_dirtied = 946, shared_blks_written = 946
> local_blks_hit = 0, local_blks_read = 0, local_blks_dirtied = 0, local_blks_written = 0, temp_blks_read = 0, temp_blks_written = 0, blk_read_time = 0, blk_write_time = 0
> wal_records = 1116, wal_fpi = 0, wal_bytes = 3587203
> userid = 10, dbid = 13743, queryid = 8570215596364326047
> query = copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', paral
Re: Parallel copy
On Thu, Oct 8, 2020 at 12:14 AM vignesh C wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > > > I am convinced by the reason given by Kyotaro-San in that another > > thread [1] and performance data shown by Peter that this can't be an > > independent improvement and rather in some cases it can do harm. Now, > > if you need it for a parallel-copy path then we can change it > > specifically to the parallel-copy code path but I don't understand > > your reason completely. > > > > Whenever we need data to be populated, we will get a new data block & > pass it to CopyGetData to populate the data. In case of file copy, the > server will completely fill the data block. We expect the data to be > filled completely. If data is available it will completely load the > complete data block in case of file copy. There is no scenario where > even if data is present a partial data block will be returned except > for EOF or no data available. But in case of STDIN data copy, even > though there is 8K data available in data block & 8K data available in > STDIN, CopyGetData will return as soon as libpq buffer data is more > than the minread. We will pass new data block every time to load data. > Every time we pass an 8K data block but CopyGetData loads a few bytes > in the new data block & returns. I wanted to keep the same data > population logic for both file copy & STDIN copy i.e copy full 8K data > blocks & then the populated data can be required. There is an > alternative solution I can have some special handling in case of STDIN > wherein the existing data block can be passed with the index from > where the data should be copied. Thoughts? > What you are proposing as an alternative solution, isn't that what we are doing without the patch? IIUC, you require this because of your corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that right? If so, what is the difficulty in making it behave similar to the non-parallel case? 
-- With Regards, Amit Kapila.
Re: Parallel copy
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > Few additional comments: > > == > > Some more comments: > > v5-0002-Framework-for-leader-worker-in-parallel-copy > === > 1. > These values > + * help in handover of multiple records with significant size of data to be > + * processed by each of the workers to make sure there is no context > switch & the > + * work is fairly distributed among the workers. > > How about writing it as: "These values help in the handover of > multiple records with the significant size of data to be processed by > each of the workers. This also ensures there is no context switch and > the work is fairly distributed among the workers." Changed as suggested. > > 2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as > power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024, > and accordingly choose RINGSIZE. At many places, we do that way. I > think it can sometimes help in faster processing due to cache size > requirements and in this case, I don't see a reason why we can't > choose these values to be power-of-two. If you agree with this change > then also do some performance testing after this change? > Modified as suggested, Have checked few performance tests & verified there is no degradation. We will post a performance run of this separately in the coming days.. > 3. > + bool curr_blk_completed; > + char data[DATA_BLOCK_SIZE]; /* data read from file */ > + uint8 skip_bytes; > +} ParallelCopyDataBlock; > > Is there a reason to keep skip_bytes after data? Normally the variable > size data is at the end of the structure. Also, there is no comment > explaining the purpose of skip_bytes. > Modified as suggested and added comments. > 4. > + * Copy data block information. > + * ParallelCopyDataBlock's will be created in DSM. Data read from file will > be > + * copied in these DSM data blocks. 
The leader process identifies the records > + * and the record information will be shared to the workers. The workers will > + * insert the records into the table. There can be one or more number > of records > + * in each of the data block based on the record size. > + */ > +typedef struct ParallelCopyDataBlock > > Keep one empty line after the description line like below. I also > suggested to do a minor tweak in the above sentence which is as > follows: > > * Copy data block information. > * > * These data blocks are created in DSM. Data read ... > > Try to follow a similar format in other comments as well. > Modified as suggested. > 5. I think it is better to move parallelism related code to a new file > (we can name it as copyParallel.c or something like that). > Modified, added copyparallel.c file to include copy parallelism functionality & copyparallel.c file & some of the function prototype & data structure were moved to copy.h header file so that it can be shared between copy.c & copyparallel.c > 6. copy.c(1648,25): warning C4133: 'function': incompatible types - > from 'ParallelCopyLineState *' to 'uint32 *' > Getting above compilation warning on Windows. > Modified the data type. > v5-0003-Allow-copy-from-command-to-process-data-from-file > == > 1. > @@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate, > * only in text mode. > */ > initStringInfo(&cstate->attribute_buf); > - cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1); > + cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *) > palloc(RAW_BUF_SIZE + 1); > > Is there anyway IsParallelCopy can be true by this time? AFAICS, we do > anything about parallelism after this. If you want to save this > allocation then we need to move this after we determine that > parallelism can be used or not and accordingly the below code in the > patch needs to be changed. > > * ParallelCopyFrom - parallel copy leader's functionality. 
> * > * Leader executes the before statement for before statement trigger, if > before > @@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate) > ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info; > ereport(DEBUG1, (errmsg("Running parallel copy leader"))); > > + /* raw_buf is not used in parallel copy, instead data blocks are used.*/ > + pfree(cstate->raw_buf); > + cstate->raw_buf = NULL; > Removed the palloc change, raw_buf will be allocated both for parallel and non parallel copy. One other solution that I thought was to move the memory allocation to CopyFrom, but this solution might affect fdw where they use BeginCopyFrom, NextCopyFrom & EndCopyFrom. So I have kept the allocation as in BeginCopyFrom & freeing for parallel copy in ParallelCopyFrom. > Is there anything else also the allocation of which depends on parallelism? > I felt this is the only allocated memory that sequential copy requires and which is not required in parallel copy. > 2. > +static
Re: Parallel copy
On Mon, Sep 28, 2020 at 6:37 PM Ashutosh Sharma wrote: > > On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila wrote: > > > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C wrote: > > > > > > Thanks Ashutosh for your comments. > > > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma > > > wrote: > > > > > > > > Hi Vignesh, > > > > > > > > I've spent some time today looking at your new set of patches and I've > > > > some thoughts and queries which I would like to put here: > > > > > > > > Why are these not part of the shared cstate structure? > > > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, > > > > cstate->null_print); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > > types from the leader to worker process. The above data types are of > > > char* data type, I will not be able to use it like how I could do it > > > for integer type. So I preferred to send these as separate keys to the > > > worker. Thoughts? > > > > > > > I think the way you have written will work but if we go with > > Ashutosh's proposal it will look elegant and in the future, if we need > > to share more strings as part of cstate structure then that would be > > easier. You can probably refer to EstimateParamListSpace, > > SerializeParamList, and RestoreParamList to see how we can share > > different types of data in one key. > > > > Yeah. And in addition to that it will also reduce the number of DSM > keys that we need to maintain. > Thanks Ashutosh, This is handled as part of the v6 patch set. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Oct 8, 2020 at 5:44 AM vignesh C wrote:
> Attached v6 patch with the fixes.
>

Hi Vignesh,

I noticed a couple of issues when scanning the code in the following patch:

v6-0003-Allow-copy-from-command-to-process-data-from-file.patch

In the following code, it will put a junk uint16 value into *destptr (and
thus may well cause a crash) on a Big Endian architecture (Solaris Sparc,
s390x, etc.): you're storing a (uint16) string length in a uint32 and then
pulling out the lower two bytes of the uint32 and copying them into the
location pointed to by destptr.

+static void
+CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
+       uint32 *copiedsize)
+{
+ uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
+
+ memcpy(destptr, (uint16 *) &len, sizeof(uint16));
+ *copiedsize += sizeof(uint16);
+ if (len)
+ {
+  memcpy(destptr + sizeof(uint16), srcPtr, len);
+  *copiedsize += len;
+ }
+}

I suggest you change the code to:

 uint16 len = srcPtr ? (uint16) strlen(srcPtr) + 1 : 0;
 memcpy(destptr, &len, sizeof(uint16));

[I assume the string length here can't ever exceed (65535 - 1), right?]

Looking a bit deeper into this, I'm wondering if in fact your
EstimateStringSize() and EstimateNodeSize() functions should be using
BUFFERALIGN() for EACH stored string/node (rather than just calling
shm_toc_estimate_chunk() once at the end, after the length of packed
strings and nodes has been estimated), to ensure alignment of the start of
each string/node. Other Postgres code appears to be aligning each stored
chunk using shm_toc_estimate_chunk(). See the definition of that macro and
its current usages.

Then you could safely use:

 uint16 len = srcPtr ? (uint16) strlen(srcPtr) + 1 : 0;
 *(uint16 *) destptr = len;
 *copiedsize += sizeof(uint16);
 if (len)
 {
  memcpy(destptr + sizeof(uint16), srcPtr, len);
  *copiedsize += len;
 }

and in the CopyStringFromSharedMemory() function, you could then safely
use:

 len = *(uint16 *) srcPtr;

The compiler may be smart enough to optimize away the memcpy() in this
case anyway, but there are issues in doing this for architectures that
take a performance hit for unaligned access, or don't support unaligned
access.

Also, in the CopyFromSharedMemory() functions, you should use palloc()
instead of palloc0(), as you're filling the entire palloc'd buffer anyway,
so there is no need to ask for an additional MemSet() of all buffer bytes
to 0 prior to memcpy().

Regards,
Greg Nancarrow
Fujitsu Australia
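A minimal, self-contained sketch of the endian-safe pattern Greg is suggesting. The helper names pack_string/unpack_string and the use of malloc() are stand-ins for illustration only (the patch's actual helpers are CopyStringToSharedMemory()/CopyStringFromSharedMemory() and use palloc()): the length is computed into a real uint16 before the memcpy(), so the two bytes copied are the same on little- and big-endian machines, and the reader memcpy()s the length back out rather than dereferencing a possibly unaligned pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-ins for the patch's string (de)serialization helpers.
 * A NULL string is encoded as length 0; otherwise the stored length
 * includes the terminating NUL.
 */
static void
pack_string(const char *src, char *dest, uint32_t *copiedsize)
{
    /* Compute into a real uint16 so the memcpy below is endian-safe. */
    uint16_t len = src ? (uint16_t) (strlen(src) + 1) : 0;

    memcpy(dest, &len, sizeof(uint16_t));
    *copiedsize += sizeof(uint16_t);
    if (len)
    {
        memcpy(dest + sizeof(uint16_t), src, len);
        *copiedsize += len;
    }
}

static char *
unpack_string(const char *src, uint32_t *copiedsize)
{
    uint16_t len;

    /* memcpy instead of *(uint16 *) src: safe even if src is unaligned. */
    memcpy(&len, src, sizeof(uint16_t));
    *copiedsize += sizeof(uint16_t);
    if (len == 0)
        return NULL;

    char *result = malloc(len);     /* palloc() in PostgreSQL code */
    memcpy(result, src + sizeof(uint16_t), len);
    *copiedsize += len;
    return result;
}
```

The direct `*(uint16 *) destptr = len` form Greg shows above is only safe once each chunk is BUFFERALIGN()ed, which is why the memcpy() variant is the conservative default.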
Re: Parallel copy
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila wrote: > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C wrote: > > > > Thanks Ashutosh for your comments. > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma > > wrote: > > > > > > Hi Vignesh, > > > > > > I've spent some time today looking at your new set of patches and I've > > > some thoughts and queries which I would like to put here: > > > > > > Why are these not part of the shared cstate structure? > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, > > > cstate->null_print); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > types from the leader to worker process. The above data types are of > > char* data type, I will not be able to use it like how I could do it > > for integer type. So I preferred to send these as separate keys to the > > worker. Thoughts? > > > > I think the way you have written will work but if we go with > Ashutosh's proposal it will look elegant and in the future, if we need > to share more strings as part of cstate structure then that would be > easier. You can probably refer to EstimateParamListSpace, > SerializeParamList, and RestoreParamList to see how we can share > different types of data in one key. > Thanks for the solution Amit, I have fixed this and handled it in the v6 patch shared in my previous mail. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow wrote: > > Hi Vignesh and Bharath, > > Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as > parallel-unsafe. > Can you explain why this is? Yes we don't need to restrict parallelism for RI_TRIGGER_PK cases as we don't do any command counter increments while performing PK checks as opposed to RI_TRIGGER_FK/foreign key checks. We have modified this in the v6 patch set. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAICS the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for the
> 75GB data set, this largest table is about 64GB once loaded, with
> another 54GB in 5 indexes. This is on a server with 32 cores, 64GB of
> RAM and NVMe storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>   workers    duration
>   -------------------
>         0        1366
>         1        1255
>         2         704
>         3         526
>         4         434
>         5         385
>         6         347
>         7         322
>         8         327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
>

+1. These numbers (> 4x speedup) look good to me.

-- 
With Regards,
Amit Kapila.
Re: Parallel copy
Hello Vignesh,

I've done some basic benchmarking on the v4 version of the patches (but
AFAICS the v5 should perform about the same), and some initial review.

For the benchmarking, I used the lineitem table from TPC-H - for the 75GB
data set, this largest table is about 64GB once loaded, with another 54GB
in 5 indexes. This is on a server with 32 cores, 64GB of RAM and NVMe
storage.

The COPY duration with varying number of workers (specified using the
parallel COPY option) looks like this:

  workers    duration
  -------------------
        0        1366
        1        1255
        2         704
        3         526
        4         434
        5         385
        6         347
        7         322
        8         327

So this seems to work pretty well - initially we get almost linear
speedup, then it slows down (likely due to contention for locks, I/O
etc.). Not bad.

I've only done a quick review, but overall the patch looks in fairly good
shape.

1) I don't quite understand why we need INCREMENTPROCESSED and
RETURNPROCESSED, considering they just do ++ or return. It just obfuscates
the code, I think.

2) I find it somewhat strange that BeginParallelCopy can just decide not
to do parallel copy after all. Why not make this decision in the caller?
Or maybe it's fine this way, not sure.

3) AFAIK we don't modify typedefs.list in patches, so these changes should
be removed.

4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
one, so the comment needs minor rewording.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel copy
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow wrote: > > Hi Vignesh and Bharath, > > Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as > parallel-unsafe. > Can you explain why this is? > I don't think we need to restrict this case and even if there is some reason to do so then probably the same should be mentioned in the comments. -- With Regards, Amit Kapila.
Re: Parallel copy
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote: > > > > Few additional comments: > > == > > Some more comments: > Thanks Amit for the comments, I will work on the comments and provide a patch in the next few days. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila wrote:
>
> Few additional comments:
> ==

Some more comments:

v5-0002-Framework-for-leader-worker-in-parallel-copy
===
1.
These values
+ * help in handover of multiple records with significant size of data to be
+ * processed by each of the workers to make sure there is no context switch & the
+ * work is fairly distributed among the workers.

How about writing it as: "These values help in the handover of multiple
records with the significant size of data to be processed by each of the
workers. This also ensures there is no context switch and the work is
fairly distributed among the workers."

2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024, and
accordingly choose RINGSIZE. At many places, we do that way. I think it
can sometimes help in faster processing due to cache size requirements,
and in this case, I don't see a reason why we can't choose these values to
be power-of-two. If you agree with this change then also do some
performance testing after this change?

3.
+ bool curr_blk_completed;
+ char data[DATA_BLOCK_SIZE]; /* data read from file */
+ uint8 skip_bytes;
+} ParallelCopyDataBlock;

Is there a reason to keep skip_bytes after data? Normally the variable
size data is at the end of the structure. Also, there is no comment
explaining the purpose of skip_bytes.

4.
+ * Copy data block information.
+ * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
+ * copied in these DSM data blocks. The leader process identifies the records
+ * and the record information will be shared to the workers. The workers will
+ * insert the records into the table. There can be one or more number of records
+ * in each of the data block based on the record size.
+ */
+typedef struct ParallelCopyDataBlock

Keep one empty line after the description line like below. I also suggest
a minor tweak in the above sentence, as follows:

 * Copy data block information.
 *
 * These data blocks are created in DSM. Data read ...

Try to follow a similar format in other comments as well.

5. I think it is better to move parallelism related code to a new file (we
can name it as copyParallel.c or something like that).

6.
copy.c(1648,25): warning C4133: 'function': incompatible types - from
'ParallelCopyLineState *' to 'uint32 *'

Getting the above compilation warning on Windows.

v5-0003-Allow-copy-from-command-to-process-data-from-file
===
1.
@@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
  * only in text mode.
  */
  initStringInfo(&cstate->attribute_buf);
- cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
+ cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *) palloc(RAW_BUF_SIZE + 1);

Is there any way IsParallelCopy can be true by this time? AFAICS, we don't
do anything about parallelism until after this. If you want to save this
allocation then we need to move it after we determine whether parallelism
can be used or not, and accordingly the below code in the patch needs to
be changed.

 * ParallelCopyFrom - parallel copy leader's functionality.
 *
 * Leader executes the before statement for before statement trigger, if before
@@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
  ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
  ereport(DEBUG1, (errmsg("Running parallel copy leader")));

+ /* raw_buf is not used in parallel copy, instead data blocks are used. */
+ pfree(cstate->raw_buf);
+ cstate->raw_buf = NULL;

Is there anything else the allocation of which depends on parallelism?

2.
+static pg_attribute_always_inline bool
+IsParallelCopyAllowed(CopyState cstate)
+{
+ /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
+ if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
+  return false;
+
+ /* Check if copy is into foreign table or temporary table. */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+  RelationUsesLocalBuffers(cstate->rel))
+  return false;
+
+ /* Check if trigger function is parallel safe. */
+ if (cstate->rel->trigdesc != NULL &&
+  !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
+  return false;
+
+ /*
+  * Check if there is after statement or instead of trigger or transition
+  * table triggers.
+  */
+ if (cstate->rel->trigdesc != NULL &&
+  (cstate->rel->trigdesc->trig_insert_after_statement ||
+  cstate->rel->trigdesc->trig_insert_instead_row ||
+  cstate->rel->trigdesc->trig_insert_new_table))
+  return false;
+
+ /* Check if the volatile expressions are parallel safe, if present any. */
+ if (!CheckExprParallelSafety(cstate))
+  return false;
+
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) == CIM_SINGLE)
+  return false;
+
+ return true;
+}

In the comments, we should write why parallelism is not allowed for a
particular case.
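To illustrate the power-of-two suggestion in comment 2 above: with a power-of-two ring size, wrapping a monotonically increasing position into a slot index needs only a bitwise AND with (size - 1), while an arbitrary size forces a real division. This is a hedged sketch; the constants are illustrative stand-ins, not the patch's actual values.

```c
#include <assert.h>
#include <stdint.h>

#define RINGSIZE       (10 * 1024)   /* illustrative non-power-of-two size */
#define RINGSIZE_POW2  (16 * 1024)   /* illustrative power-of-two alternative */

/*
 * With a power-of-two ring size, the modulo needed to wrap a
 * monotonically increasing position into a ring slot reduces to a
 * single AND with (size - 1).
 */
static inline uint32_t
ring_slot(uint64_t pos)
{
    return (uint32_t) (pos & (RINGSIZE_POW2 - 1));
}

/* A non-power-of-two size forces a genuine integer division. */
static inline uint32_t
ring_slot_div(uint64_t pos)
{
    return (uint32_t) (pos % RINGSIZE);
}
```

The same masking trick would apply to WORKER_CHUNK_COUNT and MAX_BLOCKS_COUNT if those are also kept power-of-two, which is presumably part of why many existing ring buffers in the tree choose such sizes.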
Re: Parallel copy
Hi Vignesh and Bharath, Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as parallel-unsafe. Can you explain why this is? Regards, Greg Nancarrow Fujitsu Australia
Re: Parallel copy
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila wrote: > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C wrote: > > > > Thanks Ashutosh for your comments. > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma > > wrote: > > > > > > Hi Vignesh, > > > > > > I've spent some time today looking at your new set of patches and I've > > > some thoughts and queries which I would like to put here: > > > > > > Why are these not part of the shared cstate structure? > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, > > > cstate->null_print); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > types from the leader to worker process. The above data types are of > > char* data type, I will not be able to use it like how I could do it > > for integer type. So I preferred to send these as separate keys to the > > worker. Thoughts? > > > > I think the way you have written will work but if we go with > Ashutosh's proposal it will look elegant and in the future, if we need > to share more strings as part of cstate structure then that would be > easier. You can probably refer to EstimateParamListSpace, > SerializeParamList, and RestoreParamList to see how we can share > different types of data in one key. > Yeah. And in addition to that it will also reduce the number of DSM keys that we need to maintain. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Parallel copy
On Tue, Sep 22, 2020 at 2:44 PM vignesh C wrote: > > Thanks Ashutosh for your comments. > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma wrote: > > > > Hi Vignesh, > > > > I've spent some time today looking at your new set of patches and I've > > some thoughts and queries which I would like to put here: > > > > Why are these not part of the shared cstate structure? > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print); > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > I have used shared_cstate mainly to share the integer & bool data > types from the leader to worker process. The above data types are of > char* data type, I will not be able to use it like how I could do it > for integer type. So I preferred to send these as separate keys to the > worker. Thoughts? > I think the way you have written will work but if we go with Ashutosh's proposal it will look elegant and in the future, if we need to share more strings as part of cstate structure then that would be easier. You can probably refer to EstimateParamListSpace, SerializeParamList, and RestoreParamList to see how we can share different types of data in one key. -- With Regards, Amit Kapila.
Re: Parallel copy
On Wed, Jul 22, 2020 at 7:48 PM vignesh C wrote:
>
> On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila wrote:
> >
> > Review comments:
> > ===
> >
> > 0001-Copy-code-readjustment-to-support-parallel-copy
> > 1.
> > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> >   else
> >    nbytes = 0; /* no data need be saved */
> >
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > +  minread = RAW_BUF_SIZE - nbytes;
> > +
> >   inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > -        1, RAW_BUF_SIZE - nbytes);
> > +        minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
> >
> > 0002-Framework-for-leader-worker-in-parallel-copy
>
> Currently CopyGetData copies a lesser amount of data to the buffer even
> though space is available in the buffer, because minread was passed as 1
> to CopyGetData. Because of this there are frequent calls to CopyGetData
> for fetching the data. In this case it will load only some data due to
> the below check:
>
> while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
>
> After reading some data bytesread will be greater than minread (which is
> passed as 1) and it returns with a lesser amount of data, even though
> there is some space. This change is required for the parallel copy
> feature, as each time we get a new DSM data block which is of 64K size
> and copy the data. If we copy less data into DSM data blocks we might
> end up consuming all the DSM data blocks.
> Why can't we reuse the DSM block which has unfilled space?
> I felt this issue can be fixed as part of HEAD. Have posted a separate
> thread [1] for this. I'm planning to remove that change once it gets
> committed. Can that go as a separate patch or should we include it here?
> [1] -
> https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
>

I am convinced by the reason given by Kyotaro-San in that other thread [1]
and the performance data shown by Peter that this can't be an independent
improvement, and in some cases it can even do harm. Now, if you need it
for the parallel-copy path then we can change it specifically in the
parallel-copy code path, but I don't understand your reason completely.

> > 2. ..
> > + */
> > +typedef struct ParallelCopyLineBoundary
> >
> > Are we doing all this state management to avoid using locks while
> > processing lines? If so, I think we can use either spinlock or LWLock
> > to keep the main patch simple and then provide a later patch to make
> > it lock-less. This will allow us to first focus on the main design of
> > the patch rather than trying to make this data structure processing
> > lock-less in the best possible way.
> >
>
> The steps will be more or less the same if we use a spinlock too; steps
> 1, 3 & 4 will be common, we just have to use lock & unlock instead of
> steps 2 & 5. I feel we can retain the current implementation.
>

I'll study this in detail and let you know my opinion on the same, but in
the meantime, I don't follow one part of this comment: "If they don't
follow this order the worker might process wrong line_size and leader
might populate the information which worker has not yet processed or in
the process of processing." Do you want to say that the leader might
overwrite some information which the worker hasn't read yet? If so, it is
not clear from the comment.

Another minor point about this comment:

+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset & the size of

I think there should be a full stop after "worker" instead of a comma.

> > 6.
> > In function BeginParallelCopy(), you need to keep a provision to
> > collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> > reference. Those will be required for pg_stat_statements.
> >
>
> Fixed
>

How did you ensure that this is fixed? Have you tested it, and if so,
please share the test. I see a basic problem with your fix.

+ /* Report WAL/buffer usage during parallel execution */
+ bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
+ walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+        &walusage[ParallelWorkerNumber]);

You need to call InstrStartParallelQuery() before the actual operation
starts; without that the stats won't be accurate. Also, after calling
WaitForParallelWorkersToFinish(), you need to accumulate the stats
collected from the workers, which you have neither done nor is it possible
with the current code in your patch, because you haven't made any
provision to capture them in BeginParallelCopy. I suggest you look into
lazy_parallel_vacuum_indexes() and begin_parallel_vacuum() to understand
how the buffer/wal usage stats are accumulated. Also, please test this
functionality using pg_stat_statements.

> > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > 10.
> > In the commit message, you have writt
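The ordering concern being discussed around ParallelCopyLineBoundary (the leader must not publish a line before the slot contents are written, and must not overwrite a slot the worker is still reading) is essentially a single-producer/single-consumer handoff. Below is a hedged sketch using C11 atomics; it is not the patch's code (which uses pg_atomic_* and its own state machine), and the field names and the UINT32_MAX "unprocessed" marker are assumptions for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define LINE_UNPROCESSED ((uint32_t) UINT32_MAX)  /* assumed free-slot marker */

typedef struct LineSlot
{
    uint32_t         offset;     /* start of line in the data block */
    _Atomic uint32_t line_size;  /* LINE_UNPROCESSED until published */
} LineSlot;

/*
 * Leader: fill the slot, then publish line_size with release ordering so
 * that a worker observing the size also observes the offset write.
 */
static void
publish_line(LineSlot *slot, uint32_t offset, uint32_t size)
{
    slot->offset = offset;
    atomic_store_explicit(&slot->line_size, size, memory_order_release);
}

/*
 * Worker: acquire-load the size; if a line has been published, consume it
 * and only afterwards mark the slot free, so the leader never overwrites
 * a slot the worker is still reading.  Returns 1 if a line was consumed.
 */
static int
try_consume_line(LineSlot *slot, uint32_t *offset, uint32_t *size)
{
    uint32_t sz = atomic_load_explicit(&slot->line_size, memory_order_acquire);

    if (sz == LINE_UNPROCESSED)
        return 0;               /* nothing published yet */
    *offset = slot->offset;
    *size = sz;
    atomic_store_explicit(&slot->line_size, LINE_UNPROCESSED,
                          memory_order_release);
    return 1;
}
```

If either side reorders its store before its read/write of the payload, exactly the failure Amit describes can occur: the worker processes a stale line_size, or the leader repopulates a slot still being processed. A spinlock version would collapse the publish/consume pairs into lock-protected critical sections with the same step ordering.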
Re: Parallel copy
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy wrote: > > > > > > Have you tested your patch when encoding conversion is needed? If so, > > > could you please point out the email that has the test results. > > > > > > > We have not yet done encoding testing, we will do and post the results > > separately in the coming days. > > > > Hi Ashutosh, > > I ran the tests ensuring pg_server_to_any() gets called from copy.c. I > specified the encoding option of COPY command, with client and server > encodings being UTF-8. > Thanks Bharath for the testing. The results look impressive. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: Parallel copy
> >
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>

Hi Ashutosh,

I ran the tests ensuring pg_server_to_any() gets called from copy.c. I
specified the encoding option of COPY command, with client and server
encodings being UTF-8. Tests are performed with custom postgresql.conf[1],
10 million rows, 5.2GB data. The results are of the triplet form
(exec time in sec, number of workers, gain):

Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X),
(341.284, 4, 3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X),
(129.843, 20, 9.04X), (134.511, 30, 8.72X)

Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X),
(263.575, 4, 3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X),
(178.704, 20, 4.54X), (199.402, 30, 4.06X)

Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X),
(114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X),
(100.749, 20, 2.19X), (100.656, 30, 2.19X)

The results are similar to our earlier runs[2].

[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[2] https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
Thanks Greg for the testing.

On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to
> > ensure the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>

From the logs, I see that it is happening only with the default
postgresql.conf, and there's inconsistency in table insertion times,
especially from the 1st run to the 2nd run. Also, the table insertion time
variation is more. This is expected with the default postgresql.conf
because of background process interference. That's the reason we usually
run with a custom configuration to correctly measure the performance gain.

br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG: totaltableinsertiontime = 155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG: totaltableinsertiontime = 42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG: totaltableinsertiontime = 29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG: totaltableinsertiontime = 38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG: totaltableinsertiontime = 27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG: totaltableinsertiontime = 28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG: totaltableinsertiontime = 30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG: totaltableinsertiontime = 24624.045 ms

br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG: totaltableinsertiontime = 192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG: totaltableinsertiontime = 30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG: totaltableinsertiontime = 31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG: totaltableinsertiontime = 31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG: totaltableinsertiontime = 26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG: totaltableinsertiontime = 66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG: totaltableinsertiontime = 38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG: totaltableinsertiontime = 28286.634 ms

br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG: totaltableinsertiontime = 20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG: totaltableinsertiontime = 23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG: totaltableinsertiontime = 19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG: totaltableinsertiontime = 21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG: totaltableinsertiontime = 23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG: totaltableinsertiontime = 21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG: totaltableinsertiontime = 25254.563 ms

br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG: totaltableinsertiontime = 16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG: totaltableinsertiontime = 15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG: totaltableinsertiontime = 16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG: totaltableinsertiontime = 15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG: totaltableinsertiontime = 16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG: totaltableinsertiontime = 15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG: totaltableinsertiontime = 17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG: totaltableinsertiontime = 15352.753 ms

>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>

Good to know :)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
Hi Bharath, > Few things to follow before testing: > 1. Is the table being dropped/truncated after the test with 0 workers and > before running with 1 worker? If not, then the index insertion time would > increase.[1](for me it is increasing by 10 sec). This is obvious because the > 1st time index will be created from bottom up manner(from leaves to root), > but for the 2nd time it has to search and insert at the proper leaves and > inner B+Tree nodes. Yes, it' being truncated before running each and every COPY. > 2. If possible, can you also run with custom postgresql.conf settings[2] > along with default? Just to ensure that other bg processes such as > checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For > instance, with default postgresql.conf file, it looks like checkpointing[3] > is happening frequently, could you please let us know if that happens at your > end? Yes, have run with default and your custom settings. With default settings, I can confirm that checkpointing is happening frequently with the tests I've run here. > 3. Could you please run the test case 3 times at least? Just to ensure the > consistency of the issue. Yes, have run 4 times. Seems to be a performance hit (whether normal copy or parallel-1 copy) on the first COPY run on a freshly created database. After that, results are consistent. > 4. I ran the tests in a performance test system where no other user > processes(except system processes) are running. Is it possible for you to do > the same? > > Please capture and share the timing logs with us. > Yes, I have ensured the system is as idle as possible prior to testing. I have attached the test results obtained after building with your Parallel Copy patch and testing patch applied (HEAD at 733fa9aa51c526582f100aa0d375e0eb9a6bce8b). Test results show that Parallel COPY with 1 worker is performing better than normal COPY in the test scenarios run. 
There is a performance hit (regardless of COPY type) on the very first COPY run on a freshly-created database. I ran the test case 4 times. and also in reverse order, with truncate run before each COPY (output and logs named _0_1 run normal COPY then parallel COPY, and named _1_0 run parallel COPY and then normal COPY). Please refer to attached results. Regards, Greg testing_patch_results.tar.gz Description: application/gzip
Re: Parallel copy
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > > On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow wrote: > > > > Fortunately I have been given permission to share the exact table > > definition and data I used, so you can check the behaviour and timings > > on your own test machine. > > > > Thanks Greg for the script. I ran your test case and I didn't observe > any increase in exec time with 1 worker, indeed, we have benefitted a > few seconds from 0 to 1 worker as expected. > > Execution time is in seconds. Each test case is executed 3 times on > release build. Each time the data directory is recreated. > > Case 1: 100 rows, 2GB > With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423 > With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678 > > With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822 > With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573 > > Case 2: 255 rows, 5GB > With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683 > With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307 > > With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049 > With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090 > Hi Greg, If you still observe the issue in your testing environment, I'm attaching a testing patch(that applies on top of the latest parallel copy patch set i.e. v5 1 to 6) to capture various timings such as total copy time in leader and worker, index and table insertion time, leader and worker waiting time. These logs are shown in the server log file. Few things to follow before testing: 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase.[1](for me it is increasing by 10 sec). 
This is obvious because the 1st time index will be created from bottom up manner(from leaves to root), but for the 2nd time it has to search and insert at the proper leaves and inner B+Tree nodes. 2. If possible, can you also run with custom postgresql.conf settings[2] along with default? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For instance, with default postgresql.conf file, it looks like checkpointing[3] is happening frequently, could you please let us know if that happens at your end? 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue. 4. I ran the tests in a performance test system where no other user processes(except system processes) are running. Is it possible for you to do the same? Please capture and share the timing logs with us. Here's a snapshot of how the added timings show up in the logs: ( I captured this with your test case case 1: 100 rows, 2GB, custom postgresql.conf file settings[2]). 
with 0 workers:
2020-09-22 10:49:27.508 BST [163910] LOG: totaltableinsertiontime = 24072.034 ms
2020-09-22 10:49:27.508 BST [163910] LOG: totalindexinsertiontime = 60.682 ms
2020-09-22 10:49:27.508 BST [163910] LOG: totalcopytime = 59664.594 ms

with 1 worker:
2020-09-22 10:53:58.409 BST [163947] LOG: totalcopyworkerwaitingtime = 59.815 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totaltableinsertiontime = 23585.881 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totalindexinsertiontime = 30.946 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totalcopytimeworker = 47047.956 ms
2020-09-22 10:53:58.429 BST [163946] LOG: totalcopyleaderwaitingtime = 26746.744 ms
2020-09-22 10:53:58.429 BST [163946] LOG: totalcopytime = 47150.002 ms

[1] 0 worker:
LOG: totaltableinsertiontime = 25491.881 ms
LOG: totalindexinsertiontime = 14136.104 ms
LOG: totalcopytime = 75606.858 ms
(table is not dropped and so are indexes)
1 worker:
LOG: totalcopyworkerwaitingtime = 64.582 ms
LOG: totaltableinsertiontime = 21360.875 ms
LOG: totalindexinsertiontime = 24843.570 ms
LOG: totalcopytimeworker = 69837.162 ms
LOG: totalcopyleaderwaitingtime = 49548.441 ms
LOG: totalcopytime = 69997.778 ms

[2] custom postgresql.conf configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[3] LOG: checkpoints are occurring too frequently (14 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com

v1-0001-Parallel-Copy-Exec-Time-Capture.patch Description: Binary data
Re: Parallel copy
On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow wrote: > > Fortunately I have been given permission to share the exact table > definition and data I used, so you can check the behaviour and timings > on your own test machine. > Thanks Greg for the script. I ran your test case and I didn't observe any increase in exec time with 1 worker; indeed, we benefitted by a few seconds from 0 to 1 worker, as expected.

Execution time is in seconds. Each test case is executed 3 times on a release build. Each time the data directory is recreated.

Case 1: 100 rows, 2GB
With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678
With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573

Case 2: 255 rows, 5GB
With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307
With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090

[1] - Custom configuration is set up to ensure that no other processes influence the results. The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
Hi Vignesh, I've spent some time today looking at your new set of patches and I've some thoughts and queries which I would like to put here: Why are these not part of the shared cstate structure? SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print); SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); I think in the refactoring patch we could replace all the cstate variables that would be shared between the leader and workers with a common structure which would be used even for a serial copy. Thoughts? -- Have you tested your patch when encoding conversion is needed? If so, could you please point out the email that has the test results. -- Apart from above, I've noticed some cosmetic errors which I am sharing here: +#defineIsParallelCopy()(cstate->is_parallel) +#define IsLeader() (cstate->pcdata->is_leader) This doesn't look to be properly aligned. -- + shared_info_ptr = (ParallelCopyShmInfo *) shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo)); + PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id); .. + /* Store shared build state, for which we reserved space. */ + shared_cstate = (SerializedParallelCopyState *)shm_toc_allocate(pcxt->toc, est_cstateshared); In the first case, while typecasting you've added a space between the typename and the function but that is missing in the second case. I think it would be good if you could make it consistent. Same comment applies here as well: + pg_atomic_uint32line_state; /* line state */ + uint64 cur_lineno; /* line number for error messages */ +}ParallelCopyLineBoundary; ... + CommandId mycid; /* command id */ + ParallelCopyLineBoundaries line_boundaries; /* line array */ +} ParallelCopyShmInfo; There is no space between the closing brace and the structure name in the first case but it is in the second one. So, again this doesn't look consistent. 
I could also find this type of inconsistency in comments. See below: +/* It can hold upto 1 record information for worker to process. RINGSIZE + * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases is currently + * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */ +#define RINGSIZE (10 * 1000) ... +/* + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data + * block to process to avoid lock contention. Read RINGSIZE comments before + * changing this value. + */ +#define WORKER_CHUNK_COUNT 50 You may see these kinds of errors at other places as well if you scan through your patch. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Aug 19, 2020 at 11:51 AM vignesh C wrote: > > Thanks Greg for reviewing the patch. Please find my thoughts for your > comments. > > On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow wrote: > > Some further comments: > > > > (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch > > > > +/* > > + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM > > data > > + * block to process to avoid lock contention. This value should be > > divisible by > > + * RINGSIZE, as wrap around cases is currently not handled while selecting > > the > > + * WORKER_CHUNK_COUNT by the worker. > > + */ > > +#define WORKER_CHUNK_COUNT 50 > > > > > > "This value should be divisible by RINGSIZE" is not a correct > > statement (since obviously 50 is not divisible by 1). > > It should say something like "This value should evenly divide into > > RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT". > > > > Fixed. Changed it to RINGSIZE should be a multiple of WORKER_CHUNK_COUNT. > > > (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch > > > > (i) > > > > + /* > > +* If the data is present in current block > > lineInfo. line_size > > +* will be updated. 
If the data is spread > > across the blocks either > > > > Somehow a space has been put between "lineinfo." and "line_size". > > It should be: "If the data is present in current block > > lineInfo.line_size will be updated" > > Fixed, changed it to lineinfo->line_size. > > > > > (ii) > > > > >This is not possible because of pg_atomic_compare_exchange_u32, this > > >will succeed only for one of the workers whose line_state is > > >LINE_LEADER_POPULATED, for other workers it will fail. This is > > >explained in detail above ParallelCopyLineBoundary. > > > > Yes, but prior to that call to pg_atomic_compare_exchange_u32(), > > aren't you separately reading line_state and line_state, so that > > between those reads, it may have transitioned from leader to another > > worker, such that the read line state ("cur_line_state", being checked > > in the if block) may
Re: Parallel copy
Hi Bharath, On Tue, Sep 15, 2020 at 11:49 PM Bharath Rupireddy wrote: > > Few questions: > 1. Was the run performed with default postgresql.conf file? If not, > what are the changed configurations? Yes, just default settings. > 2. Are the readings for normal copy(190.891sec, mentioned by you > above) taken on HEAD or with patch, 0 workers? With patch >How much is the runtime > with your test case on HEAD(Without patch) and 0 workers(With patch)? TBH, I didn't test that. Looking at the changes, I wouldn't expect a degradation of performance for normal copy (you have tested, right?). > 3. Was the run performed on release build? For generating the perf data I sent (normal copy vs parallel copy with 1 worker), I used a debug build (-g -O0), as that is needed for generating all the relevant perf data for Postgres code. Previously I ran with a release build (-O2). > 4. Were the readings taken on multiple runs(say 3 or 4 times)? The readings I sent were from just one run (not averaged), but I did run the tests several times to verify the readings were representative of the pattern I was seeing. Fortunately I have been given permission to share the exact table definition and data I used, so you can check the behaviour and timings on your own test machine. Please see the attachment. You can create the table using the table.sql and index_4.sql definitions in the "sql" directory. The data.csv file (to be loaded by COPY) can be created with the included "dupdata" tool in the "input" directory, which you need to build, then run, specifying a suitable number of records and path of the template record (see README). Obviously the larger the number of records, the larger the file ... The table can then be loaded using COPY with "format csv" (and "parallel N" if testing parallel copy). Regards, Greg Nancarrow Fujitsu Australia <>
Re: Parallel copy
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow wrote: > > I couldn't use the original machine from which I obtained the previous > results, but ended up using a 4-core CentOS7 VM, which showed a > similar pattern in the performance results for this test case. > I obtained the following results from loading a 2GB CSV file (100 > rows, 4 indexes): > > Copy Type    Duration (s)    Load factor > === > Normal Copy    190.891    - > > Parallel Copy > (#workers) > 1    210.947    0.90 > Hi Greg, I tried to recreate the test case (attached) and I didn't find much difference with the custom postgresql.conf file. Test case: 25 tuples, 4 indexes (composite indexes with 10 columns each), 3.7GB, 100 columns (as suggested by you; all the varchar(255) columns contain 255 characters); execution time in seconds.

With custom postgresql.conf [1], data directory removed and recreated after every run (I couldn't perform the OS page cache flush for some reasons, so I chose to recreate the data directory instead, for testing purposes):
HEAD: 129.547, 128.624, 128.890
Patch, 0 workers: 130.213, 131.298, 130.555
Patch, 1 worker: 127.757, 125.560, 128.275

With default postgresql.conf, data directory removed and recreated after every run:
HEAD: 138.276, 150.472, 153.304
Patch, 0 workers: 162.468, 149.423, 159.137
Patch, 1 worker: 136.055, 144.250, 137.916

A few questions:
1. Was the run performed with the default postgresql.conf file? If not, what configurations were changed?
2. Are the readings for normal copy (190.891 sec, mentioned by you above) taken on HEAD or with the patch and 0 workers? How much is the runtime with your test case on HEAD (without patch) and with 0 workers (with patch)?
3. Was the run performed on a release build?
4. Were the readings taken over multiple runs (say 3 or 4 times)?
[1] - Postgres configuration used for above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com

DROP TABLE IF EXISTS t100_1;
CREATE TABLE t100_1 (
c1_s_1 bigserial,
c2_vc_1 varchar(255),
c3_d_1 date,
c4_n_1 numeric,
c5_vc_2 varchar(255),
c6_d_2 date,
c7_vc_3 varchar(255),
c8_t_1 time without time zone,
c9_vc_4 varchar(255),
c10_vc_5 varchar(255),
c11_t_2 time without time zone,
c12_vc_6 varchar(255),
c13_vc_7 varchar(255),
c14_n_2 numeric,
c15_d_3 date,
c16_t_3 time without time zone,
c17_n_3 numeric,
c18_vc_8 varchar(255),
c19_d_4 date,
c20_vc_9 varchar(255),
c21_t_4 time without time zone,
c22_vc_10 varchar(255),
c23_d_5 date,
c24_vc_11 varchar(255),
c25_t_5 time without time zone,
c26_t_6 time without time zone,
c27_vc_12 varchar(255),
c28_vc_13 varchar(255),
c29_d_6 date,
c30_vc_14 varchar(255),
c31_vc_15 varchar(255),
c32_t_7 time without time zone,
c33_t_8 time without time zone,
c34_vc_16 varchar(255),
c35_vc_17 varchar(255),
c36_d_7 date,
c37_d_8 date,
c38_vc_18 varchar(255),
c39_t_9 time without time zone,
c40_vc_19 varchar(255),
c41_vc_20 varchar(255),
c42_vc_21 varchar(255),
c43_vc_22 varchar(255),
c44_t_10 time without time zone,
c45_d_9 date,
c46_vc_23 varchar(255),
c47_t_11 time without time zone,
c48_vc_24 varchar(255),
c49_vc_25 varchar(255),
c50_vc_26 varchar(255),
c51_d_10 date,
c52_vc_27 varchar(255),
c53_vc_28 varchar(255),
c54_vc_29 varchar(255),
c55_vc_30 varchar(255),
c56_d_11 date,
c57_vc_31 varchar(255),
c58_vc_32 varchar(255),
c59_vc_33 varchar(255),
c60_vc_34 varchar(255),
c61_vc_35 varchar(255),
c62_vc_36 varchar(255),
c63_vc_37 varchar(255),
c64_vc_38 varchar(255),
c65_vc_39 varchar(255),
c66_n_4 numeric,
c67_vc_40 varchar(255),
c68_vc_41 varchar(255),
c69_vc_42 varchar(255),
c70_vc_43 varchar(255),
c71_n_5 numeric,
c72_vc_44 varchar(255),
c73_n_6 numeric,
c74_vc_45 varchar(255),
c75_vc_46 varchar(255),
c76_vc_47 varchar(255),
c77_n_7 numeric,
c78_n_8 numeric,
c79_d_12 date,
c80_n_9 numeric,
c81_vc_48 varchar(255),
c82_vc_49 varchar(255),
c83_vc_50 varchar(255),
c84_n_10 numeric,
c85_vc_51 varchar(255),
c86_vc_52 varchar(255),
c87_vc_53 varchar(255),
c88_d_13 date,
c89_n_13 numeric,
c90_vc_54 varchar(255),
c91_vc_55 varchar(255),
c92_n_11 numeric,
c93_n_12 numeric,
c94_vc_56 varchar(255),
Re: Parallel copy
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow wrote: > > Hi Vignesh, > > >Can you share with me the script you used to generate the data & the ddl of > >the table, so that it will help me check that >scenario you faced the > >>problem. > > Unfortunately I can't directly share it (considered company IP), > though having said that it's only doing something that is relatively > simple and unremarkable, so I'd expect it to be much like what you are > currently doing. I can describe it in general. > > The table being used contains 100 columns (as I pointed out earlier), > with the first column of "bigserial" type, and the others of different > types like "character varying(255)", "numeric", "date" and "time > without timezone". There's about 60 of the "character varying(255)" > overall, with the other types interspersed. > Thanks Greg for executing & sharing the results. I tried a similar test case to the one you suggested, but I was not able to reproduce the degradation scenario. If possible, can you run perf for the 1 worker and non-parallel scenarios and share the perf results? We will then be able to find out which functions are consuming more time by comparing the perf reports.

Steps for running perf:
1) Get the postgres backend pid
2) perf record -a -g -p <pid>
3) Run the copy command
4) Execute "perf report -g" once the copy finishes.

Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
>On Wed, Sep 2, 2020 at 3:40 PM vignesh C wrote: > I have attached the scripts that I used for the test results I > mentioned in my previous mail. create.sql file has the table that I > used, insert_data_gen.txt has the insert data generation scripts. I > varied the count in insert_data_gen to generate csv files of 1GB, 2GB > & 5GB & varied the data to generate 1 char, 10 char & 100 char for > each column for various testing. You can rename insert_data_gen.txt to > insert_data_gen.sh & generate the csv file. Hi Vignesh, I used your script and table definition, multiplying the number of records to produce a 5GB and 9.5GB CSV file. I got the following results:

(1) Postgres default settings, 5GB CSV (53 rows):

Copy Type       Duration (s)  Load factor
=========================================
Normal Copy     132.197       -

Parallel Copy
(#workers)
1               98.428        1.34
2               52.753        2.51
3               37.630        3.51
4               33.554        3.94
5               33.636        3.93
6               33.821        3.91
7               34.270        3.86
8               34.465        3.84
9               34.315        3.85
10              33.543        3.94

(2) Postgres increased resources, 5GB CSV (53 rows):
shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB

Copy Type       Duration (s)  Load factor
=========================================
Normal Copy     131.835       -

Parallel Copy
(#workers)
1               98.301        1.34
2               53.261        2.48
3               37.868        3.48
4               34.224        3.85
5               33.831        3.90
6               34.229        3.85
7               34.512        3.82
8               34.303        3.84
9               34.690        3.80
10              34.479        3.82

(3) Postgres default settings, 9.5GB CSV (100 rows):

Copy Type       Duration (s)  Load factor
=========================================
Normal Copy     248.503       -

Parallel Copy
(#workers)
1               185.724       1.34
2               99.832        2.49
3               70.560        3.52
4               63.328        3.92
5               63.182        3.93
6               64.108        3.88
7               64.131        3.87
8               64.350        3.86
9               64.293        3.87
10              63.818        3.89

(4) Postgres increased resources, 9.5GB CSV (100 rows):
shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size=2GB

Copy Type       Duration (s)  Load factor
=========================================
Normal Copy     248.647       -

Parallel Copy
(#workers)
1               182.236       1.36
2               92.814        2.68
3               67.347        3.69
4               63.839        3.89
5               62.672        3.97
6               63.873        3.89
7               64.930        3.83
8               63.885        3.89
9               62.397        3.98
10              64.477        3.86

So as you found, with this particular table definition and data, 1 parallel worker always performs better than normal copy. The different result obtained for this particular case seems to be caused by the following factors:
- different table definition (I used a variety of column types)
- amount of data per row (I used less data per row, so more rows per same size data file)

As I previously observed, if the target table has no indexes, increasing resources beyond the default settings makes little difference to the performance.

Regards, Greg Nancarrow Fujitsu Australia
Re: Parallel copy
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow wrote: > > Hi Vignesh, > > >Can you share with me the script you used to generate the data & the ddl of > >the table, so that it will help me check that >scenario you faced the > >>problem. > > Unfortunately I can't directly share it (considered company IP), > though having said that it's only doing something that is relatively > simple and unremarkable, so I'd expect it to be much like what you are > currently doing. I can describe it in general. > > The table being used contains 100 columns (as I pointed out earlier), > with the first column of "bigserial" type, and the others of different > types like "character varying(255)", "numeric", "date" and "time > without timezone". There's about 60 of the "character varying(255)" > overall, with the other types interspersed. > > When testing with indexes, 4 b-tree indexes were used that each > included the first column and then distinctly 9 other columns. > > A CSV record (row) template file was created with test data > (corresponding to the table), and that was simply copied and appended > over and over with a record prefix in order to create the test data > file. > The following shell-script basically does it (but very slowly). I was > using a small C program to do similar, a lot faster. > In my case, N=255 produced about a 5GB CSV file. > > file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out; > cat sample_record.csv >> $file_out; done > > One other thing I should mention is that between each test run, I > cleared the OS page cache, as described here: > https://linuxhint.com/clear_cache_linux/ > That way, each COPY FROM is not taking advantage of any OS-cached data > from a previous COPY FROM. I will try with a similar test and check if I can reproduce. > If your data is somehow significantly different and you want to (and > can) share your script, then I can try it in my environment. 
I have attached the scripts that I used for the test results I mentioned in my previous mail. create.sql file has the table that I used, insert_data_gen.txt has the insert data generation scripts. I varied the count in insert_data_gen to generate csv files of 1GB, 2GB & 5GB & varied the data to generate 1 char, 10 char & 100 char for each column for various testing. You can rename insert_data_gen.txt to insert_data_gen.sh & generate the csv file. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com #!/bin/bash for i in {1..10} do echo "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,12345678901234567890123456789012345678
90123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789
Re: Parallel copy
Hi Vignesh, >Can you share with me the script you used to generate the data & the ddl of >the table, so that it will help me check that >scenario you faced the >problem. Unfortunately I can't directly share it (considered company IP), though having said that it's only doing something that is relatively simple and unremarkable, so I'd expect it to be much like what you are currently doing. I can describe it in general. The table being used contains 100 columns (as I pointed out earlier), with the first column of "bigserial" type, and the others of different types like "character varying(255)", "numeric", "date" and "time without timezone". There's about 60 of the "character varying(255)" overall, with the other types interspersed. When testing with indexes, 4 b-tree indexes were used that each included the first column and then distinctly 9 other columns. A CSV record (row) template file was created with test data (corresponding to the table), and that was simply copied and appended over and over with a record prefix in order to create the test data file. The following shell-script basically does it (but very slowly). I was using a small C program to do similar, a lot faster. In my case, N=255 produced about a 5GB CSV file. file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out; cat sample_record.csv >> $file_out; done One other thing I should mention is that between each test run, I cleared the OS page cache, as described here: https://linuxhint.com/clear_cache_linux/ That way, each COPY FROM is not taking advantage of any OS-cached data from a previous COPY FROM. If your data is somehow significantly different and you want to (and can) share your script, then I can try it in my environment. Regards, Greg
Re: Parallel copy
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow wrote: > - Parallel Copy with 1 worker ran slower than normal Copy in a couple > of cases (I did question if allowing 1 worker was useful in my patch > review). Thanks Greg for your review & testing. I executed various tests with 1GB, 2GB & 5GB files with 100 columns, without parallel mode & with 1 parallel worker. The test results are given below:

Test                                                        Without parallel mode  With 1 parallel worker
1GB csv file, 100 columns (100 bytes data in each column)   62 seconds             47 seconds (1.32X)
1GB csv file, 100 columns (1000 bytes data in each column)  89 seconds             78 seconds (1.14X)
2GB csv file, 100 columns (1 byte data in each column)      277 seconds            256 seconds (1.08X)
5GB csv file, 100 columns (100 byte data in each column)    515 seconds            445 seconds (1.16X)

I have run the tests multiple times and noticed similar execution times in all the runs. In the above results there is a slight improvement with 1 worker; in my tests I did not observe any degradation for copy with 1 worker compared to non-parallel copy. Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check the scenario where you faced the problem. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Aug 27, 2020 at 4:56 PM vignesh C wrote: > > On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila wrote: > > > > On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow wrote: > > > > > > > I have attached new set of patches with the fixes. > > > > Thoughts? > > > > > > Hi Vignesh, > > > > > > I don't really have any further comments on the code, but would like > > > to share some results of some Parallel Copy performance tests I ran > > > (attached). > > > > > > The tests loaded a 5GB CSV data file into a 100 column table (of > > > different data types). The following were varied as part of the test: > > > - Number of workers (1 – 10) > > > - No indexes / 4-indexes > > > - Default settings / increased resources (shared_buffers,work_mem, etc.) > > > > > > (I did not do any partition-related tests as I believe those type of > > > tests were previously performed) > > > > > > I built Postgres (latest OSS code) with the latest Parallel Copy patches > > > (v4). > > > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM. > > > > > > > > > I observed the following trends: > > > - For the data file size used, Parallel Copy achieved best performance > > > using about 9 – 10 workers. Larger data files may benefit from using > > > more workers. However, I couldn’t really see any better performance, > > > for example, from using 16 workers on a 10GB CSV data file compared to > > > using 8 workers. Results may also vary depending on machine > > > characteristics. > > > - Parallel Copy with 1 worker ran slower than normal Copy in a couple > > > of cases (I did question if allowing 1 worker was useful in my patch > > > review). > > > > I think the reason is that for 1 worker case there is not much > > parallelization as a leader doesn't perform the actual load work. > > Vignesh, can you please once see if the results are reproducible at > > your end, if so, we can once compare the perf profiles to see why in > > some cases we get improvement and in other cases not. 
Based on that we > > can decide whether to allow the 1 worker case or not. > > > > I will spend some time on this and update. > Thanks. -- With Regards, Amit Kapila.
Re: Parallel copy
On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila wrote: > > On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow wrote: > > > > > I have attached new set of patches with the fixes. > > > Thoughts? > > > > Hi Vignesh, > > > > I don't really have any further comments on the code, but would like > > to share some results of some Parallel Copy performance tests I ran > > (attached). > > > > The tests loaded a 5GB CSV data file into a 100 column table (of > > different data types). The following were varied as part of the test: > > - Number of workers (1 – 10) > > - No indexes / 4-indexes > > - Default settings / increased resources (shared_buffers,work_mem, etc.) > > > > (I did not do any partition-related tests as I believe those type of > > tests were previously performed) > > > > I built Postgres (latest OSS code) with the latest Parallel Copy patches > > (v4). > > The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM. > > > > > > I observed the following trends: > > - For the data file size used, Parallel Copy achieved best performance > > using about 9 – 10 workers. Larger data files may benefit from using > > more workers. However, I couldn’t really see any better performance, > > for example, from using 16 workers on a 10GB CSV data file compared to > > using 8 workers. Results may also vary depending on machine > > characteristics. > > - Parallel Copy with 1 worker ran slower than normal Copy in a couple > > of cases (I did question if allowing 1 worker was useful in my patch > > review). > > I think the reason is that for 1 worker case there is not much > parallelization as a leader doesn't perform the actual load work. > Vignesh, can you please once see if the results are reproducible at > your end, if so, we can once compare the perf profiles to see why in > some cases we get improvement and in other cases not. Based on that we > can decide whether to allow the 1 worker case or not. > I will spend some time on this and update. 
Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Re: Parallel copy
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow wrote:
>
> > I have attached new set of patches with the fixes.
> > Thoughts?
>
> Hi Vignesh,
>
> I don't really have any further comments on the code, but would like
> to share some results of some Parallel Copy performance tests I ran
> (attached).
>
> The tests loaded a 5GB CSV data file into a 100-column table (of
> different data types). The following were varied as part of the test:
> - Number of workers (1 – 10)
> - No indexes / 4 indexes
> - Default settings / increased resources (shared_buffers, work_mem, etc.)
>
> (I did not do any partition-related tests, as I believe those types of
> tests were previously performed.)
>
> I built Postgres (latest OSS code) with the latest Parallel Copy
> patches (v4). The test system was a 32-core Intel Xeon E5-4650 server
> with 378GB of RAM.
>
> I observed the following trends:
>
> - For the data file size used, Parallel Copy achieved its best
> performance using about 9 – 10 workers. Larger data files may benefit
> from using more workers. However, I couldn't really see any better
> performance, for example, from using 16 workers on a 10GB CSV data
> file compared to using 8 workers. Results may also vary depending on
> machine characteristics.
>
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).

I think the reason is that in the 1-worker case there is not much
parallelization, as the leader doesn't perform the actual load work.
Vignesh, can you please check whether the results are reproducible at
your end? If so, we can compare the perf profiles to see why some cases
show an improvement and others don't. Based on that we can decide
whether to allow the 1-worker case or not.

> - Typical load time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.

Nice improvement, and I think you are right that with larger load data
we will get even better improvement.

--
With Regards,
Amit Kapila.
Re: Parallel copy
> I have attached new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like
to share some results of some Parallel Copy performance tests I ran
(attached).

The tests loaded a 5GB CSV data file into a 100-column table (of
different data types). The following were varied as part of the test:

- Number of workers (1 – 10)
- No indexes / 4 indexes
- Default settings / increased resources (shared_buffers, work_mem, etc.)

(I did not do any partition-related tests, as I believe those types of
tests were previously performed.)

I built Postgres (latest OSS code) with the latest Parallel Copy
patches (v4). The test system was a 32-core Intel Xeon E5-4650 server
with 378GB of RAM.

I observed the following trends:

- For the data file size used, Parallel Copy achieved its best
performance using about 9 – 10 workers. Larger data files may benefit
from using more workers. However, I couldn't really see any better
performance, for example, from using 16 workers on a 10GB CSV data file
compared to using 8 workers. Results may also vary depending on machine
characteristics.

- Parallel Copy with 1 worker ran slower than normal Copy in a couple
of cases (I did question if allowing 1 worker was useful in my patch
review).

- Typical load time improvement (load factor) for Parallel Copy was
between 2x and 3x. Better load factors can be obtained by using larger
data files and/or more indexes.

- Increasing Postgres resources made little or no difference to
Parallel Copy performance when the target table had no indexes.
Increasing Postgres resources improved Parallel Copy performance when
the target table had indexes.
Regards,
Greg Nancarrow
Fujitsu Australia

(1) Postgres default settings, 5GB CSV (510 rows), no indexes on table:

Copy Type                  Duration (s)   Load factor
=====================================================
Normal Copy                132.838        -

Parallel Copy (#workers)
1                          97.537         1.36
2                          61.700         2.15
3                          52.788         2.52
4                          46.607         2.85
5                          45.524         2.92
6                          43.799         3.03
7                          42.970         3.09
8                          42.974         3.09
9                          43.698         3.04
10                         43.362         3.06

(2) Postgres default settings, 5GB CSV (510 rows), 4 indexes on table:

Copy Type                  Duration (s)   Load factor
=====================================================
Normal Copy                221.111        -

Parallel Copy (#workers)
1                          331.609        0.66
2                          99.085         2.23
3                          89.751         2.46
4                          81.137         2.73
5                          79.138         2.79
6                          77.155         2.87
7                          75.813         2.92
8                          74.961         2.95
9                          77.803         2.84
10                         75.399         2.93

(3) Postgres increased resources, 5GB CSV (510 rows), no indexes on table:

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB

Copy Type                  Duration (s)   Load factor
=====================================================
Normal Copy                78.138         -

Parallel Copy (#workers)
1                          95.203         0.82
2                          62.596         1.24
3                          52.318         1.49
4                          48.246         1.62
5                          42.832         1.82
6                          42.921         1.82
7                          43.146         1.81
8                          41.557         1.88
9                          43.489         1.80
10                         43.362         1.80

(4) Postgres increased resources, 5GB CSV (510 rows), 4 indexes on table:

Copy Type                  Duration (s)   Load factor
Re: Parallel copy
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50

"This value should be divisible by RINGSIZE" is not a correct
statement (since obviously 50 is not divisible by RINGSIZE). It should
say something like "This value should evenly divide into RINGSIZE", or
"RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".

(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+	/*
+	 * If the data is present in current block lineInfo. line_size
+	 * will be updated. If the data is spread across the blocks either

Somehow a space has been put between "lineInfo." and "line_size". It
should be: "If the data is present in current block lineInfo.line_size
will be updated".

(ii)

> This is not possible because of pg_atomic_compare_exchange_u32, this
> will succeed only for one of the workers whose line_state is
> LINE_LEADER_POPULATED, for other workers it will fail. This is
> explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
aren't you separately reading line_size and line_state, so that
between those reads, the state may have transitioned from leader to
another worker, such that the read line state ("cur_line_state", being
checked in the if block) may not actually match what is now in
line_state, and/or the read line_size ("dataSize") doesn't actually
correspond to the read line state? (Sorry, still not 100% convinced
that the synchronization and checks are safe in all cases.)

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

> raw_buf is not used in parallel copy, instead raw_buf will be pointing
> to shared memory data blocks.
> This memory was allocated as part of BeginCopyFrom; until this point
> we cannot be 100% sure, as the copy can be performed sequentially
> (as in the case where max_worker_processes is not available). If it
> switches to sequential mode, raw_buf will be used while performing
> the copy operation. At this place we can safely free this memory
> that was allocated.

So the following code (which checks raw_buf, which still points to
memory that has been pfreed) is still valid?

In the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ?
    &pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in the
debugger when I debugged an instance of Parallel Copy, so it definitely
gets executed. It makes me wonder what other code could possibly be
checking raw_buf and using it in some way, when in fact what it points
to has been pfreed.

Are you able to add the following line of code, or will it (somehow)
break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;    <=== I suggest that this line is added

Regards,
Greg Nancarrow
Fujitsu Australia
Re: Parallel copy
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          tested, passed
Documentation:           tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that
this patch is updating and interacting with, so I'm still testing and
debugging portions of this patch, but would like to give feedback on
what I've noticed so far.

I have done some ad-hoc testing of the patch using parallel copies
from text/csv/binary files and have not yet struck any execution
problems, other than some option validation and associated error
messages on boundary cases.

One general question that I have: is there a user benefit (over the
normal non-parallel COPY) to allowing "COPY ... FROM ... WITH
(PARALLEL 1)"?

My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are
following what is already there), I can't help noticing the use of
numerous macros in src/backend/commands/copy.c which paste in multiple
lines of code in various places. It's getting a little out-of-hand.
Surely the majority of these would be better as inline functions?
Perhaps this hasn't been done because too many parameters need to be
passed - thoughts?

(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) Minor point: there are some tabbing/spacing issues in this patch
(and the other patches), affecting alignment, e.g. mixed tabs/spaces
and misalignment in the PARALLEL_COPY_KEY_xxx definitions.

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50

"This value should be mode of RINGSIZE ..." -> typo: mode (mod? should
evenly divide into RINGSIZE?)

(iii)

+ * using pg_atomic_compare_exchange_u32, worker will change the sate to

-> typo: sate (should be "state")

(iv)

+ errmsg("parallel option supported only for copy from"),

-> suggest change to:

errmsg("parallel option is supported only for COPY FROM"),

(v)

+ errno = 0; /* To distinguish success/failure after call */
+ val = strtol(str, &endptr, 10);
+
+ /* Check for various possible errors */
+ if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+     || (errno != 0 && val == 0) ||
+     *endptr)
+     ereport(ERROR,
+             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+              errmsg("improper use of argument to option \"%s\"",
+                     defel->defname),
+              parser_errposition(pstate, defel->location)));
+
+ if (endptr == str)
+     ereport(ERROR,
+             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+              errmsg("no digits were found in argument to option \"%s\"",
+                     defel->defname),
+              parser_errposition(pstate, defel->location)));
+
+ cstate->nworkers = (int) val;
+
+ if (cstate->nworkers <= 0)
+     ereport(ERROR,
+             (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+              errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+                     defel->defname),
+              parser_errposition(pstate, defel->location)));

I think this validation code needs to be improved, including the error
messages (e.g. when can a "positive integer" NOT be greater than
zero?). There is some overlap in the "no digits were found" case
between the two conditions above, depending, for example, on whether
the argument is quoted. Also, "improper use of argument to option"
sounds a bit odd and vague to me. Finally, not range-checking before
casting long to int can lead to allowing out-of-range int values, as in
the following case:

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483648');
ERROR:  argument to option "parallel" must be a positive integer greater than zero
LINE 1: copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2...
Re: Parallel copy
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra wrote:
>
> On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> > On Sat, Aug 1, 2020 at 9:55 AM vignesh C wrote:
> > >
> > > The patches were not applying because of the recent commits.
> > > I have rebased the patch over head & attached.
> >
> > I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
> >
> > Putting together all the patches rebased on to the latest commit
> > b8fdee7d0ca8bd2165d46fb1468f75571b706a01. Patches from 0001 to 0005
> > are rebased by Vignesh, that are from the previous mail, and the patch
> > 0006 is rebased by me.
> >
> > Please consider this patch set for further review.
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.

Sure, we will take care of this when we are sending the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com