Re: Re: Perform COPY FROM encoding conversions in larger chunks

2021-05-01 Thread Chapman Flack
On 04/01/21 05:27, Heikki Linnakangas wrote:
> I read through the patches one more time, made a few small comment fixes,
> and pushed.

Wow, this whole thread escaped my attention at the time, though my ears
would have perked right up if the subject had been something like
'improve encoding conversion API to stream a buffer at a time'. I think
this is of great interest beyond one particular use case in COPY FROM.
For example, it could limit the allocations needed when streaming a large
text value out to a client; it might be used to advantage with the recent
work in incrementally detoasting large values, and so on.

This part seems a little underdeveloped:

> * TODO: The conversion function interface is not great.  Firstly, it
> * would be nice to pass through the destination buffer size to the
> * conversion function, so that if you pass a shorter destination buffer, it
> * could still continue to fill up the whole buffer.  Currently, we have to
> * assume worst case expansion and stop the conversion short, even if there
> * is in fact space left in the destination buffer.  Secondly, it would be
> * nice to return the number of bytes written to the caller, to avoid a call
> * to strlen().

If I understand correctly, this patch already makes a breaking change to
the conversion function API. If that's going to be the case anyway, I wonder
whether it's worth going further and changing the API to eliminate this
odd limitation.

There seems to be a sort of common shape that conversion APIs have evolved
toward, that can be seen in both the ICU4C converters [0] and in Java's [1].
This current tweak to our conversion API seems to get almost there,
but just not quite. For example, noError allows us to keep control when
the function has stopped converting, but we don't find out why it
stopped.

If we just went the rest of the way and structured the API like those
existing ones, then:

- it would be super easy to write wrappers around ICU4C converters, if
  there were any we wanted to use;

- I could very easily write wrappers presenting any PG-supported charset
  as a Java charset.

The essence of the API common to ICU4C and Java is this:

1. You pass the function the address and length of a source buffer,
   the address and length of a destination buffer, and a flag that is true
   if you know there is no further input where this source buffer came from.
   (It's allowable to pass false and only then discover you really have no
   more input after all; then you just make one final call passing true.)

2. The function eats as much as it can of the source buffer, fills as much
   as it can of the destination buffer, and returns indicating one of four
   reasons it stopped:

   underflow - ran out of source buffer
   overflow  - ran out of destination buffer
   malformed - something in source buffer isn't valid in that representation
   unmappable - a valid codepoint not available in destination encoding

   Based on that, the caller refills the source buffer, or drains the
   destination buffer, or handles or reports the malformed or unmappable,
   then repeats.

3. The function should update pointers on return to indicate how much
   of the source buffer it consumed and how much of the destination buffer
   it filled.

4. If it left any bytes unconsumed in the source buffer, the caller must
   preserve them (perhaps moved to the front) for the next call.

5. The converter can have internal state (so it is an object in Java, or
   has a UConverter struct allocated in ICU4C, to have a place for its
   state). The state gets flushed on the final call where the flag is
   passed true. In many cases, the converter can be implemented without
   keeping internal state, if it simply leaves, for example, an
   incomplete sequence at the end of the source buffer unconsumed, so the
   caller will move it to the front and supply the rest. On the other hand,
   any unconsumed input after the final call with flush flag true must be
   treated as malformed.

6. On a malformed or unmappable return, the source buffer is left pointed
   at the start of the offending sequence and the length in bytes of
   that sequence is available for error reporting/recovery.
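
To make that shape concrete, here is a rough C sketch of what such an entry
point could look like for us. The names and exact signature here are only
illustrative, not a worked-out proposal:

#include <stdbool.h>

typedef enum
{
    CONV_SRC_UNDERFLOW,         /* consumed all of the source buffer */
    CONV_DST_OVERFLOW,          /* destination buffer is full */
    CONV_MALFORMED,             /* invalid byte sequence in the source */
    CONV_UNMAPPABLE             /* no equivalent in the destination encoding */
} ConvResult;

typedef struct ConvState ConvState;     /* converter-private state, if any */

/*
 * Convert as much of *src .. src_end as will fit into *dst .. dst_end.
 * On return, *src and *dst have been advanced past the bytes consumed
 * and produced.  With flush = true the caller promises there is no more
 * input, so any pending partial sequence is reported as malformed.  On
 * CONV_MALFORMED or CONV_UNMAPPABLE, *src is left pointing at the
 * offending sequence and *bad_len gives its length for error reporting.
 */
extern ConvResult conv_stream(ConvState *state,
                              const unsigned char **src,
                              const unsigned char *src_end,
                              unsigned char **dst,
                              unsigned char *dst_end,
                              bool flush,
                              int *bad_len);

The caller's loop is then the obvious one: call conv_stream(), and depending
on the result either refill the source buffer, drain the destination buffer,
or handle/report the error, until a final call with flush = true returns
CONV_SRC_UNDERFLOW.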

The efficient handling of states, returning updated pointers, and so on,
probably requires a function signature with 'internal' in it ... but
the current function signature already has 'internal', so that doesn't
seem like a deal-breaker.


Thoughts? It seems a shame to make a breaking change in the conversion
API, only to still end up with an API that "is not great" and is still
impedance-mismatched to other existing prominent conversion APIs.

Regards,
-Chap


[0]
https://unicode-org.github.io/icu/userguide/conversion/converters.html#3-buffered-or-streamed

[1]
https://docs.oracle.com/javase/9/docs/api/java/nio/charset/CharsetDecoder.html#decode-java.nio.ByteBuffer-java.nio.CharBuffer-boolean-




Re: Perform COPY FROM encoding conversions in larger chunks

2021-04-01 Thread Heikki Linnakangas

On 01/04/2021 11:09, Heikki Linnakangas wrote:

On 18/03/2021 20:05, John Naylor wrote:

I wrote:

  > I went ahead and rebased these.


Thanks!


I read through the patches one more time, made a few small comment 
fixes, and pushed.


- Heikki




Re: Perform COPY FROM encoding conversions in larger chunks

2021-04-01 Thread Heikki Linnakangas

On 18/03/2021 20:05, John Naylor wrote:

I wrote:

 > I went ahead and rebased these.


Thanks!

I also wanted to see if this patch set had any performance effect, with 
and without changing how UTF-8 is validated, using the blackhole am from 
https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.


create extension blackhole_am;
create table blackhole_tab (a text) using blackhole_am ;
time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"

where test-copy.txt is made by

for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ;  done;

On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):

master:
109ms

v6 do encoding in larger chunks:
109ms

v7 utf8 SIMD:
98ms


That's disappointing. Perhaps the file size is just too small to see the 
effect? I'm seeing results between 40 ms and 75 ms on my laptop when I 
run a test like that multiple times. I used "WHERE false" instead of the 
blackhole AM but I don't think that makes much difference (only showing 
a few runs here for brevity):


for i in {1..100}; do cat /tmp/utf8.html >> /tmp/test-copy.txt ;  done;

postgres=# create table blackhole_tab (a text) ;
CREATE TABLE
postgres=# \timing
Timing is on.
postgres=# copy blackhole_tab  from '/tmp/test-copy.txt' where false;
COPY 0
Time: 53.166 ms
postgres=# copy blackhole_tab  from '/tmp/test-copy.txt' where false;
COPY 0
Time: 43.981 ms
postgres=# copy blackhole_tab  from '/tmp/test-copy.txt' where false;
COPY 0
Time: 71.850 ms
postgres=# copy blackhole_tab  from '/tmp/test-copy.txt' where false;
COPY 0
...

I tested that with a larger file:

for i in {1..1}; do cat /tmp/utf8.html >> /tmp/test-copy.txt ;  done;
postgres=# copy blackhole_tab  from '/tmp/test-copy.txt' where false;

v6 do encoding in larger chunks (best of five):
Time: 3955.514 ms (00:03.956)

master (best of five):
Time: 4133.767 ms (00:04.134)

So with that, I'm seeing a measurable difference.

- Heikki




Re: Perform COPY FROM encoding conversions in larger chunks

2021-03-18 Thread John Naylor
On Thu, Mar 18, 2021 at 2:05 PM John Naylor wrote:
>
> I wrote:
>
> > I went ahead and rebased these.
>
> It looks like FreeBSD doesn't like this for some reason.

On closer examination, make check was "terminated", not that the tests
failed...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform COPY FROM encoding conversions in larger chunks

2021-03-18 Thread John Naylor
I wrote:

> I went ahead and rebased these.

It looks like FreeBSD doesn't like this for some reason.

I also wanted to see if this patch set had any performance effect, with and
without changing how UTF-8 is validated, using the blackhole am from
https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.

create extension blackhole_am;
create table blackhole_tab (a text) using blackhole_am ;
time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"

where test-copy.txt is made by

for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ;  done;

On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):

master:
109ms

v6 do encoding in larger chunks:
109ms

v7 utf8 SIMD:
98ms

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform COPY FROM encoding conversions in larger chunks

2021-02-09 Thread Heikki Linnakangas

On 09/02/2021 15:40, John Naylor wrote:
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas wrote:

 >
 > On 02/02/2021 23:42, John Naylor wrote:
 > >
 > > In copyfromparse.c, this is now out of date:
 > >
 > >   * Read the next input line and stash it in line_buf, with conversion to
 > >   * server encoding.

This comment for CopyReadLine() is still there. Conversion already 
happened by now, so I think this comment is outdated.


Other than that, I think this is ready for commit.


Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1" 
somehow went missing again. That was causing a failure on Windows at 
cfbot.cputube.org.


I'll read through this one more time with fresh eyes tomorrow or the day 
after, and push. Thanks for all the review!


- Heikki




Re: Perform COPY FROM encoding conversions in larger chunks

2021-02-09 Thread John Naylor
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas  wrote:
>
> On 02/02/2021 23:42, John Naylor wrote:
> >
> > In copyfromparse.c, this is now out of date:
> >
> >   * Read the next input line and stash it in line_buf, with conversion to
> >   * server encoding.

This comment for CopyReadLine() is still there. Conversion already happened
by now, so I think this comment is outdated.

Other than that, I think this is ready for commit.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform COPY FROM encoding conversions in larger chunks

2021-02-02 Thread John Naylor
On Mon, Feb 1, 2021 at 12:15 PM Heikki Linnakangas  wrote:
> Thanks. I fixed it slightly differently, and also changed LocalToUtf()
> to follow the same pattern, even though LocalToUtf() did not have the
> same bug.

Looks good to me.

> I added a bunch of tests for various built-in conversions.

Nice! I would like to have utf8 tests for every category of invalid byte
(overlong, surrogate, 5 bytes, etc), but it's not necessary for this patch.

> I spent some time refactoring and adding comments all around the patch,
> hopefully making it all more clear. One notable difference is that I
> renamed 'raw_buf' (which exists in master too) to 'input_buf', and
> renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
> patch again another day with fresh eyes, and also try to add some tests
> for the corner cases at buffer boundaries.

The comments and renaming are really helpful in understanding that file!

Although a new patch is likely forthcoming, I did take a brief look and
found the following:


In copyfromparse.c, this is now out of date:

 * Read the next input line and stash it in line_buf, with conversion to
 * server encoding.


One of your FIXME comments seems to allude to this, but if we really need a
difference here, maybe it should be explained:

+#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */

+#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */


Lastly, it looks like pg_do_encoding_conversion_buf() ended up in 0003
accidentally?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform COPY FROM encoding conversions in larger chunks

2021-01-30 Thread John Naylor
On Thu, Jan 28, 2021 at 7:36 AM Heikki Linnakangas  wrote:
>
> Even more surprising was that the second patch
> (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
> actually made things worse again. I thought it would give a modest gain,
> but nope.

Hmm, that surprised me too.

> Based on these results, I'm going to commit the first patch, but not the
> second one. There are much faster UTF-8 verification routines out there,
> using SIMD instructions and whatnot, and we should consider adopting one
> of those, but that's future work.

I have something in mind for that.

I took a look at v2, and for the first encoding I tried, it fails to report
the error for invalid input:

create database euctest WITH ENCODING 'EUC_CN' LC_COLLATE='zh_CN.eucCN'
LC_CTYPE='zh_CN.eucCN' TEMPLATE=template0;

\c euctest
create table foo (a text);

master:

euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
ERROR:  character with byte sequence 0xc3 0xa4 in encoding "UTF8" has no
equivalent in encoding "EUC_CN"
CONTEXT:  COPY foo, line 1

patch:

euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
COPY 0
euctest=#

I believe the problem is in UtfToLocal(). I've attached a fix formatted as
a text file to avoid confusing the cfbot. The fix keeps the debugging
ereport() in case you find it useful. Some additional test coverage might
be good here, but not sure how much work that would be. I didn't check any
other conversions yet.


v2-0002 seems fine to me, I just have cosmetic comments here:

+ * the same, no conversion is required by we must still validate that the

s/by/but/

This comment in copyfrom_internal.h above the *StateData struct is the same
as the corresponding one in copyto.c:

 * Multi-byte encodings: all supported client-side encodings encode multi-byte
 * characters by having the first byte's high bit set. Subsequent bytes of the
 * character can have the high bit not set. When scanning data in such an
 * encoding to look for a match to a single-byte (ie ASCII) character, we must
 * use the full pg_encoding_mblen() machinery to skip over multibyte
 * characters, else we might find a false match to a trailing byte. In
 * supported server encodings, there is no possibility of a false match, and
 * it's faster to make useless comparisons to trailing bytes than it is to
 * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
 * when we have to do it the hard way.

The references to pg_encoding_mblen() and encoding_embeds_ascii, are out of
date for copy-from. I'm not sure the rest is relevant to copy-from anymore,
either. Can you confirm?

This comment inside the struct is now out of date as well:

 * Similarly, line_buf holds the whole input line being processed. The
 * input cycle is first to read the whole line into line_buf, convert it
 * to server encoding there, and then extract the individual attribute

HEAD has this macro already:

/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)

It might make sense to create a CONVERSION_BUF_BYTES equivalent since the
patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index
in a couple places.
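
Something like this, purely as a sketch mirroring the existing macro (the
name is just a suggestion):

/* Shorthand for number of unconverted bytes available in conversion_buf */
#define CONVERSION_BUF_BYTES(cstate) \
    ((cstate)->conversion_buf_len - (cstate)->conversion_buf_index)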

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 60dfebb0bd..1681efc7a0 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -397,6 +397,9 @@ CopyConvertBuf(CopyFromState cstate)
                        (unsigned char *) cstate->raw_buf + cstate->raw_buf_len,
                        dstlen,
                        true);
+
+       ereport(NOTICE, errmsg_internal("convertedbytes: %d", convertedbytes));
+
        if (convertedbytes == 0)
        {
                /*
diff --git a/src/backend/utils/mb/conv.c b/src/backend/utils/mb/conv.c
index b83358bc7a..26600e68ee 100644
--- a/src/backend/utils/mb/conv.c
+++ b/src/backend/utils/mb/conv.c
@@ -497,7 +497,7 @@ UtfToLocal(const unsigned char *utf, int len,
int l;
const pg_utf_to_local_combined *cp;
const unsigned char *start = utf;
-   const unsigned char *cur = utf;
+   unsigned char *cur = unconstify(unsigned char *, utf);
 
if (!PG_VALID_ENCODING(encoding))
ereport(ERROR,
@@ -511,8 +511,6 @@ UtfToLocal(const unsigned char *utf, int len,
unsigned char b3 = 0;
unsigned char b4 = 0;
 
-   cur = iso;
-
   

Re: Perform COPY FROM encoding conversions in larger chunks

2021-01-28 Thread Heikki Linnakangas

On 28/01/2021 01:23, John Naylor wrote:

Hi Heikki,

0001 through 0003 are straightforward, and I think they can be committed 
now if you like.


0004 is also pretty straightforward. The check you proposed upthread for 
pg_upgrade seems like the best solution to make that workable. I'll take 
a look at 0005 soon.


I measured the conversions that were rewritten in 0003, and there is 
indeed a noticeable speedup:


Big5 to EUC-TW:

head    196ms
0001-3  152ms

EUC-TW to Big5:

head    190ms
0001-3  144ms

I've attached the driver function for reference. Example use:

select drive_conversion(
   1000, 'euc_tw'::name, 'big5'::name,
   convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);


Thanks! I have committed patches 0001 and 0003 in this series, with 
minor comment fixes. Next I'm going to write the pg_upgrade check for 
patch 0004, to get that into a committable state too.


I took a look at the test suite also, and the only thing to note is a 
couple places where the comment doesn't match the code:


+  -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+  byte1 = hex('0e');
+  for byte2 in hex('a1')..hex('df') loop
+    return next b(byte1, byte2);
+  end loop;
+
+  -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+  byte1 = hex('0f');
+  for byte2 in hex('a1')..hex('fe') loop
+    for byte3 in hex('a1')..hex('fe') loop
+      return next b(byte1, byte2, byte3);
+    end loop;
+  end loop;

Not sure if it matters, but thought I'd mention it anyway.


Good catch! The comments were correct, and the tests were wrong, not 
testing those 2- and 3-byte encoded characters as intended. It doesn't 
matter for testing this patch, I only included those euc_jis_2004 tests 
for the sake of completeness, but if someone finds this test suite in 
the archives and wants to use it for something real, make sure you fix 
that first.


- Heikki




Re: Perform COPY FROM encoding conversions in larger chunks

2021-01-28 Thread Heikki Linnakangas

On 28/01/2021 01:23, John Naylor wrote:

Hi Heikki,

0001 through 0003 are straightforward, and I think they can be committed 
now if you like.


Thanks for the review!

I did some more rigorous microbenchmarking of patch 1 and 2. I used the 
attached test script, which calls convert_from() function to perform 
UTF-8 verification on two large strings, about 60kb each. One of the 
strings is pure ASCII, and the other is an HTML page that contains a mix 
of ASCII and multibyte characters.


Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)

           | mixed | ascii
-----------+-------+-------
 master    |  1866 |  1250
 patch 1   |   959 |   507
 patch 1+2 |  1396 |   987

So, the first patch, 
0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge 
difference, even with pure ASCII input. That's very surprising, because 
there is already a fast-path for pure-ASCII input in pg_verify_mbstr_len().


Even more surprising was that the second patch 
(0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch) 
actually made things worse again. I thought it would give a modest gain, 
but nope.


It seems to me that GCC is not doing a good job of optimizing the loop in 
pg_verify_mbstr(). The first patch fixes that, but the second patch 
somehow trips up GCC again.


So I also tried this with "gcc -O3" and clang:

Compiled with "gcc -O3"

           | mixed | ascii
-----------+-------+-------
 master    |  1522 |  1225
 patch 1   |   753 |   507
 patch 1+2 |   868 |   507

Compiled with "clang -O2", Debian clang version 11.0.1-2

           | mixed | ascii
-----------+-------+-------
 master    |  1257 |   520
 patch 1   |   899 |   507
 patch 1+2 |   884 |   508

With gcc -O3, the results are a bit better, but the second patch still seems 
harmful. With clang, I got the result I expected: Almost no difference 
with pure-ASCII input, because there's already a fast-path for that, and 
a nice speedup with multibyte characters. Still, I was surprised how big 
the speedup from the first patch was, and how little additional gain the 
second patch gives.


Based on these results, I'm going to commit the first patch, but not the 
second one. There are much faster UTF-8 verification routines out there, 
using SIMD instructions and whatnot, and we should consider adopting one 
of those, but that's future work.
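
Just to illustrate the kind of trick those routines use (this is not from
any of the patches here, and assumes an x86-64 target with SSE2): an
all-ASCII fast path can test 16 bytes per iteration by collecting the high
bits of a 16-byte chunk with a movemask, falling back to a byte-wise
validator only when that fails:

#include <emmintrin.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustration only: return true if the buffer is pure ASCII, checking
 * 16 bytes at a time.  A real validator would switch to a full UTF-8
 * check as soon as this returns false.
 */
static bool
is_all_ascii_sse2(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    for (; i + 16 <= len; i += 16)
    {
        __m128i     chunk = _mm_loadu_si128((const __m128i *) (s + i));

        /* movemask gathers each byte's high bit; nonzero means non-ASCII */
        if (_mm_movemask_epi8(chunk) != 0)
            return false;
    }
    for (; i < len; i++)
    {
        if (s[i] & 0x80)
            return false;
    }
    return true;
}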


- Heikki


mbverifystr-speed.sql
Description: application/sql


Re: Perform COPY FROM encoding conversions in larger chunks

2021-01-27 Thread John Naylor
Hi Heikki,

0001 through 0003 are straightforward, and I think they can be committed
now if you like.

0004 is also pretty straightforward. The check you proposed upthread for
pg_upgrade seems like the best solution to make that workable. I'll take a
look at 0005 soon.

I measured the conversions that were rewritten in 0003, and there is indeed
a noticeable speedup:

Big5 to EUC-TW:

head    196ms
0001-3  152ms

EUC-TW to Big5:

head    190ms
0001-3  144ms

I've attached the driver function for reference. Example use:

select drive_conversion(
  1000, 'euc_tw'::name, 'big5'::name,
  convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);

I took a look at the test suite also, and the only thing to note is a
couple places where the comment doesn't match the code:

+  -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+  byte1 = hex('0e');
+  for byte2 in hex('a1')..hex('df') loop
+    return next b(byte1, byte2);
+  end loop;
+
+  -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+  byte1 = hex('0f');
+  for byte2 in hex('a1')..hex('fe') loop
+    for byte3 in hex('a1')..hex('fe') loop
+      return next b(byte1, byte2, byte3);
+    end loop;
+  end loop;

Not sure if it matters, but thought I'd mention it anyway.

--
John Naylor
EDB: http://www.enterprisedb.com


drive_conversion.c
Description: Binary data


Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-23 Thread John Naylor
On Wed, Dec 23, 2020 at 3:41 AM Heikki Linnakangas  wrote:

>
> I'm not sure it's worth the trouble, though. Custom conversions are very
> rare. And I don't think any other object can depend on a conversion, so
> you can always drop the conversion before upgrade, and re-create it with
> the new function signature afterwards. A note in the release notes and a
> check in pg_upgrade, with instructions to drop and recreate the
> conversion, are probably enough.
>

That was my thought as well.

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-22 Thread Heikki Linnakangas

On 22/12/2020 22:01, John Naylor wrote:
In 0004, it seems you have some doubts about upgrade compatibility. Is 
that because user-defined conversions would no longer have the right 
signature?


Exactly. If you have an extension that adds a custom conversion function 
and does CREATE CONVERSION, the old installation script will fail on the 
new version. That causes trouble for pg_dump+restore and pg_upgrade.


Perhaps we could accept the old signature in the server when you do 
CREATE CONVERSION, but somehow mark the conversion as broken in the 
catalog so that you would get a runtime error if you tried to use it. 
That would be enough to make pg_dump+restore (and pg_upgrade) not throw 
an error, and you could then upgrade the extension later (ALTER 
EXTENSION UPDATE).


I'm not sure it's worth the trouble, though. Custom conversions are very 
rare. And I don't think any other object can depend on a conversion, so 
you can always drop the conversion before upgrade, and re-create it with 
the new function signature afterwards. A note in the release notes and a 
check in pg_upgrade, with instructions to drop and recreate the 
conversion, are probably enough.


- Heikki




Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-22 Thread John Naylor
On Wed, Dec 16, 2020 at 8:18 AM Heikki Linnakangas  wrote:
>
> Currently, COPY FROM parses the input one line at a time. Each line is
> converted to the database encoding separately, or if the file encoding
> matches the database encoding, we just check that the input is valid for
> the encoding. It would be more efficient to do the encoding
> conversion/verification in larger chunks. At least potentially; the
> current conversion/verification implementations work one byte at a time so
> it doesn't matter too much, but there are faster algorithms out there
> that use SIMD instructions or lookup tables that benefit from larger
> inputs.

Hi Heikki,

This is great news. I've seen examples of such algorithms and that'd be
nice to have. I haven't studied the patch in detail, but it looks fine on
the whole.
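
For anyone skimming the thread later, the chunking idea in the quoted
paragraph boils down to something like the standalone sketch below. This is
purely illustrative C, not the patch's code (the server goes through
pg_verify_mbstr() and the conversion routines instead, and the helper names
here are made up); the key point is that a multibyte sequence split across a
chunk boundary is carried over to the next chunk rather than treated as an
error:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the length of the longest prefix of buf that consists of complete,
 * structurally valid UTF-8 sequences.  (Overlong forms and surrogates are
 * not rejected here; this is only a sketch.)
 */
static size_t
utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        unsigned char c = buf[i];
        size_t      seqlen;

        if (c < 0x80)
            seqlen = 1;
        else if ((c & 0xE0) == 0xC0)
            seqlen = 2;
        else if ((c & 0xF0) == 0xE0)
            seqlen = 3;
        else if ((c & 0xF8) == 0xF0)
            seqlen = 4;
        else
            return i;           /* invalid lead byte */

        if (i + seqlen > len)
            return i;           /* sequence is split at the chunk boundary */

        for (size_t j = 1; j < seqlen; j++)
        {
            if ((buf[i + j] & 0xC0) != 0x80)
                return i;       /* bad continuation byte */
        }
        i += seqlen;
    }
    return i;
}

/*
 * Validate a whole file in 64 kB chunks, carrying any split trailing
 * sequence over to the next chunk instead of validating line by line.
 */
static bool
verify_file_in_chunks(FILE *fp)
{
    unsigned char buf[65536];
    size_t      carry = 0;      /* bytes of a split sequence kept from before */

    for (;;)
    {
        size_t      nread = fread(buf + carry, 1, sizeof(buf) - carry, fp);
        size_t      avail = carry + nread;
        size_t      good;

        if (avail == 0)
            return true;        /* clean end of input */

        good = utf8_complete_prefix(buf, avail);

        /*
         * A split sequence can leave at most 3 pending bytes; more than
         * that, or anything pending at end of input, is genuinely invalid.
         */
        if (good < avail && (nread == 0 || avail - good > 3))
            return false;

        carry = avail - good;
        memmove(buf, buf + good, carry);    /* move the tail to the front */
    }
}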

In 0004, it seems you have some doubts about upgrade compatibility. Is that
because user-defined conversions would no longer have the right signature?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-17 Thread Bruce Momjian
On Wed, Dec 16, 2020 at 02:17:58PM +0200, Heikki Linnakangas wrote:
> I've been looking at the COPY FROM parsing code, trying to refactor it so
> that the parallel COPY would be easier to implement. I haven't touched
> parallelism itself, just looking for ways to smoothen the way. And for ways
> to speed up COPY in general.

Yes, this makes a lot of sense.  Glad you are looking into this.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee