Was this implemented?  Is it a TODO?

---------------------------------------------------------------------------
Heikki Linnakangas wrote:
> I'm reviving the effort I started a while back to make COPY faster:
>
> http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
> http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php
>
> The patch I now have is based on using memchr() to search for end-of-line.
> In a nutshell:
>
> * we perform possible encoding conversion early, one input block at a
> time, rather than after splitting the input into lines. This allows us
> to assume in the later stages that the data is in server encoding,
> allowing us to search for the '\n' byte without worrying about
> multi-byte characters.
>
> * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr()
> to find the next NL/CR character. This is where the speedup comes from.
> Unfortunately we can't do that in the CSV codepath, because newlines can
> be embedded in quoted fields, so that's unchanged.
>
> These changes seem to give an overall speedup of between 0-10%,
> depending on the shape of the table. I tested various tables from the
> TPC-H schema, and a narrow table consisting of just one short text column.
>
> I can't think of a case where these changes would be a net loss in
> performance, and it didn't perform worse on any of the cases I tested
> either.
>
> There's a small fly in the ointment: the patch won't recognize backslash
> followed by a linefeed as an escaped linefeed. I think we should simply
> drop support for that. The docs already say:
>
> > It is strongly recommended that applications generating COPY data convert
> > data newlines and carriage returns to the \n and \r sequences respectively.
> > At present it is possible to represent a data carriage return by a
> > backslash and carriage return, and to represent a data newline by a
> > backslash and newline. However, these representations might not be accepted
> > in future releases. They are also highly vulnerable to corruption if the
> > COPY file is transferred across different machines (for example, from Unix
> > to Windows or vice versa).
>
> I vaguely recall that we discussed this some time ago already and agreed
> that we can drop it if it makes life easier.
>
> This patch is in pretty good shape, however it needs to be tested with
> different exotic input formats. Also, the loop in CopyReadLineText could
> probably be cleaned up a bit, some of the uglifications that were done
> for performance reasons in the old code are no longer necessary, as
> memchr() is doing the heavy-lifting and the loop only iterates 1-2 times
> per line in typical cases.
>
>
> It's not strictly necessary, but how about dropping support for the old
> COPY protocol, and the EOF marker \. while we're at it? It would allow
> us to drop some code, making the remaining code simpler, and reduce the
> testing effort. Thoughts on that?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com

[ Attachment, skipping... ]

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
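
For readers unfamiliar with the technique in the quoted message, here is a minimal sketch of memchr()-based end-of-line scanning. It is not the patch and not PostgreSQL code: find_eol() and count_lines() are hypothetical names made up for illustration, and the sketch assumes the buffer has already been converted to an encoding in which the '\n' and '\r' bytes cannot occur inside a multi-byte character. Like the patch as described, it does not treat a backslash followed by a newline as an escaped newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): return a pointer to the first
 * '\n' or '\r' in buf[0..len), or NULL if the buffer contains neither.
 */
static const char *
find_eol(const char *buf, size_t len)
{
    const char *nl;
    const char *cr;

    assert(buf != NULL);

    nl = memchr(buf, '\n', len);
    /* If a '\n' was found, only the bytes in front of it can hold an earlier '\r'. */
    cr = memchr(buf, '\r', nl ? (size_t) (nl - buf) : len);

    return cr ? cr : nl;
}

/*
 * Hypothetical helper (illustration only): count complete lines in
 * buf[0..len), treating "\r\n" as a single terminator.  A trailing
 * fragment without a terminator is not counted; a real reader would
 * fetch more input at that point.
 */
static size_t
count_lines(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;
    size_t      n = 0;

    while (p < end)
    {
        const char *eol = find_eol(p, (size_t) (end - p));

        if (eol == NULL)
            break;              /* incomplete last line */
        n++;
        p = eol + 1;
        if (*eol == '\r' && p < end && *p == '\n')
            p++;                /* swallow the '\n' of a "\r\n" pair */
    }
    return n;
}
```

The outer loop runs once per line rather than once per byte, which matches the remark above that the loop "only iterates 1-2 times per line in typical cases"; the per-byte work is left to memchr().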