Re: [HACKERS] CopyReadLineText optimization revisited

Bruce Momjian Fri, 18 Feb 2011 18:39:27 -0800

Was this implemented?  Is it a TODO?

---------------------------------------------------------------------------


Heikki Linnakangas wrote:
> I'm reviving the effort I started a while back to make COPY faster:
> 
> http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
> http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php
> 
> The patch I now have is based on using memchr() to search end-of-line. 
> In a nutshell:
> 
> * we perform possible encoding conversion early, one input block at a 
> time, rather than after splitting the input into lines. This allows us 
> to assume in the later stages that the data is in server encoding, 
> allowing us to search for the '\n' byte without worrying about 
> multi-byte characters.
> 
> * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr() 
> to find the next NL/CR character. This is where the speedup comes from. 
> Unfortunately we can't do that in the CSV codepath, because newlines can 
> be embedded in quoted, so that's unchanged.
> 
> These changes seem to give an overall speedup of between 0-10%, 
> depending on the shape of the table. I tested various tables from the 
> TPC-H schema, and a narrow table consisting of just one short text column.
> 
> I can't think of a case where these changes would be a net loss in 
> performance, and it didn't perform worse on any of the cases I tested 
> either.
> 
> There's a small fly in the ointment: the patch won't recognize backslash 
> followed by a linefeed as an escaped linefeed. I think we should simply 
> drop support for that. The docs already say:
> 
> > It is strongly recommended that applications generating COPY data convert 
> > data newlines and carriage returns to the \n and \r sequences respectively. 
> > At present it is possible to represent a data carriage return by a 
> > backslash and carriage return, and to represent a data newline by a 
> > backslash and newline. However, these representations might not be accepted 
> > in future releases. They are also highly vulnerable to corruption if the 
> > COPY file is transferred across different machines (for example, from Unix 
> > to Windows or vice versa).
> 
> I vaguely recall that we discussed this some time ago already and agreed 
> that we can drop it if it makes life easier.
> 
> This patch is in pretty good shape, however it needs to be tested with 
> different exotic input formats. Also, the loop in CopyReadLineText could 
> probaby be cleaned up a bit, some of the uglifications that were done 
> for performance reasons in the old code are no longer necessary, as 
> memchr() is doing the heavy-lifting and the loop only iterates 1-2 times 
> per line in typical cases.
> 
> 
> It's not strictly necessary, but how about dropping support for the old 
> COPY protocol, and the EOF marker \. while we're at it? It would allow 
> us to drop some code, making the remaining code simpler, and reduce the 
> testing effort. Thoughts on that?
> 
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CopyReadLineText optimization revisited

Reply via email to