Re: [HACKERS] COPY speedup

Pierre Frédéric Caillaud Wed, 12 Aug 2009 14:27:04 -0700

>> We don't touch datatype APIs
>> lightly, because it affects too much code.
>>
>>                        regards, tom lane
>
>        I definitely agree with that.


Actually, let me clarify:

When I modified the datatype API, I was feeling uneasy, like "I shouldn't really touch this".

But when I see a big red button, I just press it to see what happens.

Ugly hacks are useful to know how fast the thing can go; then the interesting part is to reimplement it cleanly, trying to reach the same performance...

Is there any way to do this that is not as invasive?


Maybe add new methods, fastrecv/fastsend etc.  Types that don't
implement them would simply use the slow methods, maintaining
backwards compatibility.


Well, this would certainly work, and it would be even faster.

I considered doing it like this, but it is a lot more work: adding entries to the system catalogs, creating all the new functions, deciding what to do with getTypeBinaryOutputInfo (since there would be 2 variants), etc. Types that don't support the new functions would need some form of indirection to call the old functions instead. In a word, doable, but kludgy, and I would need help from a system catalog expert. Also, on upgrade, would information about the new functions have to be inserted into the system catalogs? (I don't know how this process works.) If you want to help...
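To make the indirection concrete, here is a minimal standalone sketch of the fallback idea. It is not PostgreSQL code: TypeIO, OutBuf, send_value and the int32_* functions are all hypothetical names made up for illustration, malloc() stands in for palloc(), and byte-order conversion is ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output buffer; stands in for COPY's output buffer. */
typedef struct OutBuf {
    char   data[256];
    size_t len;
} OutBuf;

/* Slow path: returns a freshly allocated copy of the value's bytes. */
typedef char *(*slow_send_fn)(const void *value, size_t *out_len);

/* Fast path: appends the bytes directly to the caller's buffer. */
typedef void (*fast_send_fn)(const void *value, OutBuf *out);

/* Hypothetical per-type entry; fastsend is NULL if not implemented. */
typedef struct TypeIO {
    slow_send_fn send;
    fast_send_fn fastsend;
} TypeIO;

static void outbuf_append(OutBuf *out, const void *src, size_t n)
{
    if (out->len + n > sizeof(out->data))
        abort();                /* sketch only: the buffer never grows */
    memcpy(out->data + out->len, src, n);
    out->len += n;
}

static char *int32_send(const void *value, size_t *out_len)
{
    char *copy = malloc(sizeof(int32_t));

    if (copy == NULL)
        abort();
    memcpy(copy, value, sizeof(int32_t));
    *out_len = sizeof(int32_t);
    return copy;
}

static void int32_fastsend(const void *value, OutBuf *out)
{
    outbuf_append(out, value, sizeof(int32_t));
}

/* Use the fast method when the type has one, else fall back. */
static void send_value(const TypeIO *io, const void *value, OutBuf *out)
{
    if (io->fastsend != NULL)
        io->fastsend(value, out);
    else
    {
        size_t n;
        char  *tmp = io->send(value, &n);

        outbuf_append(out, tmp, n);
        free(tmp);
    }
}
```

Both paths put the same bytes in the buffer; the fast one just skips the allocate/copy/free round trip. In the real thing the lookup would have to come from the system catalogs, which is the part I would need help with.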

The way I see COPY BINARY is that its speed should be really something massive.
COPY foo FROM ... BINARY should be as fast as CREATE TABLE foo AS SELECT * FROM bar (which is extremely fast).

COPY foo TO ... BINARY should be as fast as the disk allows.

Why else would anyone use a binary format if it's slower than portable text?


So, there are two other ways (besides fastsend/fastrecv) that I can see :

1- The way I implemented it

I'm not saying it's the definitive solution: just a simple way to see how much overhead is introduced by the current API, returning BYTEAs and palloc()'ing every field of every row. I think this approach gave two interesting answers:

- once COPY's output buffer has been made more efficient, with things like removing fwrite() for every row etc (see patch), all that remains is the API overhead, which is very significant for binary mode, since I could get massive speedups (3-4x!) by bypassing it. The table scan itself is super-fast.

- however, for text mode, it is not so significant, as the speedups from bypassing the API were roughly 0-20%, since most of the time is spent in datum to text conversions.

Now, I don't think the hack is so ugly. It does make me feel a bit uneasy, but:

- The context field in the fcinfo struct is there for a reason, so I used it.
- I checked every place in the code where SendFunctionCall() appears (there are quite few, actually).
- The context field is never used for SendFuncs or ReceiveFuncs (it is always set to NULL).
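To illustrate the general shape of the trick (this is not the patch itself), here is a standalone sketch. MockCallInfo and MockBuf are hypothetical stand-ins for the real fcinfo struct and COPY's output buffer, and malloc() stands in for palloc(). A NULL context means an ordinary caller, who gets the usual freshly allocated result; a non-NULL context is assumed to point at a buffer the function may write into directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output buffer handed over through the context field. */
typedef struct MockBuf {
    char   data[64];
    size_t len;
} MockBuf;

/* Hypothetical, heavily reduced stand-in for the fcinfo struct. */
typedef struct MockCallInfo {
    void       *context;   /* NULL for ordinary callers */
    const void *arg;       /* the value to send */
} MockCallInfo;

/*
 * Send function for a 4-byte integer (byte order ignored here).
 * With a context: append to the caller's buffer and return NULL.
 * Without one: behave as before and return an allocated copy.
 */
static void *mock_int32_send(MockCallInfo *fcinfo)
{
    void *result;

    if (fcinfo->context != NULL)
    {
        MockBuf *out = fcinfo->context;

        if (out->len + sizeof(int32_t) > sizeof(out->data))
            abort();            /* sketch only: the buffer never grows */
        memcpy(out->data + out->len, fcinfo->arg, sizeof(int32_t));
        out->len += sizeof(int32_t);
        return NULL;
    }

    result = malloc(sizeof(int32_t));
    if (result == NULL)
        abort();
    memcpy(result, fcinfo->arg, sizeof(int32_t));
    return result;
}
```

Existing callers keep passing NULL and see no change, which is why it matters that the context field is never set for send and receive functions today.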


2- Another way

- palloc() could be made faster for short blocks
- a generous sprinkling of inline's
- a few modifications to pq_send*
- a few modifications to StringInfo
- bits of my previous patch in copy.c (like not fwriting every row)

This would be less fast, but you'd still get a substantial speedup.
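As an illustration of the "not fwriting every row" item only, here is a standalone sketch that accumulates rows in a fixed buffer and calls fwrite() once per full buffer. RowWriter, rw_put_row, rw_flush and the 8192-byte size are hypothetical choices for the example, not what copy.c does.

```c
#include <stdio.h>
#include <string.h>

#define ROWBUF_SIZE 8192

/* Hypothetical writer that batches many rows into one fwrite(). */
typedef struct RowWriter {
    FILE         *fp;
    char          buf[ROWBUF_SIZE];
    size_t        used;
    unsigned long fwrite_calls;   /* counts fwrite() calls made */
} RowWriter;

/* Write out whatever is pending; returns 0 on success, -1 on error. */
static int rw_flush(RowWriter *w)
{
    if (w->used > 0)
    {
        if (fwrite(w->buf, 1, w->used, w->fp) != w->used)
            return -1;
        w->fwrite_calls++;
        w->used = 0;
    }
    return 0;
}

/* Append one row, flushing first if it would not fit. */
static int rw_put_row(RowWriter *w, const char *row, size_t len)
{
    if (len > ROWBUF_SIZE)
    {
        /* Oversized row: flush pending data, then write it directly. */
        if (rw_flush(w) != 0)
            return -1;
        if (fwrite(row, 1, len, w->fp) != len)
            return -1;
        w->fwrite_calls++;
        return 0;
    }
    if (w->used + len > ROWBUF_SIZE && rw_flush(w) != 0)
        return -1;
    memcpy(w->buf + w->used, row, len);
    w->used += len;
    return 0;
}
```

The caller has to call rw_flush() once more at the end of the copy, otherwise the last partial buffer is lost.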

In conclusion, I think:

- Alvaro's fastsend/fastrecv provides the cleanest solution
- Method 2 is the easiest, but slower
- Method 1 is an intermediate, but the use of the context field is a touchy subject.

Also, I will work on COPY FROM ... BINARY. I should be able to make it also much faster. This will be useful for big imports.


Regards,
Pierre

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
