Hi,

On Wednesday 12 December 2007, Josh Berkus wrote:
> > I'm curious what you feel is missing that pgloader doesn't fill that
> > requirement:  http://pgfoundry.org/projects/pgloader/
>
> Because pgloader is implemented in middleware, it carries a very high
> overhead if you have bad rows.  As little as 1% bad rows will slow down
> loading by 20% due to retries.

Not that much, in fact, I'd say.
pgloader lets its user configure how large a COPY buffer to use (a global 
parameter as of now; it could easily become a per-section configuration knob, 
I just haven't seen any need for that yet).
It's the 'copy_every' parameter as seen on the man page here:
  http://pgloader.projects.postgresql.org/#toc4
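For reference, a configuration using it could look something like this (only a 
rough sketch to give the idea; apart from copy_every the option names here are 
illustrative, the man page above has the authoritative list):

  [pgsql]
  host       = localhost
  base       = mydb
  user       = dim
  copy_every = 10000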

pgloader prepares an in-memory buffer of copy_every tuples to hand to COPY, 
and in case of error splits it and retries each half. It's the classic 
dichotomy approach, from the initial implementation by Jan Wieck.

So you can easily balance the error recovery costs against the COPY bulk size.
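To make the retry logic concrete, here's a minimal sketch of the dichotomy in 
Python (not the actual pgloader code; copy_buffer here stands in for whatever 
sends one batch of tuples through COPY):

  def copy_with_retry(copy_buffer, tuples, reject):
      # Try to COPY a whole batch; on failure, bisect and retry each
      # half.  copy_buffer(tuples) is assumed to raise an exception
      # whenever the server refuses the batch.
      try:
          copy_buffer(tuples)
      except Exception:
          if len(tuples) == 1:
              # a single tuple still failing is a genuinely bad row
              reject(tuples[0])
              return
          middle = len(tuples) // 2
          copy_with_retry(copy_buffer, tuples[:middle], reject)
          copy_with_retry(copy_buffer, tuples[middle:], reject)

With copy_every = 10000, a single bad row costs on the order of 2 * log2(10000), 
that is roughly a couple dozen extra COPY calls on ever smaller batches, rather 
than a row-by-row re-send of the whole buffer; that's why the overhead stays 
reasonable.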

Note also that the overall loading time with pgloader does not simply track 
the COPY buffer size: the optimal choice depends on the dataset --- and on the 
data massaging pgloader has to do on it --- and I've had the best results with 
buffers of 10000 to 15000 tuples so far.

FYI, now that the pgloader topic is on the table, the next items I plan to 
develop for it are configurable behaviour on error tuples (e.g. loading them 
into another table on a primary key error) and some limited DDL-partitioning 
support.

I'm playing with the idea of having pgloader read some partitioning schemes 
(parsing the CHECK constraints on inherited tables) and load directly into 
the right partitions.
That would of course only be done when configured this way, and if the 
constraints are misread it would merely result in a lot more rejected rows 
than expected; you could still retry using your insert trigger instead of 
pgloader's buggy smartness.
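
To illustrate, the CHECK constraints of the children are easy enough to fetch 
from the catalogs; something along these lines (a rough sketch only, using 
psycopg2 for the example, and partition_checks is a hypothetical helper, not 
existing pgloader code):

  import psycopg2

  def partition_checks(dsn, parent):
      # Return {child table name: CHECK constraint definition} for the
      # inheritance children of `parent`.
      conn = psycopg2.connect(dsn)
      try:
          cur = conn.cursor()
          cur.execute("""
              SELECT c.relname, pg_get_constraintdef(r.oid)
                FROM pg_inherits i
                JOIN pg_class c ON c.oid = i.inhrelid
                JOIN pg_constraint r ON r.conrelid = i.inhrelid
               WHERE i.inhparent = %s::regclass
                 AND r.contype = 'c'
          """, (parent,))
          return dict(cur.fetchall())
      finally:
          conn.close()

The idea would then be to parse those expressions and route each input row to 
the matching partition's COPY buffer; misreading a constraint only sends rows 
to the wrong partition, where the CHECK itself rejects them, hence the "more 
rejected rows, nothing worse" behaviour above.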

Comments welcome, regards,
-- 
dim
