Since you have lots of data, you can use parallel loading.

Split your data into several files and then do:

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this: you don't need crash recovery, since if something blows up you can just COPY the file again... and it will be much faster, because no WAL is written for temporary tables.
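
Something like this, as a minimal sketch of a single loader session in Python with psycopg2 (untested; the connection string, column layout, and file path are just placeholders, adapt them to your schema):

import psycopg2

DSN = "dbname=mydb user=me"     # hypothetical connection string
CHUNK = "/data/chunk1.csv"      # one of the split files

conn = psycopg2.connect(DSN)
cur = conn.cursor()

# Temp tables are session-local and generate no WAL: if the session
# dies, the table simply vanishes and you just re-run the COPY.
cur.execute("CREATE TEMPORARY TABLE loader1 (id bigint, payload text)")

with open(CHUNK) as f:
    cur.copy_expert("COPY loader1 FROM STDIN WITH (FORMAT csv)", f)

conn.commit()
# NB: don't close the connection yet -- the temp table (and your data)
# exists only inside this session.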

If your disk is fast, COPY is CPU-bound, so if you run one COPY process per core and avoid writing WAL, it will scale.
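
The fan-out could look like this, assuming the dump is already split into one CSV per core (file names, pool size, and schema are again made up):

import multiprocessing
import psycopg2

DSN = "dbname=mydb user=me"
CHUNKS = ["/data/chunk1.csv", "/data/chunk2.csv",
          "/data/chunk3.csv", "/data/chunk4.csv"]   # one file per core

def copy_chunk(path):
    # One backend per worker: the COPYs run on separate cores, and
    # each worker's temp table stays WAL-free.
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("CREATE TEMPORARY TABLE loader1 (id bigint, payload text)")
    with open(path) as f:
        cur.copy_expert("COPY loader1 FROM STDIN WITH (FORMAT csv)", f)
    # The dedup INSERT (sketched further down) must also run here, in
    # this same session: a temp table is invisible to other backends.
    conn.commit()
    conn.close()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=len(CHUNKS)) as pool:
        pool.map(copy_chunk, CHUNKS)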

This doesn't solve the other half of your problem (removing the duplicates), which isn't easy to parallelize, but it will make the COPY part a lot faster.

Note that you can have one core working on the INSERT / duplicate removal while the others are busy with COPY and filling temp tables, so if you pipeline it you could save some time.
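
As a sketch of that final step (target table name and dedup key are assumptions): since a temp table is visible only to the session that created it, each worker has to push its own rows into the real table before disconnecting, e.g. by calling something like this at the end of copy_chunk(), before conn.close():

def flush_dedup(conn):
    # Must run on the same connection that did the COPY: the temp
    # table loader1 is invisible to every other backend.
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO target (id, payload)
        SELECT DISTINCT ON (id) id, payload
        FROM loader1 AS l
        WHERE NOT EXISTS (SELECT 1 FROM target AS t WHERE t.id = l.id)
    """)
    conn.commit()

Two caveats: NOT EXISTS won't see rows another session has inserted but not yet committed, so with concurrent workers a unique index on target(id) is the real guarantee; and this INSERT is the part that does write WAL, which is why it tends to be the serial bottleneck the pipelining works around.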

Does your data contain a lot of duplicates, or are they rare? What percentage?
