Well, the job is done. The Talend component is working (
https://github.com/parisni/talend/tree/master/tPostgresqlOutputBulkAPHP).
It creates a file (binary or CSV) locally, then uses COPY with
"FROM STDIN", which does not require pushing the file to the remote
database server.
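For context, the difference between the two COPY variants can be sketched as follows (the table and path names are just placeholders, not from the component):

```java
public class CopyModes {

    // Server-side COPY: the path is resolved ON the database server, so a
    // locally generated file would first have to be pushed there (and the
    // role needs server-side file read privileges).
    static String serverSideCopy(String table, String path) {
        return "COPY " + table + " FROM '" + path + "' WITH (FORMAT binary)";
    }

    // COPY ... FROM STDIN: the client streams the bytes over the existing
    // connection, so the file never touches the server's filesystem. With
    // the PostgreSQL JDBC driver this statement is driven through
    // org.postgresql.copy.CopyManager#copyIn(sql, inputStream).
    static String stdinCopy(String table) {
        return "COPY " + table + " FROM STDIN WITH (FORMAT binary)";
    }
}
```

The STDIN form is what makes the local-file approach work: only the connection is needed, no server filesystem access.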

I ran a small comparison test:

column1: character varying
column2: integer
column3: boolean
10 000 000 tuples

Type   | Create file time | Bulk load time | Total time | File size
Binary | 11137 ms         | 21661 ms       | 32798 ms   | 250 MB
CSV    | 23226 ms         | 22192 ms       | 45418 ms   | 179 MB


Binary format is definitely faster and safer:
- faster, because writing binary is faster than writing a text file. I guess
the bulk load time bottleneck is the network, so that part is equivalent for
both formats. When the file is already on the database server, loading the
binary version is twice as fast.
- safer, because the format (each value is preceded by its length) is more
robust than CSV with its separators (which can also appear in the text
itself).
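To illustrate the length-prefixed layout, here is a minimal sketch of the COPY BINARY framing in Java (class name and sample values are mine, not taken from the component):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the COPY BINARY framing: a fixed header, then per
// tuple a 16-bit field count followed by each value as a 32-bit length
// plus the raw big-endian bytes, and a -1 trailer.
public class PgBinaryCopySketch {

    // 11-byte signature: "PGCOPY\n" 0xFF "\r\n" 0x00
    static final byte[] SIGNATURE = {
        'P', 'G', 'C', 'O', 'P', 'Y', '\n', (byte) 0xFF, '\r', '\n', 0
    };

    static void writeHeader(DataOutputStream out) throws IOException {
        out.write(SIGNATURE);
        out.writeInt(0); // flags (no OIDs)
        out.writeInt(0); // header extension length
    }

    // One (varchar, integer, boolean) tuple, like the test table above.
    static void writeRow(DataOutputStream out, String s, int i, boolean b)
            throws IOException {
        out.writeShort(3);       // field count for this tuple
        byte[] sb = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(sb.length); // each value preceded by its length
        out.write(sb);
        out.writeInt(4);         // int4 is always 4 bytes, big-endian
        out.writeInt(i);
        out.writeInt(1);         // bool is 1 byte
        out.writeByte(b ? 1 : 0);
    }

    static void writeTrailer(DataOutputStream out) throws IOException {
        out.writeShort(-1);      // end-of-data marker
    }

    public static byte[] sample() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeHeader(out);
            // The text value contains a comma and a newline: the length
            // prefix makes separators irrelevant, unlike in CSV.
            writeRow(out, "a,b\nc", 42, true);
            writeTrailer(out);
            return buf.toByteArray();
        } catch (IOException e) { // cannot happen on an in-memory buffer
            throw new RuntimeException(e);
        }
    }
}
```

A NULL would be encoded as length -1 with no data bytes; the resulting stream is exactly what COPY ... FROM STDIN WITH (FORMAT binary) consumes.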


The code is based on:

-
https://github.com/uwescience/myria/blob/master/src/edu/washington/escience/myria/PostgresBinaryTupleWriter.java
-
https://github.com/bytefish/PgBulkInsert/tree/master/PgBulkInsert/src/main/de/bytefish/pgbulkinsert/pgsql/handlers

Thanks,

2016-05-10 15:08 GMT+02:00 Cat <c...@zip.com.au>:

> On Tue, May 10, 2016 at 03:00:55PM +0200, Nicolas Paris wrote:
> > > The way I want is :
> > > csv -> binary -> postgresql
> > >
> > > Is this just to be quicker or are you going to add some business logic
> > > while converting CSV data?
> > > As you mentioned ETL, I assume the second, as I don't think that
> > > converting CSV to binary and then loading it to PostgreSQL will be more
> > > convenient than loading directly from CSV... as quicker as it can be,
> you
> > > have anyway to load data from CSV.
> > >
> >
> > Right, ETL process means huge business logic.
> > get the data (csv or other) -> transform it -> produce a binary -> copy
> > from binary from stdin
> >
> > Producing 100 GB CSVs is a waste of time.
>
> Ah. You need to fiddle with the data. Then you need to weigh the pros of
> something agnostic to Postgres's internals to something that needs to be
> aware of them.
>
> You will need to delve into the source code for data types more complex
> than INTEGER, TEXT and BYTEA (which was the majority of my data when I
> was just looking into it).
>
> --
>   "A search of his car uncovered pornography, a homemade sex aid, women's
>   stockings and a Jack Russell terrier."
>     -
> http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-1111118083480
>
