David, I did something like this:

    psql -f /dev/fd/3 3<<IN1 & psql -f /dev/fd/4 4<<IN2 & ...
    INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id
        WHERE a.id < <0.25 quantile>;
    IN1
    INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id
        WHERE a.id >= <0.25 quantile> AND a.id < <0.5 quantile>;
    IN2
    ...
    IN3
    ...
    IN4
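For reference, here is a minimal, runnable sketch of that pattern. `cat` stands in for `psql -f` so the file-descriptor/heredoc mechanics can be checked without a database, and the WHERE boundaries are illustrative placeholders rather than the quantiles actually computed:

```shell
# Each background job reads its own SQL from a dedicated file descriptor:
# the heredoc is attached to fd 3 (resp. 4), and the command reopens it
# via /dev/fd/N. `cat` is a stand-in here for `psql -f`.
cat /dev/fd/3 3<<'SQL1' > part1.out &
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id < 4000000;
SQL1

cat /dev/fd/4 4<<'SQL2' > part2.out &
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id >= 4000000;
SQL2

wait                     # block until both background jobs finish
cat part1.out part2.out  # each job saw only its own statement
```

With psql substituted back in, each heredoc carries one non-overlapping INSERT ... SELECT slice, and the `&`/`wait` pair is what makes them run concurrently.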
And the quantiles were computed with: SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY id) FROM a, and then once more for the lower and upper halves. But unfortunately, I got only about a 16% improvement over the non-parallelized version of INSERT INTO .. SELECT ..

Marc, I am using Postgres 9.4. I didn't benchmark, but intuitively the modulo operator would force traversing every record in table "a" 4 times, as it can't use an index.

Julien, my concern was why option 3 (with parallel) is not the fastest. And now, even with a parallelized INSERT INTO .. SELECT, it's not the fastest. I can't really use an UNLOGGED table in this case. The following document (section 14.4.7) summarises why CREATE TABLE AS .. is the fastest:
http://www.postgresql.org/docs/current/static/populate.html#POPULATE-PITR
Basically, CREATE TABLE AS .. just doesn't write to the WAL if wal_level is minimal, and hence cuts I/O roughly in half.

On Tue, Jul 14, 2015 at 1:42 PM David Rowley <david.row...@2ndquadrant.com> wrote:

> On 14 July 2015 at 21:12, Igor Stassiy <istas...@gmail.com> wrote:
>
>> Hello,
>>
>> I am benchmarking different ways of putting data into a table on table
>> creation:
>>
>> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
>> 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
>> 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
>>    parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
>>
>> (the parallel command is available as part of the parallel deb package in
>> Ubuntu, for example; it splits stdin by the newline character and feeds it
>> to the corresponding command)
>>
>> Both tables a and b have ~16M records, and one of the columns in a is
>> geometry (ranging from several KB to several MB in size). Columns in b are
>> mostly integers.
>>
>> The machine that I am running these commands on has the following
>> parameters:
>>
>> default_statistics_target = 50 # pgtune wizard 2012-06-06
>> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
>> constraint_exclusion = on # pgtune wizard 2012-06-06
>> checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
>> effective_cache_size = 48GB # pgtune wizard 2012-06-06
>> work_mem = 80MB # pgtune wizard 2012-06-06
>> wal_buffers = 8MB # pgtune wizard 2012-06-06
>> checkpoint_segments = 16 # pgtune wizard 2012-06-06
>> shared_buffers = 16GB # pgtune wizard 2012-06-06
>> max_connections = 400 # pgtune wizard 2012-06-06
>>
>> One would expect the 3rd option to be faster than 1 and 2; however, 2
>> outperforms both by a large margin (sometimes 2x). This is especially
>> surprising taking into account that COPY doesn't acquire a global lock on
>> the table, only a RowExclusiveLock
>> (according to
>> http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
>>
>> So is option 2 a winner by design? Could you please suggest other
>> alternatives to try (if there are any)? And what might be the reason that
>> 3 is not outperforming the other 2?
>
> I would imagine that the calling of the output function to translate
> each value's internal representation to its user-visible/external
> representation, plus all the overhead of sending results to the client,
> would be a likely candidate for the slowdown. In any case, 3 would only
> be as fast as the query generating the output. With 1 and 2, the tuple
> representation of each record stays in the internal format.
>
> If you have some logical way to break the query down into parts, then
> maybe that would be a place to look.
> For example:
>
> INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id < 8000000;
> INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id >= 8000000;
>
> Of course, you'd need to be very careful to ensure that the results of
> each SELECT never overlap.
> It would be nice to invent some better way than this that divided the
> workload evenly even as the tables grow.
>
> Then you could run these concurrently.
>
> Regards
>
> David Rowley
>
> --
> David Rowley   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services