Re: [GENERAL] Creating table with data from a join

2015-07-15 Thread Marc Mamin
> Marc, I am using postgres 9.4. I didn't benchmark, but intuitively the modulo 
> operator will force traversing every record in table "a" 4 times, as it can't 
> use an index.

Not necessarily: seq scans can be synchronized:

"This allows sequential scans of large tables to synchronize with each other, 
so that concurrent scans read the same block at about the same time and hence 
share the I/O workload. When this is enabled, a scan might start in the middle 
of the table and then "wrap around" the end to cover all rows, so as to 
synchronize with the activity of scans already in progress. This can result in 
unpredictable changes in the row ordering returned by queries that have no 
ORDER BY clause."

Marc

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Creating table with data from a join

2015-07-15 Thread Igor Stassiy
David, I did something like this:

psql -f /dev/fd/3 3<< IN1 & psql -f /dev/fd/4 4<< IN2 ...
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id < [0.25th
quantile]
IN1
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id < [0.5th
quantile] AND a.id >= [0.25th quantile]
IN2
...
IN3
...
IN4

And quantiles were computed using the function:

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY id) FROM a, and then
once more for the lower and upper halves. But unfortunately, I got only
about a 16% improvement over the non-parallelized version of INSERT INTO ..
SELECT ..
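The quantile-based split above can be illustrated in Python (a sketch only, not the SQL actually run; `statistics.quantiles` with `n=4` plays the role of the three `percentile_cont` calls, and the id values are hypothetical):

```python
import statistics

# Hypothetical id values standing in for table "a"'s id column.
ids = list(range(1, 101))

# Three cut points dividing ids into four roughly equal ranges,
# analogous to percentile_cont(0.25/0.5/0.75) WITHIN GROUP (ORDER BY id).
q1, q2, q3 = statistics.quantiles(ids, n=4)

# Disjoint WHERE predicates, one per parallel INSERT .. SELECT.
parts = [
    [i for i in ids if i < q1],
    [i for i in ids if q1 <= i < q2],
    [i for i in ids if q2 <= i < q3],
    [i for i in ids if i >= q3],
]

# Every row lands in exactly one partition: no overlap, no gaps.
assert sum(len(p) for p in parts) == len(ids)
assert sorted(i for p in parts for i in p) == ids
```

The half-open ranges (`>=` on the lower bound, `<` on the upper) are what keep the four INSERTs from ever duplicating a row.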

Marc, I am using postgres 9.4. I didn't benchmark, but intuitively the
modulo operator will force traversing every record in table "a" 4 times, as
it can't use an index.

Julien, my concern was why option 3 (with parallel) is not the fastest.
And now, even with a parallel INSERT INTO .. SELECT, it's not the fastest.
I can't really use an UNLOGGED table in this case.

The following document (section 14.4.7) summarises why CREATE TABLE AS ..
is the fastest:
http://www.postgresql.org/docs/current/static/populate.html#POPULATE-PITR

Basically, CREATE TABLE AS .. just doesn't write to the WAL when wal_level
is minimal, and hence cuts I/O roughly in half.

On Tue, Jul 14, 2015 at 1:42 PM David Rowley 
wrote:

> On 14 July 2015 at 21:12, Igor Stassiy  wrote:
>
>> Hello,
>>
>> I am benchmarking different ways of putting data into table on table
>> creation:
>>
>> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
>> 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
>> 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
>> parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
>>
>> (the parallel command is available as part of parallel deb package in
>> Ubuntu for example, it splits the stdin by newline character and feeds it
>> to the corresponding command)
>>
>> Both tables a and b have ~16M records and one of the columns in a is
>> geometry (ranging from several KB in size to several MB). Columns in b are
>> mostly integers.
>>
>> The machine that I am running these commands on has the following
>> parameters:
>>
>> default_statistics_target = 50 # pgtune wizard 2012-06-06
>> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06 constraint_exclusion
>> = on # pgtune wizard 2012-06-06 checkpoint_completion_target = 0.9 # pgtune
>> wizard 2012-06-06 effective_cache_size = 48GB # pgtune wizard 2012-06-06
>> work_mem = 80MB # pgtune wizard 2012-06-06
>> wal_buffers = 8MB # pgtune wizard 2012-06-06
>> checkpoint_segments = 16 # pgtune wizard 2012-06-06
>> shared_buffers = 16GB # pgtune wizard 2012-06-06
>> max_connections = 400 # pgtune wizard 2012-06-06
>>
>> One would expect the 3rd option to be faster than 1 and 2, however 2
>> outperforms both by a large margin (sometimes x2). This is especially
>> surprising taking into account that COPY doesn't acquire a global lock on
>> the table, only a RowExclusiveLock
>> (according to
>> http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
>>
>> So is option 2 a winner by design? Could you please suggest other
>> alternatives to try (if there are any)? And what might be the reason that 3
>> is not outperforming the other 2?
>>
>
> I would imagine that the calling of the output function to translate each
> value's internal representation to its user-visible/external
> representation, plus all the overhead of sending results to the client,
> would be a likely candidate for the slowdown. In any case, 3 would only be
> as fast as the query generating the output. With 1 and 2, all the tuple
> representations of each record stay in the internal format.
>
> If you have some logical way to break the query down into parts, then
> maybe that would be a place to look.
> For example:
>
> INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id < 800;
> INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id >= 800;
>
> Of course, you'd need to be very careful to ensure that the results of
> each SELECT never overlap. It would be nice to invent some better way than
> this that divided the workload evenly even when the tables grow.
>
> Then you could run these concurrently.
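Running the range-split statements concurrently could look like this (a sketch; `run_insert` is a hypothetical stand-in for executing one statement on its own database connection, e.g. via psycopg2):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: a real version would open one connection per
# statement and execute the SQL there, so each INSERT gets its own backend.
def run_insert(sql):
    return sql  # placeholder: pretend the statement ran and report it

statements = [
    "INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id < 800;",
    "INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id >= 800;",
]

# One worker per non-overlapping range.
with ThreadPoolExecutor(max_workers=len(statements)) as pool:
    results = list(pool.map(run_insert, statements))

assert len(results) == len(statements)
```

Because the WHERE ranges are disjoint, the two workers never insert the same joined row twice.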
>
> Regards
>
> David Rowley
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
> 
>  PostgreSQL Development, 24x7 Support, Training & Services
>
>


Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Julien Rouhaud
On 14/07/2015 18:50, Igor Stassiy wrote:
> Julien, I would gladly provide more information, I am just not sure what
> to add.
> 

Well, was your concern about why option #2 is the quickest, or is the
runtime with option #2 still too slow for you?

> I would be willing to leave the server compromised to things like
> corruption or data loss during the time of this import, but the server
> has to be up and running before and after the import, if it is
> successful (so I can't take it down, then change some parameters and
> start it up again).

Check http://www.postgresql.org/docs/current/static/sql-createtable.html
and the "UNLOGGED" part to check if an unlogged table is suitable for you.

> 
> On Tue, Jul 14, 2015 at 6:37 PM Julien Rouhaud
> mailto:julien.rouh...@dalibo.com>> wrote:
> 
> On 14/07/2015 18:21, Igor Stassiy wrote:
> > Julien, I have the following setting for WAL level: #wal_level =
> minimal
> > (which defaults to minimal anyway)
> >
> 
> Sorry, I sent my mail too early :/
> 
> So, option #2 is winner by design. You didn't say anything about your
> needs, so it's hard to help you much more.
> 
> If you don't care about losing data on this table if your server
> crashes, you can try option #3 with an unlogged table.
> 
> 
> > On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud
> > mailto:julien.rouh...@dalibo.com>
>  >> wrote:
> >
> > On 14/07/2015 11:12, Igor Stassiy wrote:
> > > Hello,
> > >
> > > I am benchmarking different ways of putting data into table
> on table
> > > creation:
> > >
> > > 1. INSERT INTO c SELECT * FROM a JOIN b on a.id
>  
> >  = b.id  
> > > ;
> > > 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id
>  
> >  = b.id  
> > > ;
> > > 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id
>  
> >  = b.id  
> > > ) TO STDOUT" |
> > > parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM
> STDIN";
> > >
> > > (the parallel command is available as part of parallel deb
> package in
> > > Ubuntu for example, it splits the stdin by newline character
> and feeds
> > > it to the corresponding command)
> > >
> > > Both tables a and b have ~16M records and one of the columns
> in a is
> > > geometry (ranging from several KB in size to several MB).
> Columns in b
> > > are mostly integers.
> > >
> > > The machine that I am running these commands on has the
> following
> > > parameters:
> > >
> > > default_statistics_target = 50 # pgtune wizard 2012-06-06
> > > maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> > > constraint_exclusion = on # pgtune wizard 2012-06-06
> > > checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> > > effective_cache_size = 48GB # pgtune wizard 2012-06-06
> > > work_mem = 80MB # pgtune wizard 2012-06-06
> > > wal_buffers = 8MB # pgtune wizard 2012-06-06
> > > checkpoint_segments = 16 # pgtune wizard 2012-06-06
> > > shared_buffers = 16GB # pgtune wizard 2012-06-06
> > > max_connections = 400 # pgtune wizard 2012-06-06
> > >
> > > One would expect the 3rd option to be faster than 1 and 2,
> however 2
> > > outperforms both by a large margin (sometimes x2). This is
> especially
> > > surprising taking into account that COPY doesn't acquire a
> global lock
> > > on the table, only a RowExclusiveLock
> > > (according
> > > to
> >   
>  http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
> > >
> >
> > What is wal_level value? I think this is because of an
> optimisation
> > happening with wal_level = minimal:
> >
> > "In minimal level, WAL-logging of some bulk operations can be
> safely
> > skipped, which can make those operations much faster"
> >
> > see
> >   
>  http://www.postgresql.org/docs/current/static/runtime-config-wal.html
> >
> > > So is option 2 a winner by design? Could you please suggest
> other
> > > alternatives to try (if there are any)? And what might be
> the reason
> > > that 3 is not outperforming the other 2?
> > >
> > > Thank you,
> > > Igor
> > >
> > >
> >
> >
> > --
> > Julien Rouhaud
> > http://dalibo.com - http://dalibo.org
> >
> 
> 
> --
>

Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Marc Mamin

>Hello,
>
>I am benchmarking different ways of putting data into table on table creation:
>
>1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
>2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
>3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
>parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
>
>(the parallel command is available as part of parallel deb package in Ubuntu 
>for example, it splits the stdin by newline character and feeds it to the 
>corresponding command)
>
>Both tables a and b have ~16M records and one of the columns in a is geometry 
>(ranging from several KB in size to several MB). Columns in b are mostly 
>integers.
>
>The machine that I am running these commands on has the following parameters:
>
>default_statistics_target = 50 # pgtune wizard 2012-06-06 maintenance_work_mem 
>= 1GB # pgtune wizard 2012-06-06 constraint_exclusion = on # pgtune wizard 
>2012-06-06 checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06 
>effective_cache_size = 48GB # pgtune wizard 2012-06-06
>work_mem = 80MB # pgtune wizard 2012-06-06
>wal_buffers = 8MB # pgtune wizard 2012-06-06
>checkpoint_segments = 16 # pgtune wizard 2012-06-06
>shared_buffers = 16GB # pgtune wizard 2012-06-06
>max_connections = 400 # pgtune wizard 2012-06-06
>
>One would expect the 3rd option to be faster than 1 and 2, however 2 
>outperforms both by a large margin (sometimes x2). This is especially 
>surprising taking into account that COPY doesn't acquire a global lock on the 
>table, only a RowExclusiveLock

Which PG version?

I find it interesting that 2 outperforms 1.
The only explanation I can imagine is that "CREATE TABLE AS" freezes the data
on the fly, as is possible with "COPY FROM"
(http://www.postgresql.org/docs/9.4/interactive/sql-copy.html)

You may try parallel inserts without using STDIN, using modulo. Just start
these 4 queries simultaneously:
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id
WHERE a.id % 4 = [0,1,2,3]
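The modulo split can be illustrated in Python (a sketch with hypothetical id values; it also shows why, as noted later in the thread, each of the 4 queries must scan the whole table — the `id % 4` predicate cannot use a plain index on `id`):

```python
# Hypothetical ids standing in for table "a".
ids = list(range(1, 17))

# One bucket per parallel worker: WHERE a.id % 4 = k, for k in 0..3.
buckets = {k: [i for i in ids if i % 4 == k] for k in range(4)}

# The four predicates are disjoint and together cover every row...
assert sorted(i for b in buckets.values() for i in b) == ids

# ...but each predicate had to test every id: without an index on
# (id % 4), each of the 4 queries reads the whole table.
rows_examined = 4 * len(ids)
assert rows_examined == 64
```

(A synchronized sequential scan, as Marc points out later, can let the 4 concurrent scans share most of that I/O.)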

I usually avoid parallel INSERTS to avoid I/O contention and random 
distribution within the target tables.
Are you monitoring the I/O activity in your tests ?
Have you tried to use only 2 parallel processes?

regards,

Marc Mamin


>(according to 
>http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
>
>So is option 2 a winner by design? Could you please suggest other alternatives 
>to try (if there are any)? And what might be the reason that 3 is not 
>outperforming the other 2?
>
>Thank you,
>Igor
>
>
>


Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Igor Stassiy
Julien, I would gladly provide more information, I am just not sure what to
add.

I would be willing to leave the server compromised to things like
corruption or data loss during the time of this import, but the server has
to be up and running before and after the import, if it is successful (so I
can't take it down, then change some parameters and start it up again).

On Tue, Jul 14, 2015 at 6:37 PM Julien Rouhaud 
wrote:

> On 14/07/2015 18:21, Igor Stassiy wrote:
> > Julien, I have the following setting for WAL level: #wal_level = minimal
> > (which defaults to minimal anyway)
> >
>
> Sorry, I sent my mail too early :/
>
> So, option #2 is winner by design. You didn't say anything about your
> needs, so it's hard to help you much more.
>
> If you don't care about losing data on this table if your server
> crashes, you can try option #3 with an unlogged table.
>
>
> > On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud
> > mailto:julien.rouh...@dalibo.com>> wrote:
> >
> > On 14/07/2015 11:12, Igor Stassiy wrote:
> > > Hello,
> > >
> > > I am benchmarking different ways of putting data into table on
> table
> > > creation:
> > >
> > > 1. INSERT INTO c SELECT * FROM a JOIN b on a.id 
> >  = b.id 
> > > ;
> > > 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id 
> >  = b.id 
> > > ;
> > > 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id 
> >  = b.id 
> > > ) TO STDOUT" |
> > > parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
> > >
> > > (the parallel command is available as part of parallel deb package
> in
> > > Ubuntu for example, it splits the stdin by newline character and
> feeds
> > > it to the corresponding command)
> > >
> > > Both tables a and b have ~16M records and one of the columns in a
> is
> > > geometry (ranging from several KB in size to several MB). Columns
> in b
> > > are mostly integers.
> > >
> > > The machine that I am running these commands on has the following
> > > parameters:
> > >
> > > default_statistics_target = 50 # pgtune wizard 2012-06-06
> > > maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> > > constraint_exclusion = on # pgtune wizard 2012-06-06
> > > checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> > > effective_cache_size = 48GB # pgtune wizard 2012-06-06
> > > work_mem = 80MB # pgtune wizard 2012-06-06
> > > wal_buffers = 8MB # pgtune wizard 2012-06-06
> > > checkpoint_segments = 16 # pgtune wizard 2012-06-06
> > > shared_buffers = 16GB # pgtune wizard 2012-06-06
> > > max_connections = 400 # pgtune wizard 2012-06-06
> > >
> > > One would expect the 3rd option to be faster than 1 and 2, however
> 2
> > > outperforms both by a large margin (sometimes x2). This is
> especially
> > > surprising taking into account that COPY doesn't acquire a global
> lock
> > > on the table, only a RowExclusiveLock
> > > (according
> > > to
> > http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
> > >
> >
> > What is wal_level value? I think this is because of an optimisation
> > happening with wal_level = minimal:
> >
> > "In minimal level, WAL-logging of some bulk operations can be safely
> > skipped, which can make those operations much faster"
> >
> > see
> >
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
> >
> > > So is option 2 a winner by design? Could you please suggest other
> > > alternatives to try (if there are any)? And what might be the
> reason
> > > that 3 is not outperforming the other 2?
> > >
> > > Thank you,
> > > Igor
> > >
> > >
> >
> >
> > --
> > Julien Rouhaud
> > http://dalibo.com - http://dalibo.org
> >
>
>
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
>


Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Julien Rouhaud
On 14/07/2015 18:21, Igor Stassiy wrote:
> Julien, I have the following setting for WAL level: #wal_level = minimal
> (which defaults to minimal anyway)
> 

Sorry, I sent my mail too early :/

So, option #2 is the winner by design. You didn't say anything about your
needs, so it's hard to help you much more.

If you don't care about losing data on this table if your server
crashes, you can try option #3 with an unlogged table.


> On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud
> mailto:julien.rouh...@dalibo.com>> wrote:
> 
> On 14/07/2015 11:12, Igor Stassiy wrote:
> > Hello,
> >
> > I am benchmarking different ways of putting data into table on table
> > creation:
> >
> > 1. INSERT INTO c SELECT * FROM a JOIN b on a.id 
>  = b.id 
> > ;
> > 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id 
>  = b.id 
> > ;
> > 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id 
>  = b.id 
> > ) TO STDOUT" |
> > parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
> >
> > (the parallel command is available as part of parallel deb package in
> > Ubuntu for example, it splits the stdin by newline character and feeds
> > it to the corresponding command)
> >
> > Both tables a and b have ~16M records and one of the columns in a is
> > geometry (ranging from several KB in size to several MB). Columns in b
> > are mostly integers.
> >
> > The machine that I am running these commands on has the following
> > parameters:
> >
> > default_statistics_target = 50 # pgtune wizard 2012-06-06
> > maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> > constraint_exclusion = on # pgtune wizard 2012-06-06
> > checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> > effective_cache_size = 48GB # pgtune wizard 2012-06-06
> > work_mem = 80MB # pgtune wizard 2012-06-06
> > wal_buffers = 8MB # pgtune wizard 2012-06-06
> > checkpoint_segments = 16 # pgtune wizard 2012-06-06
> > shared_buffers = 16GB # pgtune wizard 2012-06-06
> > max_connections = 400 # pgtune wizard 2012-06-06
> >
> > One would expect the 3rd option to be faster than 1 and 2, however 2
> > outperforms both by a large margin (sometimes x2). This is especially
> > surprising taking into account that COPY doesn't acquire a global lock
> > on the table, only a RowExclusiveLock
> > (according
> > to
> http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
> >
> 
> What is wal_level value? I think this is because of an optimisation
> happening with wal_level = minimal:
> 
> "In minimal level, WAL-logging of some bulk operations can be safely
> skipped, which can make those operations much faster"
> 
> see
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
> 
> > So is option 2 a winner by design? Could you please suggest other
> > alternatives to try (if there are any)? And what might be the reason
> > that 3 is not outperforming the other 2?
> >
> > Thank you,
> > Igor
> >
> >
> 
> 
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Julien Rouhaud

On 14/07/2015 18:21, Igor Stassiy wrote:
> Julien, I have the following setting for WAL level: #wal_level =
> minimal (which defaults to minimal anyway)
> 
> On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud 
> mailto:julien.rouh...@dalibo.com>>
> wrote:
> 
> On 14/07/2015 11:12, Igor Stassiy wrote:
>> Hello,
>> 
>> I am benchmarking different ways of putting data into table on
>> table creation:
>> 
>> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id 
>  = b.id 
>> ; 2. CREATE TABLE c AS SELECT * FROM a JOIN b on
>> a.id 
>  = b.id 
>> ; 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id
>> 
>  = b.id 
>> ) TO STDOUT" | parallel --block 128M --jobs 4 --pipe
>> psql -c "COPY c FROM STDIN";
>> 
>> (the parallel command is available as part of parallel deb
>> package in Ubuntu for example, it splits the stdin by newline
>> character and feeds it to the corresponding command)
>> 
>> Both tables a and b have ~16M records and one of the columns in a
>> is geometry (ranging from several KB in size to several MB).
>> Columns in b are mostly integers.
>> 
>> The machine that I am running these commands on has the
>> following parameters:
>> 
>> default_statistics_target = 50 # pgtune wizard 2012-06-06 
>> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06 
>> constraint_exclusion = on # pgtune wizard 2012-06-06 
>> checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06 
>> effective_cache_size = 48GB # pgtune wizard 2012-06-06 work_mem =
>> 80MB # pgtune wizard 2012-06-06 wal_buffers = 8MB # pgtune wizard
>> 2012-06-06 checkpoint_segments = 16 # pgtune wizard 2012-06-06 
>> shared_buffers = 16GB # pgtune wizard 2012-06-06 max_connections
>> = 400 # pgtune wizard 2012-06-06
>> 
>> One would expect the 3rd option to be faster than 1 and 2,
>> however 2 outperforms both by a large margin (sometimes x2). This
>> is especially surprising taking into account that COPY doesn't
>> acquire a global lock on the table, only a RowExclusiveLock 
>> (according to
> http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
>
> 
> 
> 
> What is wal_level value? I think this is because of an
> optimisation happening with wal_level = minimal:
> 
> "In minimal level, WAL-logging of some bulk operations can be
> safely skipped, which can make those operations much faster"
> 
> see 
> http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>
> 
>> So is option 2 a winner by design? Could you please suggest
>> other alternatives to try (if there are any)? And what might be
>> the reason that 3 is not outperforming the other 2?
>> 
>> Thank you, Igor
>> 
>> 
> 
> 
> -- Julien Rouhaud http://dalibo.com - http://dalibo.org
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Igor Stassiy
Julien, I have the following setting for WAL level: #wal_level = minimal
(which defaults to minimal anyway)

On Tue, Jul 14, 2015 at 6:19 PM Julien Rouhaud 
wrote:

> On 14/07/2015 11:12, Igor Stassiy wrote:
> > Hello,
> >
> > I am benchmarking different ways of putting data into table on table
> > creation:
> >
> > 1. INSERT INTO c SELECT * FROM a JOIN b on a.id  = b.id
> > ;
> > 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id  = b.id
> > ;
> > 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id  = b.id
> > ) TO STDOUT" |
> > parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
> >
> > (the parallel command is available as part of parallel deb package in
> > Ubuntu for example, it splits the stdin by newline character and feeds
> > it to the corresponding command)
> >
> > Both tables a and b have ~16M records and one of the columns in a is
> > geometry (ranging from several KB in size to several MB). Columns in b
> > are mostly integers.
> >
> > The machine that I am running these commands on has the following
> > parameters:
> >
> > default_statistics_target = 50 # pgtune wizard 2012-06-06
> > maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> > constraint_exclusion = on # pgtune wizard 2012-06-06
> > checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> > effective_cache_size = 48GB # pgtune wizard 2012-06-06
> > work_mem = 80MB # pgtune wizard 2012-06-06
> > wal_buffers = 8MB # pgtune wizard 2012-06-06
> > checkpoint_segments = 16 # pgtune wizard 2012-06-06
> > shared_buffers = 16GB # pgtune wizard 2012-06-06
> > max_connections = 400 # pgtune wizard 2012-06-06
> >
> > One would expect the 3rd option to be faster than 1 and 2, however 2
> > outperforms both by a large margin (sometimes x2). This is especially
> > surprising taking into account that COPY doesn't acquire a global lock
> > on the table, only a RowExclusiveLock
> > (according
> > to http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
> >
>
> What is wal_level value? I think this is because of an optimisation
> happening with wal_level = minimal:
>
> "In minimal level, WAL-logging of some bulk operations can be safely
> skipped, which can make those operations much faster"
>
> see http://www.postgresql.org/docs/current/static/runtime-config-wal.html
>
> > So is option 2 a winner by design? Could you please suggest other
> > alternatives to try (if there are any)? And what might be the reason
> > that 3 is not outperforming the other 2?
> >
> > Thank you,
> > Igor
> >
> >
>
>
> --
> Julien Rouhaud
> http://dalibo.com - http://dalibo.org
>


Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Julien Rouhaud
On 14/07/2015 11:12, Igor Stassiy wrote:
> Hello,
> 
> I am benchmarking different ways of putting data into table on table
> creation:
> 
> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id  = b.id
> ;
> 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id  = b.id
> ;
> 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id  = b.id
> ) TO STDOUT" | 
> parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
> 
> (the parallel command is available as part of parallel deb package in
> Ubuntu for example, it splits the stdin by newline character and feeds
> it to the corresponding command)
> 
> Both tables a and b have ~16M records and one of the columns in a is
> geometry (ranging from several KB in size to several MB). Columns in b
> are mostly integers.
> 
> The machine that I am running these commands on has the following
> parameters:
> 
> default_statistics_target = 50 # pgtune wizard 2012-06-06
> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> constraint_exclusion = on # pgtune wizard 2012-06-06
> checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> effective_cache_size = 48GB # pgtune wizard 2012-06-06
> work_mem = 80MB # pgtune wizard 2012-06-06 
> wal_buffers = 8MB # pgtune wizard 2012-06-06 
> checkpoint_segments = 16 # pgtune wizard 2012-06-06 
> shared_buffers = 16GB # pgtune wizard 2012-06-06 
> max_connections = 400 # pgtune wizard 2012-06-06
> 
> One would expect the 3rd option to be faster than 1 and 2, however 2
> outperforms both by a large margin (sometimes x2). This is especially
> surprising taking into account that COPY doesn't acquire a global lock
> on the table, only a RowExclusiveLock 
> (according
> to http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
> 

What is wal_level value? I think this is because of an optimisation
happening with wal_level = minimal:

"In minimal level, WAL-logging of some bulk operations can be safely
skipped, which can make those operations much faster"

see http://www.postgresql.org/docs/current/static/runtime-config-wal.html

> So is option 2 a winner by design? Could you please suggest other
> alternatives to try (if there are any)? And what might be the reason
> that 3 is not outperforming the other 2?
> 
> Thank you,
> Igor
> 
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread Igor Stassiy
Thank you David. I tried to measure the time of COPY (SELECT * FROM a JOIN
b ON a.id = b.id) TO '/tmp/dump.sql' and it took about an order of
magnitude (~10x) less time than the complete command (together with the
INSERT), so conversion is probably not the main factor in the slowdown
(unless conversion from text -> internal is significantly slower than from
internal -> text).

I will also try your suggestion with limiting the ids range.

On Tue, Jul 14, 2015 at 1:42 PM David Rowley 
wrote:

> On 14 July 2015 at 21:12, Igor Stassiy  wrote:
>
>> Hello,
>>
>> I am benchmarking different ways of putting data into table on table
>> creation:
>>
>> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
>> 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
>> 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
>> parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
>>
>> (the parallel command is available as part of parallel deb package in
>> Ubuntu for example, it splits the stdin by newline character and feeds it
>> to the corresponding command)
>>
>> Both tables a and b have ~16M records and one of the columns in a is
>> geometry (ranging from several KB in size to several MB). Columns in b are
>> mostly integers.
>>
>> The machine that I am running these commands on has the following
>> parameters:
>>
>> default_statistics_target = 50 # pgtune wizard 2012-06-06
>> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06 constraint_exclusion
>> = on # pgtune wizard 2012-06-06 checkpoint_completion_target = 0.9 # pgtune
>> wizard 2012-06-06 effective_cache_size = 48GB # pgtune wizard 2012-06-06
>> work_mem = 80MB # pgtune wizard 2012-06-06
>> wal_buffers = 8MB # pgtune wizard 2012-06-06
>> checkpoint_segments = 16 # pgtune wizard 2012-06-06
>> shared_buffers = 16GB # pgtune wizard 2012-06-06
>> max_connections = 400 # pgtune wizard 2012-06-06
>>
>> One would expect the 3rd option to be faster than 1 and 2, however 2
>> outperforms both by a large margin (sometimes x2). This is especially
>> surprising taking into account that COPY doesn't acquire a global lock on
>> the table, only a RowExclusiveLock
>> (according to
>> http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
>>
>> So is option 2 a winner by design? Could you please suggest other
>> alternatives to try (if there are any)? And what might be the reason that 3
>> is not outperforming the other 2?
>>
>
> I would imagine that the calling of the output function to translate each
> value's internal representation to its user-visible/external
> representation, plus all the overhead of sending results to the client,
> would be a likely candidate for the slowdown. In any case, 3 would only be
> as fast as the query generating the output. With 1 and 2, all the tuple
> representations of each record stay in the internal format.
>
> If you have some logical way to break the query down into parts, then
> maybe that would be a place to look.
> For example:
>
> INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id < 800;
> INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id >= 800;
>
> Of course, you'd need to be very careful to ensure that the results of
> each SELECT never overlap. It would be nice to invent some better way than
> this that divided the workload evenly even when the tables grow.
>
> Then you could run these concurrently.
>
> Regards
>
> David Rowley
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
> 
>  PostgreSQL Development, 24x7 Support, Training & Services
>
>


Re: [GENERAL] Creating table with data from a join

2015-07-14 Thread David Rowley
On 14 July 2015 at 21:12, Igor Stassiy  wrote:

> Hello,
>
> I am benchmarking different ways of putting data into table on table
> creation:
>
> 1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
> 2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
> 3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
> parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";
>
> (the parallel command is available as part of the parallel deb package in
> Ubuntu, for example; it splits stdin on newline characters and feeds the
> chunks to the corresponding command)
>
> Both tables a and b have ~16M records and one of the columns in a is
> geometry (ranging from several KB in size to several MB). Columns in b are
> mostly integers.
>
> The machine that I am running these commands on has the following
> parameters:
>
> default_statistics_target = 50 # pgtune wizard 2012-06-06
> maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
> constraint_exclusion = on # pgtune wizard 2012-06-06
> checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
> effective_cache_size = 48GB # pgtune wizard 2012-06-06
> work_mem = 80MB # pgtune wizard 2012-06-06
> wal_buffers = 8MB # pgtune wizard 2012-06-06
> checkpoint_segments = 16 # pgtune wizard 2012-06-06
> shared_buffers = 16GB # pgtune wizard 2012-06-06
> max_connections = 400 # pgtune wizard 2012-06-06
>
> One would expect the 3rd option to be faster than 1 and 2; however, 2
> outperforms both by a large margin (sometimes 2x). This is especially
> surprising given that COPY doesn't acquire a global lock on the table,
> only a RowExclusiveLock
> (according to
> http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)
>
> So is option 2 a winner by design? Could you please suggest other
> alternatives to try (if there are any)? And what might be the reason that 3
> is not outperforming the other 2?
>

I would imagine that calling the output function to translate each value's
internal representation to its user-visible/external representation, plus
all the overhead of sending results to the client, would be a likely
candidate for the slowdown. In any case, 3 can only be as fast as the query
generating the output. With 1 and 2, the tuple representation of each
record stays in the internal format.
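One way to check how much of case 3's cost is the query itself, as opposed to output conversion and client transfer, is to time the bare join server-side (a sketch; EXPLAIN ANALYZE executes the query but discards the result rows, and the BUFFERS option exists in PostgreSQL 9.0 and later):

```sql
-- Run the join on the server without sending rows to the client.
-- If this alone approaches option 3's total runtime, the output
-- conversion and client round-trip are not the dominant cost.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM a JOIN b ON a.id = b.id;
```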

If you have some logical way to break the query down into parts, then maybe
that would be a place to look.
For example:

INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id < 800;
INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id WHERE a.id >= 800;

Of course, you'd need to be very careful to ensure that the results of each
SELECT never overlap. It would be nice to invent some better way than this
that divides the workload evenly even as the tables grow.
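A sketch of one such approach, assuming a.id is orderable and reasonably selective: compute the split point from the data instead of hard-coding a constant, so the two halves stay balanced as the table grows (percentile_cont as an ordered-set aggregate requires PostgreSQL 9.4+; :split_id below stands in for the value returned by the first query, e.g. via a psql variable):

```sql
-- Find the median id of table a; each INSERT then covers roughly half
-- of a's rows, regardless of how the id values are distributed.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY id) AS split_id FROM a;

-- Run these two in separate sessions, substituting the value from above:
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id <  :split_id;
INSERT INTO c SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id >= :split_id;
```

Splitting at more quantile points (0.25, 0.5, 0.75) generalizes this to four concurrent sessions in the same way.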

Then you could run these concurrently.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


[GENERAL] Creating table with data from a join

2015-07-14 Thread Igor Stassiy
Hello,

I am benchmarking different ways of putting data into a table on table
creation:

1. INSERT INTO c SELECT * FROM a JOIN b on a.id = b.id;
2. CREATE TABLE c AS SELECT * FROM a JOIN b on a.id = b.id;
3. psql -c "COPY (SELECT * FROM a JOIN b on a.id = b.id) TO STDOUT" |
parallel --block 128M --jobs 4 --pipe psql -c "COPY c FROM STDIN";

(the parallel command is available as part of the parallel deb package in
Ubuntu, for example; it splits stdin on newline characters and feeds the
chunks to the corresponding command)

Both tables a and b have ~16M records and one of the columns in a is
geometry (ranging from several KB in size to several MB). Columns in b are
mostly integers.

The machine that I am running these commands on has the following
parameters:

default_statistics_target = 50 # pgtune wizard 2012-06-06
maintenance_work_mem = 1GB # pgtune wizard 2012-06-06
constraint_exclusion = on # pgtune wizard 2012-06-06
checkpoint_completion_target = 0.9 # pgtune wizard 2012-06-06
effective_cache_size = 48GB # pgtune wizard 2012-06-06
work_mem = 80MB # pgtune wizard 2012-06-06
wal_buffers = 8MB # pgtune wizard 2012-06-06
checkpoint_segments = 16 # pgtune wizard 2012-06-06
shared_buffers = 16GB # pgtune wizard 2012-06-06
max_connections = 400 # pgtune wizard 2012-06-06

One would expect the 3rd option to be faster than 1 and 2; however, 2
outperforms both by a large margin (sometimes 2x). This is especially
surprising given that COPY doesn't acquire a global lock on the table,
only a RowExclusiveLock
(according to
http://www.postgresql.org/message-id/10611.1014867...@sss.pgh.pa.us)

So is option 2 a winner by design? Could you please suggest other
alternatives to try (if there are any)? And what might be the reason that 3
is not outperforming the other 2?

Thank you,
Igor