Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-08 Thread Francisco Olarte
Hi Victor:

> You're right, our internet connection is going to be the limiting factor.
...

Good to know. Then you have one restriction: your bandwidth...

> Essentially, the PostgreSQL server is in a datacentre, the server we're
> dumping to is in the office.
> Running a script on the PostgreSQL server in the datacentre is going to be
> tricky (not so much technically, just from a procedures/security point of
> view).

... another restriction: you cannot run code on the server.

An aside here: you know Postgres can do a server-side COPY, and even
pipe the results to a program, so if you have 600 MB of spare disk, or
about 150 MB and access to gzip, or access to an ssh CLIENT on the server
and an ssh SERVER in the office, you could transfer the file easily. I.e.,
if you have a directory which the server account can access, you
could do

COPY the_table_or_query TO '/the/dir/the/filename' WITH (FORMAT csv);

or

COPY the_table_or_query TO PROGRAM 'gzip -c >
/the/dir/the/filename.csv.gz' WITH (FORMAT csv);

and then transfer the appropriate file to the office at your leisure
using whichever method you are used to.
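
Note you do not need a shell on the server for this: the COPY runs
server-side even when issued from an ordinary client connection. A rough
psycopg2 sketch ( untested; connection details, table name and path are
made up, and COPY to a file or PROGRAM needs superuser rights on the
server, with TO PROGRAM available from 9.3 ):

import psycopg2

conn = psycopg2.connect("host=db.example.com dbname=mydb user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Only the command text crosses the wire; the compressed file is written
# on the database server itself.
cur.execute("""
    COPY (SELECT * FROM my_big_table)
    TO PROGRAM 'gzip -c > /the/dir/the/filename.csv.gz'
    WITH (FORMAT csv)
""")

cur.close()
conn.close()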

In fact, if you have an ssh server in your office accessible from the
database server, you could transfer it there directly, or use a listening
socket in your program and netcat, but in this case you'll run into the
same timeout problems ( except you can compress and so reduce the
transmission time ).

> Dumping to a spare table seems like an interesting point - so we'd just
> create the table, COPY the results to that table, then use LIMIT/OFFSET to
> paginate through that, then drop the table afterwards?

That is more or less it. You do not COPY; you just create the table with
the proper data, i.e., instead of "COPY (my query) ..." you do a
'CREATE TABLE tmpxxx AS SELECT ...'
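
Roughly, from psycopg2 ( untested sketch; connection details and the
query are made up, substitute your real export query ):

import psycopg2

conn = psycopg2.connect("host=db.example.com dbname=mydb user=report")
cur = conn.cursor()

# One short statement that sends no rows to the client, so the slow link
# never enters the picture while the snapshot is being built.
cur.execute("CREATE TABLE tmpxxx AS SELECT * FROM my_big_table")
conn.commit()

cur.close()
conn.close()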

> Currently, I'm doing a quick hack where we download an ordered list of the
> ids (auto-incrementing integer) into Python, chunk it up into groups of ids,
> then use a WHERE IN clause to download each chunk via COPY.

This has a problem: you need to wrap everything in a single
transaction to avoid the update / delete / insert in the middle
problem you mentioned previously, so your transaction time is going to
be even longer. Also, if the integers are auto-incrementing, you can do
better. First get min and max ( I do not remember whether Postgres already
optimizes them or you'll need the 'ORDER BY ... LIMIT 1' trick ). Then
just loop in appropriately sized steps ( I assume the table is dense, but
you can check the stats, and you should know your data patterns ).
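
Something like this ( untested psycopg2 sketch; connection details, table
name and chunk size are made up, and it assumes a dense id column and a
non-empty table ):

import psycopg2

conn = psycopg2.connect("host=db.example.com dbname=mydb user=report")
conn.autocommit = True        # one short transaction per chunk
cur = conn.cursor()

cur.execute("SELECT min(id), max(id) FROM my_big_table")
lo, hi = cur.fetchone()

CHUNK = 100000                # tune to your row width and link speed
start = lo
while start <= hi:
    # The bounds are integers fetched from the database, so interpolating
    # them into the COPY text is safe ( COPY takes no bind parameters ).
    sql = ("COPY (SELECT * FROM my_big_table WHERE id >= %d AND id < %d) "
           "TO STDOUT WITH (FORMAT csv)" % (start, start + CHUNK))
    with open("chunk_%d.csv" % start, "w") as f:
        cur.copy_expert(sql, f)
    start += CHUNK

cur.close()
conn.close()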

> Would dumping to a spare table and paginating be a better approach? Reasons?
> (Not challenging it, I just want to understand everything).

The table is created as a single operation in a single transaction,
which has no delay problems as it sends no output to the client.

From there on the table is read-only, so you can use a transaction for
each chunk, and you know the ids do not change. Even if a chunk copy
fails due to the internet connection farting a bit, you just retry it.

If you are going to use LIMIT/OFFSET you'll need an index on the field
anyway, but you can build the index in a separate transaction after
populating the table ( so you do not interfere with anyone ).

You can even do everything in autocommit mode, saving server roundtrips.
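
For instance ( untested psycopg2 sketch; it assumes the tmpxxx snapshot
table from above, and the page size is made up ):

import psycopg2

conn = psycopg2.connect("host=db.example.com dbname=mydb user=report")
conn.autocommit = True        # every statement is its own short transaction
cur = conn.cursor()

# Index in its own transaction, once the snapshot exists.
cur.execute("CREATE INDEX ON tmpxxx (id)")

cur.execute("SELECT count(*) FROM tmpxxx")
total = cur.fetchone()[0]

CHUNK = 100000                # hypothetical page size
for offset in range(0, total, CHUNK):
    sql = ("COPY (SELECT * FROM tmpxxx ORDER BY id LIMIT %d OFFSET %d) "
           "TO STDOUT WITH (FORMAT csv)" % (CHUNK, offset))
    # If a chunk fails because the link hiccups, just rerun this iteration.
    with open("tmpxxx_%08d.csv" % offset, "w") as f:
        cur.copy_expert(sql, f)

cur.execute("DROP TABLE tmpxxx")
cur.close()
conn.close()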

Anyway, this can be done, but if you can do the server-side copy and
ssh thing, that is much better. I do not know how to do it on other
OSs, but on Unix, if you have an ssh-accessible account with enough
quota on the server, you could do something like create a directory in
that user's home, give the server user adequate access to it, and let
the server dump the data there.

Francisco Olarte.




Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-08 Thread Victor Hooi
Hi,

@Francisco - Yeah, the file is around 600 MB currently, uncompressed.

You're right, our internet connection is going to be the limiting factor.

Essentially, the PostgreSQL server is in a datacentre, the server we're
dumping to is in the office.

Running a script on the PostgreSQL server in the datacentre is going to be
tricky (not so much technically, just from a procedures/security point of
view).

Dumping to a spare table seems like an interesting point - so we'd just
create the table, COPY the results to that table, then use LIMIT/OFFSET to
paginate through that, then drop the table afterwards?

Currently, I'm doing a quick hack where we download an ordered list of the
ids (auto-incrementing integer) into Python, chunk it up into groups of
ids, then use a WHERE IN clause to download each chunk via COPY.

Would dumping to a spare table and paginating be a better approach? Reasons?
(Not challenging it, I just want to understand everything).

Cheers,
Victor


On Fri, Nov 8, 2013 at 6:36 PM, Francisco Olarte wrote:

> On Fri, Nov 8, 2013 at 5:09 AM, Victor Hooi  wrote:
> > They think that it might be limited by the network, and how fast the
> > PostgreSQL server can push the data across the internet. (The Postgres
> > server and the box running the query are connected over the internet).
>
> You previously said you had 600 MB. Over the internet. Is it a very
> fat pipe? Because otherwise the limiting factor is probably not the
> speed at which Postgres can push the results, but the throughput of your
> link.
>
> If, as you stated, you need a single transaction to get a 600 MB
> snapshot, I would recommend dumping it to disk, compressing on the fly
> ( you should easily get a four- or five-fold reduction on a CSV file using
> any decent compressor ), and then sending the file. If you do not have
> disk for the dump but can run programs near the server, you can try
> compressing on the fly. If you have none of this but do have
> space for a spare table, use a SELECT INTO, paginate that output and
> drop it afterwards. Or just look at the configs and set longer query
> timeouts; if your app NEEDS two-hour queries, they can be enabled. But
> anyway, doing a long transaction over the internet does not seem like a
> good idea to me.
>
> Francisco Olarte
>


Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread Francisco Olarte
On Fri, Nov 8, 2013 at 5:09 AM, Victor Hooi  wrote:
> They think that it might be limited by the network, and how fast the
> PostgreSQL server can push the data across the internet. (The Postgres
> server and the box running the query are connected over the internet).

You previously said you had 600 MB. Over the internet. Is it a very
fat pipe? Because otherwise the limiting factor is probably not the
speed at which Postgres can push the results, but the throughput of your
link.

If, as you stated, you need a single transaction to get a 600 MB
snapshot, I would recommend dumping it to disk, compressing on the fly
( you should easily get a four- or five-fold reduction on a CSV file using
any decent compressor ), and then sending the file. If you do not have
disk for the dump but can run programs near the server, you can try
compressing on the fly. If you have none of this but do have
space for a spare table, use a SELECT INTO, paginate that output and
drop it afterwards. Or just look at the configs and set longer query
timeouts; if your app NEEDS two-hour queries, they can be enabled. But
anyway, doing a long transaction over the internet does not seem like a
good idea to me.
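
For the compress-on-the-fly case, a rough psycopg2 sketch ( untested;
connection details, table and file names are made up; gzip text mode needs
Python 3, on Python 2 open the file with mode 'wb' ):

import gzip
import psycopg2

conn = psycopg2.connect("host=db.example.com dbname=mydb user=report")
cur = conn.cursor()

# Stream the COPY output straight through gzip; CSV typically shrinks
# several-fold, so far fewer bytes have to cross the slow link or hit disk.
with gzip.open("dump.csv.gz", "wt") as f:
    cur.copy_expert(
        "COPY (SELECT * FROM my_big_table) TO STDOUT WITH (FORMAT csv)", f)

cur.close()
conn.close()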

Francisco Olarte




Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread Victor Hooi
Hi,

Aha, I spoke to somebody; apparently we've actually got those values
set to 15 minutes currently...

They think that it might be limited by the network, and how fast the
PostgreSQL server can push the data across the internet. (The Postgres
server and the box running the query are connected over the internet).

Cheers,
Victor


On Fri, Nov 8, 2013 at 1:44 PM, Victor Hooi  wrote:

> Hi,
>
> Hmm, ok, I'll pass that onto our DBA/operations guys, and see if that
> helps.
>
> Do these settings still work if you only have a single Postgres instance?
> (I'll need to check our setup).
>
> So my understanding is that the default is 30 seconds (
> http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html)
> - but we're increasing it to 600 seconds, and that should give the COPY
> command enough time to pull down the data?
>
> As a rough guide, the dumped CSV file is around 600 MB.
>
> Is there any other background you might be able to give on what you think
> might be happening, or how this might fix it?
>
> And you'd recommend tweaking these values over trying to chunk up the
> COPY/SELECT, is that right?
>
> I've just realised the LIMIT/OFFSET thing may not work well to paginate,
> since there may be new or deleted records between each time I call it?
>
> Cheers,
> Victor
>
>
> On Fri, Nov 8, 2013 at 1:15 PM, wd  wrote:
>
>> Try this,
>>
>> max_standby_archive_delay = 600s# max delay before canceling queries
>> # when reading WAL from archive;
>> # -1 allows indefinite delay
>> max_standby_streaming_delay = 600s  # max delay before canceling queries
>> # when reading streaming WAL;
>> # -1 allows indefinite delay
>>
>> or try
>> pg_xlog_replay_pause()
>> pg_xlog_replay_resume()
>>
>>
>>
>>
>> On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi wrote:
>>
>>> Hi,
>>>
>>> We're using psycopg2 with COPY to dump CSV output from a large query.
>>>
>>> The actual SELECT query itself is large (both in number of
>>> records/columns, and also in width of values in columns), but still
>>> completes in under a minute on the server.
>>>
>>> However, if you then use a COPY with it, it will often time out.
>>>
>>> We're using psycopg2 to run the command, the trace we get is something
>>> like:
>>>
>>> Traceback (most recent call last):
>>>   File "foo.py", line 259, in <module>
>>> jobs[job].run_all()
>>>   File "foo.py", line 127, in run_all
>>> self.export_to_csv()
>>>   File "foo.py", line 168, in export_to_csv
>>> cur.copy_expert(self.export_sql_statement, f)
>>> psycopg2.extensions.TransactionRollbackError: canceling statement due to
>>> conflict with recovery
>>>  DETAIL:  User was holding shared buffer pin for too long.
>>>
>>> My question is, what are some simple ways we can use to chunk up the
>>> query?
>>>
>>> Could we pull down a list of all the ids (auto-incrementing int), break
>>> this list up, then use a WHERE clause to break it up, running multiple COPY
>>> commands?
>>>
>>> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
>>> how we'd figure out when we reached the end of the results set though
>>> (apart from just counting the results?).
>>>
>>> Or are there other approaches you guys could recommend?
>>>
>>> Cheers,
>>> Victor
>>>
>>
>>
>


Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread Victor Hooi
Hi,

Hmm, ok, I'll pass that onto our DBA/operations guys, and see if that helps.

Do these settings still work if you only have a single Postgres instance?
(I'll need to check our setup).

So my understanding is that the default is 30 seconds (
http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html)
- but we're increasing it to 600 seconds, and that should give the COPY
command enough time to pull down the data?

As a rough guide, the dumped CSV file is around 600 MB.

Is there any other background you might be able to give on what you think
might be happening, or how this might fix it?

And you'd recommend tweaking these values over trying to chunk up the
COPY/SELECT, is that right?

I've just realised the LIMIT/OFFSET thing may not work well to paginate,
since there may be new or deleted records between each time I call it?

Cheers,
Victor


On Fri, Nov 8, 2013 at 1:15 PM, wd  wrote:

> Try this,
>
> max_standby_archive_delay = 600s# max delay before canceling queries
> # when reading WAL from archive;
> # -1 allows indefinite delay
> max_standby_streaming_delay = 600s  # max delay before canceling queries
> # when reading streaming WAL;
> # -1 allows indefinite delay
>
> or try
> pg_xlog_replay_pause()
> pg_xlog_replay_resume()
>
>
>
>
> On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi  wrote:
>
>> Hi,
>>
>> We're using psycopg2 with COPY to dump CSV output from a large query.
>>
>> The actual SELECT query itself is large (both in number of
>> records/columns, and also in width of values in columns), but still
>> completes in under a minute on the server.
>>
>> However, if you then use a COPY with it, it will often time out.
>>
>> We're using psycopg2 to run the command, the trace we get is something
>> like:
>>
>> Traceback (most recent call last):
>>   File "foo.py", line 259, in <module>
>> jobs[job].run_all()
>>   File "foo.py", line 127, in run_all
>> self.export_to_csv()
>>   File "foo.py", line 168, in export_to_csv
>> cur.copy_expert(self.export_sql_statement, f)
>> psycopg2.extensions.TransactionRollbackError: canceling statement due to
>> conflict with recovery
>>  DETAIL:  User was holding shared buffer pin for too long.
>>
>> My question is, what are some simple ways we can use to chunk up the
>> query?
>>
>> Could we pull down a list of all the ids (auto-incrementing int), break
>> this list up, then use a WHERE clause to break it up, running multiple COPY
>> commands?
>>
>> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
>> how we'd figure out when we reached the end of the results set though
>> (apart from just counting the results?).
>>
>> Or are there other approaches you guys could recommend?
>>
>> Cheers,
>> Victor
>>
>
>


Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread wd
Try this,

max_standby_archive_delay = 600s# max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
max_standby_streaming_delay = 600s  # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay

or try
pg_xlog_replay_pause()
pg_xlog_replay_resume()
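
A rough psycopg2 sketch of the pause/resume route ( untested; connection
details and table name are made up, pausing replay needs superuser on the
standby, and the standby will fall behind until you resume; on PostgreSQL
10+ the functions are named pg_wal_replay_pause/resume ):

import psycopg2

conn = psycopg2.connect("host=standby.example.com dbname=mydb user=postgres")
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_xlog_replay_pause()")    # stop applying WAL for now
try:
    with open("dump.csv", "w") as f:
        cur.copy_expert(
            "COPY (SELECT * FROM my_big_table) TO STDOUT WITH (FORMAT csv)", f)
finally:
    cur.execute("SELECT pg_xlog_replay_resume()")  # let the standby catch up

cur.close()
conn.close()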




On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi  wrote:

> Hi,
>
> We're using psycopg2 with COPY to dump CSV output from a large query.
>
> The actual SELECT query itself is large (both in number of
> records/columns, and also in width of values in columns), but still
> completes in under a minute on the server.
>
> However, if you then use a COPY with it, it will often time out.
>
> We're using psycopg2 to run the command, the trace we get is something
> like:
>
> Traceback (most recent call last):
>   File "foo.py", line 259, in <module>
> jobs[job].run_all()
>   File "foo.py", line 127, in run_all
> self.export_to_csv()
>   File "foo.py", line 168, in export_to_csv
> cur.copy_expert(self.export_sql_statement, f)
> psycopg2.extensions.TransactionRollbackError: canceling statement due to
> conflict with recovery
>  DETAIL:  User was holding shared buffer pin for too long.
>
> My question is, what are some simple ways we can use to chunk up the query?
>
> Could we pull down a list of all the ids (auto-incrementing int), break
> this list up, then use a WHERE clause to break it up, running multiple COPY
> commands?
>
> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure how
> we'd figure out when we reached the end of the results set though (apart
> from just counting the results?).
>
> Or are there other approaches you guys could recommend?
>
> Cheers,
> Victor
>