Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?
Hi Victor:

> You're right, our internet connection is going to be the limiting factor.

Good to know. Then you have one restriction: your bandwidth.

> Essentially, the PostgreSQL server is in a datacentre, the server we're
> dumping to is in the office.
> Running a script on the PostgreSQL server in the datacentre is going to be
> tricky (not so much technically, just from a procedures/security point of
> view).

Another restriction: you cannot run code on the server.

An aside here: Postgres can do a server-side COPY, and can even pipe the results to a program. So if you have 600 MB of spare disk (or about 150 MB and access to gzip), or access to an ssh CLIENT on the server and an ssh SERVER in the office, you could transfer the file easily. I.e., if you have a directory which the server account can write to, you could do

COPY the_table_or_query TO '/the/dir/the/filename' WITH (FORMAT csv);

or

COPY the_table_or_query TO PROGRAM 'gzip -c > /the/dir/the/filename.gz' WITH (FORMAT csv);

and then transfer the appropriate file to the office at your leisure, using whichever method you are used to. In fact, if you have an ssh server in your office accessible from the database server, you could transfer it there directly, or use a listening socket in your program and netcat; but in that case you'll run into the same timeout problems (except that compressing reduces the transmission time).

> Dumping to a spare table seems like an interesting point - so we'd just
> create the table, COPY the results to that table, then use LIMIT/OFFSET to
> paginate through that, then drop the table afterwards?

That is more or less it. You do not COPY; you just create the table with the proper data, i.e., instead of "COPY (my query) ..." you do a 'CREATE TABLE tmpxxx AS SELECT ...'.

> Currently, I'm doing a quick hack where we download an ordered list of the
> ids (auto-incrementing integer) into Python, chunk it up into groups of ids,
> then use a WHERE IN clause to download each chunk via COPY.
This has a problem: you need to wrap everything in a single transaction to avoid the update/delete/insert-in-the-middle problem you mentioned previously, so your transaction time is going to be even bigger. Also, if the ids are auto-incrementing integers, you can do better. First get the min and max (I do not remember whether Postgres already optimizes these, or whether you'll need the 'ORDER BY ... LIMIT 1' trick). Then just loop over the range in appropriately sized steps (I am supposing the table is dense, but you can check the stats, and you should know your data patterns).

> Would dumping to a spare table and paginating a better approach? Reasons?
> (Not challenging it, I just want to understand everything).

The table is created as a single operation in a single transaction, which has no delay problems as it does not generate output. From there on the table is read-only, so you can use a separate transaction for each chunk, and you know the ids do not change. Even if a chunk copy fails because the internet connection farts a bit, you just retry it. If you are going to use LIMIT/OFFSET you'll need an index on the field anyway, but you can build the index in a separate transaction after creating the table (so you do not interfere with anyone). You can even do everything in autocommit mode, saving server round trips.

Anyway, this can be done, but if you can do the server-side COPY and ssh thing, that is much better. I do not know how to do it on other OSs, but on Unix, if you have an ssh-accessible account with enough quota on the server, you could create a directory in the user's home, give the server user adequate access to it, and let the server dump the data there.

Francisco Olarte.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
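The min/max stepping that Francisco describes can be sketched in Python with psycopg2. This is only an illustration: the table name `my_table`, the `id` column, and the connection DSN are invented, and each chunk runs as its own short COPY in autocommit mode, as suggested above.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (lo, hi) id ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        yield lo, min(lo + step - 1, max_id)
        lo += step

def copy_in_chunks(dsn, out_path, step=100_000):
    # Hypothetical table/column names; requires psycopg2 and a reachable server.
    import psycopg2
    with psycopg2.connect(dsn) as conn, open(out_path, "wb") as f:
        conn.autocommit = True  # each COPY becomes its own short transaction
        with conn.cursor() as cur:
            cur.execute("SELECT min(id), max(id) FROM my_table")
            min_id, max_id = cur.fetchone()
            for lo, hi in id_ranges(min_id, max_id, step):
                cur.copy_expert(
                    "COPY (SELECT * FROM my_table "
                    f"WHERE id BETWEEN {lo} AND {hi} ORDER BY id) "
                    "TO STDOUT WITH (FORMAT csv)",
                    f)
```

If a sparse id range makes some chunks empty, that is harmless here; a failed chunk can simply be re-run, since ids in a read-only (or snapshot) table do not move.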
Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?
Hi,

@Francisco - Yeah, the file is around 600 MB currently, uncompressed.

You're right, our internet connection is going to be the limiting factor.

Essentially, the PostgreSQL server is in a datacentre, the server we're dumping to is in the office.

Running a script on the PostgreSQL server in the datacentre is going to be tricky (not so much technically, just from a procedures/security point of view).

Dumping to a spare table seems like an interesting point - so we'd just create the table, COPY the results to that table, then use LIMIT/OFFSET to paginate through that, then drop the table afterwards?

Currently, I'm doing a quick hack where we download an ordered list of the ids (auto-incrementing integer) into Python, chunk it up into groups of ids, then use a WHERE IN clause to download each chunk via COPY.

Would dumping to a spare table and paginating be a better approach? Reasons? (Not challenging it, I just want to understand everything).

Cheers,
Victor

On Fri, Nov 8, 2013 at 6:36 PM, Francisco Olarte wrote:

> On Fri, Nov 8, 2013 at 5:09 AM, Victor Hooi wrote:
> > They think that it might be limited by the network, and how fast the
> > PostgreSQL server can push the data across the internet. (The Postgres
> > server and the box running the query are connected over the internet).
>
> You previously said you had 600 MB. Over the internet. Is it a very
> fat pipe? Because otherwise the limiting factor is probably not the
> speed at which Postgres can push the results, but the throughput of
> your link.
>
> If, as you stated, you need a single transaction to get a 600 MB
> snapshot, I would recommend dumping it to disk, compressing on the fly
> (you should easily get a four- or five-fold reduction on a CSV file
> using any decent compressor), and then sending the file. If you do not
> have disk for the dump but can run programs near the server, you can
> try compressing on the fly. If you have got none of this but have got
> space for a spare table, use a SELECT INTO, paginate the output, and
> drop it after. Or just look at the configs and set longer query times;
> if your app NEEDS two-hour queries, they can be enabled. But anyway,
> doing a long transaction over the internet does not seem like a good
> idea to me.
>
> Francisco Olarte
Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?
On Fri, Nov 8, 2013 at 5:09 AM, Victor Hooi wrote:

> They think that it might be limited by the network, and how fast the
> PostgreSQL server can push the data across the internet. (The Postgres
> server and the box running the query are connected over the internet).

You previously said you had 600 MB. Over the internet. Is it a very fat pipe? Because otherwise the limiting factor is probably not the speed at which Postgres can push the results, but the throughput of your link.

If, as you stated, you need a single transaction to get a 600 MB snapshot, I would recommend dumping it to disk, compressing on the fly (you should easily get a four- or five-fold reduction on a CSV file using any decent compressor), and then sending the file. If you do not have disk for the dump but can run programs near the server, you can try compressing on the fly. If you have got none of this but have got space for a spare table, use a SELECT INTO, paginate the output, and drop it after. Or just look at the configs and set longer query times; if your app NEEDS two-hour queries, they can be enabled. But anyway, doing a long transaction over the internet does not seem like a good idea to me.

Francisco Olarte
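Compressing on the fly can also be done on the client side, with no disk or shell access near the server: psycopg2's `copy_expert` will happily write the COPY stream into a gzip file object. A rough sketch (the query, DSN, and output path are invented for illustration):

```python
import gzip

def copy_statement(query):
    """Wrap a SELECT in a COPY ... TO STDOUT that emits CSV with a header row."""
    return f"COPY ({query}) TO STDOUT WITH (FORMAT csv, HEADER true)"

def dump_compressed(dsn, query, out_path):
    # The COPY stream is compressed as it arrives, so the uncompressed
    # CSV never touches the local disk. Requires psycopg2.
    import psycopg2
    with psycopg2.connect(dsn) as conn, gzip.open(out_path, "wb") as f:
        with conn.cursor() as cur:
            cur.copy_expert(copy_statement(query), f)
```

Note this is still one long COPY in one transaction, so it shrinks the transfer time (helping against the bandwidth limit) rather than avoiding the recovery-conflict cancellation itself.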
Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?
Hi,

Aha, I spoke to somebody - apparently we've actually got those values set to 15 minutes currently...

They think that it might be limited by the network, and how fast the PostgreSQL server can push the data across the internet. (The Postgres server and the box running the query are connected over the internet).

Cheers,
Victor

On Fri, Nov 8, 2013 at 1:44 PM, Victor Hooi wrote:

> Hi,
>
> Hmm, ok, I'll pass that on to our DBA/operations guys, and see if that
> helps.
>
> Do these settings still work if you only have a single Postgres instance?
> (I'll need to check our setup).
>
> So my understanding is that the default is 30 seconds (
> http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html)
> - but we're increasing it to 600 seconds, and that should give the COPY
> command enough time to pull down the data?
>
> As a rough guide, the dumped CSV file is around 600 MB.
>
> Is there any other background you might be able to give on what you think
> might be happening, or how this might fix it?
>
> And you'd recommend tweaking these values over trying to chunk up the
> COPY/SELECT, is that right?
>
> I've just realised the LIMIT/ORDER thing may not work well to paginate,
> since there may be new records, or deleted records, between each time I
> call it?
>
> Cheers,
> Victor
>
> On Fri, Nov 8, 2013 at 1:15 PM, wd wrote:
>
>> Try this,
>>
>> max_standby_archive_delay = 600s   # max delay before canceling queries
>>                                    # when reading WAL from archive;
>>                                    # -1 allows indefinite delay
>> max_standby_streaming_delay = 600s # max delay before canceling queries
>>                                    # when reading streaming WAL;
>>                                    # -1 allows indefinite delay
>>
>> or try
>> pg_xlog_replay_pause()
>> pg_xlog_replay_resume()
>>
>> On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi wrote:
>>
>>> Hi,
>>>
>>> We're using psycopg2 with COPY to dump CSV output from a large query.
>>>
>>> The actual SELECT query itself is large (both in the number of
>>> records/columns, and also in the width of values in the columns), but
>>> still completes in under a minute on the server.
>>>
>>> However, if you then use a COPY with it, it will often time out.
>>>
>>> We're using psycopg2 to run the command; the trace we get is something
>>> like:
>>>
>>> Traceback (most recent call last):
>>>   File "foo.py", line 259, in <module>
>>>     jobs[job].run_all()
>>>   File "foo.py", line 127, in run_all
>>>     self.export_to_csv()
>>>   File "foo.py", line 168, in export_to_csv
>>>     cur.copy_expert(self.export_sql_statement, f)
>>> psycopg2.extensions.TransactionRollbackError: canceling statement due to
>>> conflict with recovery
>>> DETAIL:  User was holding shared buffer pin for too long.
>>>
>>> My question is, what are some simple ways we can use to chunk up the
>>> query?
>>>
>>> Could we pull down a list of all the ids (auto-incrementing int), break
>>> this list up, then use a WHERE clause to break it up, running multiple
>>> COPY commands?
>>>
>>> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
>>> how we'd figure out when we reached the end of the result set though
>>> (apart from just counting the results?).
>>>
>>> Or are there other approaches you guys could recommend?
>>>
>>> Cheers,
>>> Victor
Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?
Hi,

Hmm, ok, I'll pass that on to our DBA/operations guys, and see if that helps.

Do these settings still work if you only have a single Postgres instance? (I'll need to check our setup).

So my understanding is that the default is 30 seconds (http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html) - but we're increasing it to 600 seconds, and that should give the COPY command enough time to pull down the data?

As a rough guide, the dumped CSV file is around 600 MB.

Is there any other background you might be able to give on what you think might be happening, or how this might fix it?

And you'd recommend tweaking these values over trying to chunk up the COPY/SELECT, is that right?

I've just realised the LIMIT/ORDER thing may not work well to paginate, since there may be new records, or deleted records, between each time I call it?

Cheers,
Victor

On Fri, Nov 8, 2013 at 1:15 PM, wd wrote:

> Try this,
>
> max_standby_archive_delay = 600s   # max delay before canceling queries
>                                    # when reading WAL from archive;
>                                    # -1 allows indefinite delay
> max_standby_streaming_delay = 600s # max delay before canceling queries
>                                    # when reading streaming WAL;
>                                    # -1 allows indefinite delay
>
> or try
> pg_xlog_replay_pause()
> pg_xlog_replay_resume()
>
> On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi wrote:
>
>> Hi,
>>
>> We're using psycopg2 with COPY to dump CSV output from a large query.
>>
>> The actual SELECT query itself is large (both in the number of
>> records/columns, and also in the width of values in the columns), but
>> still completes in under a minute on the server.
>>
>> However, if you then use a COPY with it, it will often time out.
>>
>> We're using psycopg2 to run the command; the trace we get is something
>> like:
>>
>> Traceback (most recent call last):
>>   File "foo.py", line 259, in <module>
>>     jobs[job].run_all()
>>   File "foo.py", line 127, in run_all
>>     self.export_to_csv()
>>   File "foo.py", line 168, in export_to_csv
>>     cur.copy_expert(self.export_sql_statement, f)
>> psycopg2.extensions.TransactionRollbackError: canceling statement due to
>> conflict with recovery
>> DETAIL:  User was holding shared buffer pin for too long.
>>
>> My question is, what are some simple ways we can use to chunk up the
>> query?
>>
>> Could we pull down a list of all the ids (auto-incrementing int), break
>> this list up, then use a WHERE clause to break it up, running multiple
>> COPY commands?
>>
>> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
>> how we'd figure out when we reached the end of the result set though
>> (apart from just counting the results?).
>>
>> Or are there other approaches you guys could recommend?
>>
>> Cheers,
>> Victor
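On Victor's worry about rows appearing or disappearing between LIMIT/OFFSET calls: one standard remedy (not proposed in this thread, sketched here for completeness) is to run all the pages inside a single REPEATABLE READ transaction, so every page sees the same snapshot. Note that on a hot standby this recreates the long-transaction problem being discussed. All names (`dsn`, `base_query`, page size) are illustrative:

```python
def page_query(base_query, limit, offset):
    """Append LIMIT/OFFSET to a query that already has a deterministic ORDER BY."""
    return f"{base_query} LIMIT {limit} OFFSET {offset}"

def fetch_all_pages(dsn, base_query, page_size=10_000):
    # One REPEATABLE READ transaction => every page sees the same snapshot,
    # so concurrent inserts/deletes cannot shift the pagination.
    import psycopg2
    import psycopg2.extensions as ext
    conn = psycopg2.connect(dsn)
    try:
        conn.set_session(isolation_level=ext.ISOLATION_LEVEL_REPEATABLE_READ)
        with conn.cursor() as cur:
            offset = 0
            while True:
                cur.execute(page_query(base_query, page_size, offset))
                rows = cur.fetchall()
                if not rows:
                    break  # past the last page
                yield from rows
                offset += page_size
        conn.commit()
    finally:
        conn.close()
```

A server-side cursor (psycopg2's named cursor) would achieve the same snapshot behaviour without the O(n²) cost that large OFFSETs incur.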
Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?
Try this,

max_standby_archive_delay = 600s   # max delay before canceling queries
                                   # when reading WAL from archive;
                                   # -1 allows indefinite delay
max_standby_streaming_delay = 600s # max delay before canceling queries
                                   # when reading streaming WAL;
                                   # -1 allows indefinite delay

or try
pg_xlog_replay_pause()
pg_xlog_replay_resume()

On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi wrote:

> Hi,
>
> We're using psycopg2 with COPY to dump CSV output from a large query.
>
> The actual SELECT query itself is large (both in the number of
> records/columns, and also in the width of values in the columns), but
> still completes in under a minute on the server.
>
> However, if you then use a COPY with it, it will often time out.
>
> We're using psycopg2 to run the command; the trace we get is something
> like:
>
> Traceback (most recent call last):
>   File "foo.py", line 259, in <module>
>     jobs[job].run_all()
>   File "foo.py", line 127, in run_all
>     self.export_to_csv()
>   File "foo.py", line 168, in export_to_csv
>     cur.copy_expert(self.export_sql_statement, f)
> psycopg2.extensions.TransactionRollbackError: canceling statement due to
> conflict with recovery
> DETAIL:  User was holding shared buffer pin for too long.
>
> My question is, what are some simple ways we can use to chunk up the
> query?
>
> Could we pull down a list of all the ids (auto-incrementing int), break
> this list up, then use a WHERE clause to break it up, running multiple
> COPY commands?
>
> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
> how we'd figure out when we reached the end of the result set though
> (apart from just counting the results?).
>
> Or are there other approaches you guys could recommend?
>
> Cheers,
> Victor
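For reference, the id-list/WHERE-IN hack Victor describes in the quoted message might look something like the sketch below (table and column names are invented; requires psycopg2). Each chunk is a separate COPY, so a chunk interrupted by the network can simply be retried:

```python
def chunked(ids, size):
    """Split a list of ids into consecutive lists of at most `size` elements."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def export_chunks(dsn, out_path, chunk_size=50_000):
    import psycopg2
    with psycopg2.connect(dsn) as conn, open(out_path, "wb") as f:
        with conn.cursor() as cur:
            # Pull the full ordered id list down first (Victor's "quick hack").
            cur.execute("SELECT id FROM my_table ORDER BY id")
            ids = [row[0] for row in cur.fetchall()]
            for chunk in chunked(ids, chunk_size):
                # COPY takes no bind parameters, so interpolate the id array
                # safely with mogrify before wrapping it in COPY.
                inner = cur.mogrify(
                    "SELECT * FROM my_table WHERE id = ANY(%s) ORDER BY id",
                    (chunk,)).decode()
                cur.copy_expert(
                    f"COPY ({inner}) TO STDOUT WITH (FORMAT csv)", f)
```

As Francisco points out later in the thread, stepping over the min/max id range avoids shipping the whole id list to the client at all, and is simpler when the ids are dense.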