Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-08 Thread Victor Hooi
Hi,

@Francisco - Yeah, the file is around 600 MB currently, uncompressed.

You're right, our internet connection is going to be the limiting factor.

Essentially, the PostgreSQL server is in a datacentre, the server we're
dumping to is in the office.

Running a script on the PostgreSQL server in the datacentre is going to be
tricky (not so much technically, just from a procedures/security point of
view).

Dumping to a spare table seems like an interesting idea - so we'd just
create the table, COPY the results into that table, then use LIMIT/OFFSET to
paginate through it, and drop the table afterwards?
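
As a rough sketch of what I think that would look like with psycopg2 - the
names here (my_table, export_snapshot, the id column) and the connection
string are placeholders, not our real schema, and this would have to run
somewhere writable (e.g. the primary), since a hot standby is read-only:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()

# Snapshot the query results into a spare table.
cur.execute("CREATE TABLE export_snapshot AS SELECT * FROM my_table")
conn.commit()

# Page through the spare table with LIMIT/OFFSET, appending each chunk of
# CSV to the output file; stop once a COPY writes nothing.
chunk_size = 100000
offset = 0
with open("export.csv", "w") as f:
    while True:
        before = f.tell()
        cur.copy_expert(
            "COPY (SELECT * FROM export_snapshot ORDER BY id "
            "LIMIT {} OFFSET {}) TO STDOUT WITH CSV".format(chunk_size, offset),
            f,
        )
        if f.tell() == before:
            break  # nothing was written, so we've paged past the last row
        offset += chunk_size

# Drop the spare table once the export is done.
cur.execute("DROP TABLE export_snapshot")
conn.commit()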

Currently, I'm doing a quick hack where we download an ordered list of the
ids (auto-incrementing integer) into Python, chunk it up into groups of
ids, then use a WHERE IN clause to download each chunk via COPY.
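
Roughly, the hack looks something like this (again, placeholder table/column
names and connection details rather than our real ones):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()

# Pull down the full, ordered list of ids first.
cur.execute("SELECT id FROM my_table ORDER BY id")
ids = [row[0] for row in cur.fetchall()]

chunk_size = 50000
with open("export.csv", "w") as f:
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        # COPY can't take bind parameters, so the ids are interpolated -
        # they're integers we just read back, so that's safe enough here.
        id_list = ",".join(str(x) for x in chunk)
        cur.copy_expert(
            "COPY (SELECT * FROM my_table WHERE id IN ({})) "
            "TO STDOUT WITH CSV".format(id_list),
            f,
        )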

Would dumping to a spare table and paginating be a better approach? Reasons?
(Not challenging it, I just want to understand everything).

Cheers,
Victor


On Fri, Nov 8, 2013 at 6:36 PM, Francisco Olarte wrote:

> On Fri, Nov 8, 2013 at 5:09 AM, Victor Hooi  wrote:
> > They think that it might be limited by the network, and how fast the
> > PostgreSQL server can push the data across the internet. (The Postgres
> > server and the box running the query are connected over the internet).
>
> You previously said you had 600 MB. Over the internet. Is it a very
> fat pipe? Because otherwise the limiting factor is probably not the
> speed at which Postgres can push the results, but the throughput of your
> link.
>
> If, as you stated, you need a single transaction to get a 600 MB
> snapshot, I would recommend dumping it to disk, compressing on the fly
> (you should easily get a four- or five-fold reduction on a CSV file using
> any decent compressor), and then sending the file. If you do not have
> disk space for the dump but can run programs near the server, you can
> still compress on the fly and stream the result. If you have none of
> this but do have space for a spare table, use a SELECT INTO, paginate
> that output, and drop the table afterwards. Or just look at the configs
> and set longer query timeouts - if your app NEEDS two-hour queries, they
> can be enabled. But anyway, holding a long transaction open over the
> internet does not seem like a good idea to me.
>
> Francisco Olarte
>
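
As a minimal sketch of the compress-on-the-fly approach Francisco describes
above (assuming psycopg2 and gzip; the query, table name and file name below
are placeholders):

import gzip
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()

# Stream the COPY output straight into a gzip file ("wt" = text mode), so the
# uncompressed CSV never has to sit on disk in full before being sent.
with gzip.open("export.csv.gz", "wt") as f:
    cur.copy_expert(
        "COPY (SELECT * FROM my_table) TO STDOUT WITH CSV",  # placeholder query
        f,
    )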


Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread Victor Hooi
Hi,

Aha, I spoke to somebody - apparently we've actually got those values
set to 15 minutes currently...

They think that it might be limited by the network, and how fast the
PostgreSQL server can push the data across the internet. (The Postgres
server and the box running the query are connected over the internet).

Cheers,
Victor


On Fri, Nov 8, 2013 at 1:44 PM, Victor Hooi  wrote:

> Hi,
>
> Hmm, ok, I'll pass that onto our DBA/operations guys, and see if that
> helps.
>
> Do these settings still work if you only have a single Postgres instance?
> (I'll need to check our setup).
>
> So my understanding is that the default is 30 seconds (
> http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html)
> - but we're increasing it to 600 seconds, and that should give the COPY
> command enough time to pull down the data?
>
> As a rough guide, the dumped CSV file is around 600 MB.
>
> Is there any other background you might be able to give on what you think
> might be happening, or how this might fix it?
>
> And you'd recommend tweaking these values over trying to chunk up the
> COPY/SELECT, is that right?
>
> I've just realised the LIMIT/ORDER thing may not work well for pagination,
> since there may be new or deleted records between each time I call it?
>
> Cheers,
> Victor
>
>
> On Fri, Nov 8, 2013 at 1:15 PM, wd  wrote:
>
>> Try this,
>>
>> max_standby_archive_delay = 600s    # max delay before canceling queries
>> # when reading WAL from archive;
>> # -1 allows indefinite delay
>> max_standby_streaming_delay = 600s  # max delay before canceling queries
>> # when reading streaming WAL;
>> # -1 allows indefinite delay
>>
>> or try
>> pg_xlog_replay_pause()
>> pg_xlog_replay_resume()
>>
>>
>>
>>
>> On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi wrote:
>>
>>> Hi,
>>>
>>> We're using psycopg2 with COPY to dump CSV output from a large query.
>>>
>>> The actual SELECT query itself is large (both in number of
>>> records/columns, and also in width of values in columns), but still
>>> completes in under a minute on the server.
>>>
>>> However, if you then wrap it in a COPY, it will often time out.
>>>
>>> We're using psycopg2 to run the command; the traceback we get is something
>>> like:
>>>
>>> Traceback (most recent call last):
>>>   File "foo.py", line 259, in <module>
>>> jobs[job].run_all()
>>>   File "foo.py", line 127, in run_all
>>> self.export_to_csv()
>>>   File "foo.py", line 168, in export_to_csv
>>> cur.copy_expert(self.export_sql_statement, f)
>>> psycopg2.extensions.TransactionRollbackError: canceling statement due to
>>> conflict with recovery
>>>  DETAIL:  User was holding shared buffer pin for too long.
>>>
>>> My question is, what are some simple ways we can use to chunk up the
>>> query?
>>>
>>> Could we pull down a list of all the ids (auto-incrementing int), break
>>> that list into chunks, then use a WHERE clause to fetch each chunk with
>>> its own COPY command?
>>>
>>> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
>>> how we'd figure out when we reached the end of the result set though
>>> (apart from just counting the results?).
>>>
>>> Or are there other approaches you guys could recommend?
>>>
>>> Cheers,
>>> Victor
>>>
>>
>>
>


Re: [GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread Victor Hooi
Hi,

Hmm, ok, I'll pass that onto our DBA/operations guys, and see if that helps.

Do these settings still work if you only have a single Postgres instance?
(I'll need to check our setup).

So my understanding is that the default is 30 seconds (
http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html)
- but we're increasing it to 600 seconds, and that should give the COPY
command enough time to pull down the data?

As a rough guide, the dumped CSV file is around 600 MB.

Is there any other background you might be able to give on what you think
might be happening, or how this might fix it?

And you'd recommend tweaking these values over trying to chunk up the
COPY/SELECT, is that right?

I've just realised the LIMIT/ORDER thing may not work well for pagination,
since there may be new or deleted records between each time I call it?
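
One way I could sidestep that - sketched below with placeholder names
(my_table, an integer id column, a made-up connection string) - would be to
run all the chunked COPYs inside a single REPEATABLE READ transaction, so
every chunk sees the same snapshot, and to use keyset pagination (WHERE id >
last_id) rather than OFFSET. On a hot standby that still holds a snapshot
open for the whole run, though, so the max_standby_* settings quoted below
would still matter:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
conn.set_session(isolation_level="REPEATABLE READ", readonly=True)
cur = conn.cursor()

chunk_size = 100000
last_id = 0
with open("export.csv", "w") as f:
    while True:
        # Find the upper id bound of the next chunk (None once we're done).
        cur.execute(
            "SELECT max(id) FROM ("
            "  SELECT id FROM my_table WHERE id > %s ORDER BY id LIMIT %s"
            ") AS chunk",
            (last_id, chunk_size),
        )
        upper = cur.fetchone()[0]
        if upper is None:
            break
        # COPY can't take bind parameters; these are plain integers.
        cur.copy_expert(
            "COPY (SELECT * FROM my_table WHERE id > {} AND id <= {}) "
            "TO STDOUT WITH CSV".format(last_id, upper),
            f,
        )
        last_id = upper
conn.rollback()  # end the read-only transaction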

Cheers,
Victor


On Fri, Nov 8, 2013 at 1:15 PM, wd  wrote:

> Try this,
>
> max_standby_archive_delay = 600s    # max delay before canceling queries
> # when reading WAL from archive;
> # -1 allows indefinite delay
> max_standby_streaming_delay = 600s  # max delay before canceling queries
> # when reading streaming WAL;
> # -1 allows indefinite delay
>
> or try
> pg_xlog_replay_pause()
> pg_xlog_replay_resume()
>
>
>
>
> On Fri, Nov 8, 2013 at 10:06 AM, Victor Hooi  wrote:
>
>> Hi,
>>
>> We're using psycopg2 with COPY to dump CSV output from a large query.
>>
>> The actual SELECT query itself is large (both in number of
>> records/columns, and also in width of values in columns), but still
>> completes in under a minute on the server.
>>
>> However, if you then wrap it in a COPY, it will often time out.
>>
>> We're using psycopg2 to run the command; the traceback we get is something
>> like:
>>
>> Traceback (most recent call last):
>>   File "foo.py", line 259, in <module>
>> jobs[job].run_all()
>>   File "foo.py", line 127, in run_all
>> self.export_to_csv()
>>   File "foo.py", line 168, in export_to_csv
>> cur.copy_expert(self.export_sql_statement, f)
>> psycopg2.extensions.TransactionRollbackError: canceling statement due to
>> conflict with recovery
>>  DETAIL:  User was holding shared buffer pin for too long.
>>
>> My question is, what are some simple ways we can use to chunk up the
>> query?
>>
>> Could we pull down a list of all the ids (auto-incrementing int), break
>> that list into chunks, then use a WHERE clause to fetch each chunk with
>> its own COPY command?
>>
>> Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure
>> how we'd figure out when we reached the end of the result set though
>> (apart from just counting the results?).
>>
>> Or are there other approaches you guys could recommend?
>>
>> Cheers,
>> Victor
>>
>
>


[GENERAL] Breaking up a PostgreSQL COPY command into chunks?

2013-11-07 Thread Victor Hooi
Hi,

We're using psycopg2 with COPY to dump CSV output from a large query.

The actual SELECT query itself is large (both in number of records/columns,
and also in width of values in columns), but still completes in under a
minute on the server.

However, if you then wrap it in a COPY, it will often time out.

We're using psycopg2 to run the command; the traceback we get is something like:

Traceback (most recent call last):
  File "foo.py", line 259, in <module>
jobs[job].run_all()
  File "foo.py", line 127, in run_all
self.export_to_csv()
  File "foo.py", line 168, in export_to_csv
cur.copy_expert(self.export_sql_statement, f)
psycopg2.extensions.TransactionRollbackError: canceling statement due to
conflict with recovery
 DETAIL:  User was holding shared buffer pin for too long.

My question is, what are some simple ways we can use to chunk up the query?

Could we pull down a list of all the ids (auto-incrementing int), break
that list into chunks, then use a WHERE clause to fetch each chunk with its
own COPY command?

Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure how
we'd figure out when we reached the end of the result set though (apart
from just counting the results?).

Or are there other approaches you guys could recommend?

Cheers,
Victor


Re: [GENERAL] Performance of ORDER BY RANDOM to select random rows?

2013-08-11 Thread Victor Hooi
Hi,

@Hubert/@Sergey - Thanks for your response.

Hmm, aha, so the ORDER BY RANDOM behaviour hasn't changed - just to confirm
- this means that Postgres will duplicate the table, add a new column,
generate random numbers for every record, then sort by that new column,
right?

I've just read the above anecdotally on the internet, but I'm curious if
the actual implementation is documented somewhere officially apart from the
source? Running the query through EXPLAIN didn't seem to tell me much
additional information.

@Sergey - Thanks for the tip about using WITH RECURSIVE. I'm actually doing
something similar in my application code in Django - basically take the max
id, then generate a random integer between 0 and max id. However, it is
dependent on how evenly distributed the record IDs are - in our case, if we
delete a large number of records, it might affect things.
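
For reference, that approach is roughly the following - sketched here with
placeholder names (my_table, an integer id column, a made-up connection
string) and plain psycopg2 rather than the ORM. A sparsely populated id range
mostly just means more retries here:

import random
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
cur = conn.cursor()

cur.execute("SELECT max(id) FROM my_table")
max_id = cur.fetchone()[0]

# Keep picking random ids until one actually exists; ids deleted from the
# table just cause extra retries.
row = None
while row is None:
    cur.execute(
        "SELECT * FROM my_table WHERE id = %s",
        (random.randint(1, max_id),),
    )
    row = cur.fetchone()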

Are there any other database backends that do things differently?

(I know that SQL Server suggests using NEWID for this -
http://msdn.microsoft.com/en-us/library/cc441928.aspx).

Cheers,
Victor


On Fri, Aug 9, 2013 at 10:43 AM, Sergey Konoplev  wrote:

> On Wed, Aug 7, 2013 at 7:01 PM, Victor Hooi  wrote:
> > also seems to suggest that using ORDER BY RANDOM() will perform poorly on
> > Postgres.
> >
> > I'm just wondering if this is still the case?
> >
> > I just ran those benchmarks on my system (Postgres 9.2.4), and using
> > ORDER BY RANDOM did not seem substantially slower than generating random
> > integers in Python and picking those out (and handling non-existent rows).
> >
> > Has Postgres's behaviour for ORDER BY RANDOM changed sometime recently?
>
> Unfortunately, it has not. However, there is always a workaround. You
> can get random results fast with a WITH RECURSIVE query.
>
> WITH RECURSIVE r AS (
>     WITH b AS (SELECT min(id), max(id) FROM table1)
>     (
>         SELECT id, min, max, array[]::integer[] AS a, 0 AS n
>         FROM table1, b
>         WHERE id > min + (max - min) * random()
>         LIMIT 1
>     ) UNION ALL (
>         SELECT t.id, min, max, a || t.id, r.n + 1 AS n
>         FROM table1 AS t, r
>         WHERE
>             t.id > min + (max - min) * random() AND
>             t.id <> all(a) AND
>             r.n + 1 < 10
>         LIMIT 1
>     )
> )
> SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;
>
> The general idea is that we get a random value between min(id) and
> max(id) and then take the first row with an id bigger than that value.
> We repeat this until we have 10 such rows, checking each time that the
> id has not already been retrieved.
>
> Of course, the probability of any particular value appearing in the
> result depends on the distribution of id values in the table, but in
> most cases I have faced it works well.
>
> I had an idea to play with pg_stats.histogram_bounds to work around
> the described issue, but it was never critical enough for the tasks I
> was solving.
>
> --
> Kind regards,
> Sergey Konoplev
> PostgreSQL Consultant and DBA
>
> http://www.linkedin.com/in/grayhemp
> +1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
> gray...@gmail.com
>


[GENERAL] Performance of ORDER BY RANDOM to select random rows?

2013-08-07 Thread Victor Hooi
Hi,

I have a Django application where we need to pull random rows out of a
table.

According to the Django documentation:

https://docs.djangoproject.com/en/dev/ref/models/querysets/#order-by

> Note: order_by('?') queries may be expensive and slow, depending on the
> database backend you’re using.


My understanding is that order_by('?') in the Django ORM will generate a
query with ORDER BY RANDOM.

This blog post:

http://www.peterbe.com/plog/getting-random-rows-postgresql-django

also seems to suggest that using ORDER BY RANDOM() will perform poorly on
Postgres.

I'm just wondering if this is still the case?

I just ran those benchmarks on my system (Postgres 9.2.4), and using ORDER
BY RANDOM did not seem substantially slower than generating random integers
in Python and picking those out (and handling non-existent rows).

Has Postgres's behaviour for ORDER BY RANDOM changed sometime recently?

Cheers,
Victor


[GENERAL] Porting from MS Access 2007 to PostgreSQL

2010-11-08 Thread Victor Hooi
Hi,

Disclaimer: I'm not a DBA, nor much of a DB guy, so please excuse any
ignorance in the below.

*1. Background*

We have a MS Access 2003 database that we are using to manage registration
and workshop/accommodation allocation for a conference. The database isn't
particularly complicated (around 20 tables or so), nor is the dataset large
(largest table has around 13,000 records, most of the others have around
5,000 or so records).

The structure is a bit convoluted though (mostly for historical reasons),
and most of the queries we use are quite join-heavy. Some of these seem to
take longer than you'd expect them to, for such a small dataset.

The database design is a bit quirky - there's heavy use of varchars for many
things, stacks of NULLs everywhere, and not really much use of
validation/constraints.

*2. MS Access to MySQL *

Recently, this was ported over from a pure-Access database to an Access
front-end over a MySQL backend on somebody's desktop, mostly to provide
multi-user capabilities. I've been told automated tools were used for this,
so I assume we weren't using too many MySQL-specific features.

*3. MySQL to Postgres*

I recently looked at moving this over to PostgreSQL on a proper server.
Postgres was chosen mainly for convenience, since we already have a Postgres
instance set up there (used for some Django projects).

I tried a MySQL to PostgreSQL conversion using Enterprise DB's Migration
Studio, but hit an issue with two of the tables complaining about CLOBs...

*4. MS Access to Postgres*

Anyhow, somebody else suggested it might be better to just go straight from
the original MS Access database to PostgreSQL.

My first question is, what is the current recommended procedure for this?

I saw this page from 2001:

http://wiki.postgresql.org/wiki/Microsoft_Access_to_PostgreSQL_Conversion

and the tool referenced there appears to lead to a 404 page.

I also saw the tools referenced here:

http://wiki.postgresql.org/wiki/Converting_from_other_Databases_to_PostgreSQL#Microsoft_Access

and most of them appear quite dated, from the MS Access 97 era.

Has anybody had any experience doing a similar port (Access 2007 to
Postgres) recently - what tools did you use, and were there any gotchas you
hit? Or just any general advice at all here?

*5. Postgres Specific Features*

Once the port is done, I assume there'll probably be some work involved to
clean it up and leverage some of Postgres's features. In particular, I'm
hoping to at least get some decent data validation/constraints in.

The issue is we still need to maintain compatibility, where we can, with the
Access frontend.

That, and hopefully clean up some of the queries a bit, and try to figure
out why some forms in Access are taking so long to load.

Any particularly good books here that you'd recommend? I saw some
Postgres-specific books on Amazon, but none seemed to have particularly good
reviews (or they seemed rather simplistic). Recommendations?

Cheers,
Victor