Re: [PERFORM] Large insert and delete batches

2012-03-02 Thread Anssi Kääriäinen

On 03/01/2012 10:51 PM, Marti Raudsepp wrote:

> The problem with IN() and ARRAY[] is that the whole list of numbers
> has to be parsed by the SQL syntax parser, which has significant
> memory and CPU overhead (it has to accept arbitrary expressions in the
> list). But there's a shortcut around the parser: you can pass in the
> list as an array literal string, e.g.:
> select * from the_table where id = ANY('{1,2,3,4,5}')

OK, that explains the memory usage.

> The SQL parser considers the value one long string and passes it to
> the array input function, which is a much simpler routine. This should
> scale up much better.
>
> Even better if you could pass in the array as a query parameter, so
> the SQL parser doesn't even see the long string -- but I think you
> have to jump through some hoops to do that in psycopg2.

Luckily there is no need to do any tricks. The question I was trying to
answer was whether there should be some default batch size for inserts
and deletes in Django, and the answer seems clear: the problems appear
only when the batch sizes are enormous, so there doesn't seem to be a
reason to have default limits. Actually, the batch sizes are so large
that it is likely the Python process will OOM before you can trigger
problems in the DB.


 - Anssi



Re: [PERFORM] Large insert and delete batches

2012-03-01 Thread Kääriäinen Anssi
Quoting myself:

> So, is there some common wisdom about the batch sizes? Or is it better
> to do the inserts and deletes in just one batch? I think the case for
> performance problems needs to be strong before default limits are
> considered for PostgreSQL.


I did a little test about this. My test was to see if there is any
interesting difference in performance between doing queries in small
batches vs doing them in one go.

The test setup is simple: one table with an integer primary key containing
a million rows. The queries are
select * from the_table where id = ANY(ARRAY[list_of_numbers])
and the similar delete, too.

For any sane number of items in the list, the result is that doing the
queries in smaller batches might be a little faster, but nothing
conclusive was found. However, once you go into millions of items in the
list, the query will OOM my Postgres server. With a million items in the
list the process uses around 700MB of memory, with two million items
1.4GB, and beyond that it is an OOM condition. The problem seems to be
the array, which takes all the memory. So, you can OOM the server by
doing SELECT ARRAY[large_enough_list_of_numbers].
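
For reference, a minimal sketch of the kind of test described above could
look like the following; it is not the exact script used, and the
connection details and the batch size are only illustrative:

import time
import psycopg2

conn = psycopg2.connect(dbname="test")  # connection details are an assumption
conn.autocommit = True
cur = conn.cursor()

# One table with an integer primary key, filled with a million rows.
cur.execute("DROP TABLE IF EXISTS the_table")
cur.execute("CREATE TABLE the_table (id integer PRIMARY KEY)")
cur.execute("INSERT INTO the_table SELECT generate_series(1, 1000000)")

ids = list(range(1, 1000001))

# One go: psycopg2 expands the Python list into an ARRAY[...] constructor
# in the SQL text, i.e. the id = ANY(ARRAY[list_of_numbers]) form used in
# the test, so the whole list goes through the SQL parser.
start = time.time()
cur.execute("SELECT * FROM the_table WHERE id = ANY(%s)", (ids,))
cur.fetchall()
print("one batch: %.1fs" % (time.time() - start))

# The same work split into smaller batches.
batch_size = 10000  # illustrative value only
start = time.time()
for i in range(0, len(ids), batch_size):
    cur.execute("SELECT * FROM the_table WHERE id = ANY(%s)",
                (ids[i:i + batch_size],))
    cur.fetchall()
print("batches of %d: %.1fs" % (batch_size, time.time() - start))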

Conclusion: as long as you are not doing anything really stupid, it seems
there is no important performance reason to split the bulk queries into
smaller batches.

For inserts the conclusion is similar. A lot of memory is used if you go
into the millions-of-items range, but otherwise it doesn't seem to matter
whether you do many smaller batches or one larger batch.

 - Anssi


Re: [PERFORM] Large insert and delete batches

2012-03-01 Thread Marti Raudsepp
On Thu, Mar 1, 2012 at 21:06, Kääriäinen Anssi anssi.kaariai...@thl.fi wrote:
> The queries are select * from the_table where id = ANY(ARRAY[list_of_numbers])
> and the similar delete, too.
>
> [...] However, once you go into millions of items in the list, the query
> will OOM my Postgres server.

The problem with IN() and ARRAY[] is that the whole list of numbers
has to be parsed by the SQL syntax parser, which has significant
memory and CPU overhead (it has to accept arbitrary expressions in the
list). But there's a shortcut around the parser: you can pass in the
list as an array literal string, e.g.:
select * from the_table where id = ANY('{1,2,3,4,5}')

The SQL parser considers the value one long string and passes it to
the array input function, which is a much simpler routine. This should
scale up much better.

Even better if you could pass in the array as a query parameter, so
the SQL parser doesn't even see the long string -- but I think you
have to jump through some hoops to do that in psycopg2.
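
For illustration, a sketch of the array-literal approach from psycopg2
might look like this (the connection details and the helper function are
made up for the example; the table name is the one used above):

import psycopg2

def as_array_literal(ids):
    # Build a PostgreSQL array literal string such as '{1,2,3,4,5}'.
    return "{" + ",".join(str(int(i)) for i in ids) + "}"

conn = psycopg2.connect(dbname="test")  # assumed connection details
cur = conn.cursor()

# The whole list is passed as a single string parameter and cast to int[],
# so the parser sees one value and hands it to the array input function
# instead of parsing each number as a separate expression.
cur.execute("SELECT * FROM the_table WHERE id = ANY(%s::int[])",
            (as_array_literal(range(1, 1000001)),))
rows = cur.fetchall()

Note that psycopg2 still interpolates the value into the query string on
the client side by default, so this is not yet a true out-of-line query
parameter; getting one of those is the part that needs the extra hoops.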

Regards,
Marti



[PERFORM] Large insert and delete batches

2012-02-29 Thread Anssi Kääriäinen

Hello all,

I am trying to help the Django project by investigating whether there
should be some default batch size limits for insert and delete queries.
This is related to a couple of tickets which deal with SQLite's inability
to deal with more than 1000 parameters in a single query. That backend
needs a limit anyway. It might be possible to implement default limits
for other backends at the same time if that seems necessary.


If I am not mistaken, there are no practical hard limits. So, the
question is whether performance is expected to collapse at some point.


Little can be assumed about the schema or the environment. The inserts
and deletes are going to be done in one transaction. Foreign keys are
indexed and they are DEFERRABLE INITIALLY DEFERRED by default. The
PostgreSQL version can be anything from 8.2 on.


The queries will be of the form:
insert into some_table(col1, col2) values (val1, val2), (val3, val4), ...;

and

delete from some_table where PK in (list_of_pk_values);
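
To make the question concrete, here is a hypothetical sketch of what
client-side batching of those two statements could look like; the
batch_size value, the chunks() helper, and the table/column names are all
made up for the example and are not Django's actual implementation:

import psycopg2

def chunks(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

conn = psycopg2.connect(dbname="test")  # assumed connection details
cur = conn.cursor()

rows_to_insert = [(i, "value %d" % i) for i in range(100000)]
pks_to_delete = list(range(50000))
batch_size = 1000  # e.g. SQLite's parameter limit; the question is whether
                   # PostgreSQL needs any such default at all

# insert into some_table(col1, col2) values (...), (...), ... in batches
for batch in chunks(rows_to_insert, batch_size):
    values = ",".join(cur.mogrify("(%s, %s)", row).decode() for row in batch)
    cur.execute("INSERT INTO some_table (col1, col2) VALUES " + values)

# delete from some_table where PK in (...) in batches
# (psycopg2 adapts a Python tuple to a parenthesized value list)
for batch in chunks(pks_to_delete, batch_size):
    cur.execute("DELETE FROM some_table WHERE pk IN %s", (tuple(batch),))

conn.commit()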

So, is there some common wisdom about the batch sizes? Or is it better 
to do the inserts and deletes in just one batch? I think the case for 
performance problems needs to be strong before default limits are 
considered for PostgreSQL.


The tickets in question are:
https://code.djangoproject.com/ticket/17788 and 
https://code.djangoproject.com/ticket/16426


 - Anssi Kääriäinen
