Re: [GENERAL] INSERTing lots of data

2010-06-01 Thread Dimitri Fontaine
Greg Smith  writes:
> Joachim Worringen wrote:
>> my Python application (http://perfbase.tigris.org) repeatedly needs to
>> insert lots of data into an existing, non-empty, potentially large
>> table. Currently, the bottleneck is with the Python application, so I
>> intend to multi-thread it. Each thread should work on a part of the input
>> file.
>
> You are wandering down a path followed by pgloader at one point:
> http://pgloader.projects.postgresql.org/#toc6 and one that I fought with
> briefly as well.  Simple multi-threading can be of minimal help in scaling
> up insert performance here, due to the Python issues involved with the GIL.
> Maybe we can get Dimitri to chime in here; he did more of this than I did.

In my case pgloader is using COPY and not INSERT, which would mean that
while one Python thread is blocked on network IO the others have a
chance of using the CPU. That should be a case where the GIL works
OK. My tests show that it's not.
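
A minimal sketch of the kind of test described, assuming psycopg2 as the
driver; the DSN, table name and chunk files are hypothetical:

    # Each thread opens its own connection and streams one chunk via COPY.
    # While COPY is blocked inside libpq, the GIL should in principle be released.
    import threading
    import psycopg2

    DSN = "dbname=test"                      # hypothetical connection string
    CHUNKS = ["chunk0.csv", "chunk1.csv", "chunk2.csv", "chunk3.csv"]

    def copy_chunk(path):
        conn = psycopg2.connect(DSN)
        try:
            cur = conn.cursor()
            f = open(path)
            cur.copy_from(f, "target_table", sep=",")
            conn.commit()
            f.close()
        finally:
            conn.close()

    threads = [threading.Thread(target=copy_chunk, args=(p,)) for p in CHUNKS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()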

> Two thoughts.  First, build a test performance case assuming it will fail to
> scale upwards, looking for problems.  If you get lucky, great, but don't
> assume this will work--it has proven more difficult for others in the past
> than is obvious.
>
> Second, if you do end up being throttled by the GIL, you can probably build
> a solution for Python 2.6/3.0 using the multiprocessing module for your use
> case:  http://docs.python.org/library/multiprocessing.html

My plan was to go with http://docs.python.org/library/subprocess.html
but it seems multiprocessing is easier to use when you want to port
existing threaded code.
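
As a sketch of what that port can look like: the multiprocessing API mirrors
threading closely, so a Thread-based worker can often be swapped for a
Process. The worker function and file names below are hypothetical.

    import multiprocessing
    import threading

    def load_chunk(path):
        # Hypothetical worker; in real code it would open a connection
        # and load its part of the input file, as in the threaded version.
        pass

    paths = ["chunk0.csv", "chunk1.csv"]

    # threaded version
    workers = [threading.Thread(target=load_chunk, args=(p,)) for p in paths]
    # process-based version: same calls, but one interpreter (and GIL) per worker
    workers = [multiprocessing.Process(target=load_chunk, args=(p,)) for p in paths]

    for w in workers:
        w.start()
    for w in workers:
        w.join()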

Thanks Greg!
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [GENERAL] INSERTing lots of data

2010-06-01 Thread Joachim Worringen

On 06/01/2010 05:45 AM, Greg Smith wrote:

Two thoughts. First, build a test performance case assuming it will fail
to scale upwards, looking for problems. If you get lucky, great, but
don't assume this will work--it has proven more difficult for others in the
past than is obvious.

Second, if you do end up being throttled by the GIL, you can probably
build a solution for Python 2.6/3.0 using the multiprocessing module for
your use case: http://docs.python.org/library/multiprocessing.html


Thanks, Greg - multiprocessing looks very usable for my application, 
and much nicer than using fork() and pipes myself...


 Joachim




Re: [GENERAL] INSERTing lots of data

2010-05-31 Thread Greg Smith

Joachim Worringen wrote:
my Python application (http://perfbase.tigris.org) repeatedly needs to 
insert lots of data into an existing, non-empty, potentially large 
table. Currently, the bottleneck is with the Python application, so I 
intend to multi-thread it. Each thread should work on a part of the 
input file.


You are wandering down a path followed by pgloader at one point:  
http://pgloader.projects.postgresql.org/#toc6 and one that I fought with 
briefly as well.  Simple multi-threading can be of minimal help in 
scaling up insert performance here, due to the Python issues involved 
with the GIL.  Maybe we can get Dimitri to chime in here; he did more of 
this than I did.


Two thoughts.  First, build a test performance case assuming it will 
fail to scale upwards, looking for problems.  If you get lucky, great, 
but don't assume this will work--it has proven more difficult for others 
in the past than is obvious.


Second, if you do end up being throttled by the GIL, you can probably 
build a solution for Python 2.6/3.0 using the multiprocessing module for 
your use case:  http://docs.python.org/library/multiprocessing.html
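
One way such a solution might be structured, as a rough sketch only: split the
input file into chunks and hand each chunk to a worker process holding its own
database connection. psycopg2, the DSN, the table and the INSERT statement are
assumptions for illustration, not part of the suggestion above.

    import multiprocessing
    import psycopg2

    DSN = "dbname=test"

    def insert_lines(lines):
        # Each pool worker is a separate process with its own GIL and connection.
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        for line in lines:
            cur.execute("INSERT INTO target_table (val) VALUES (%s)",
                        (line.rstrip("\n"),))
        conn.commit()
        conn.close()

    def chunk(seq, n):
        # Split a list into n roughly equal pieces.
        size = (len(seq) + n - 1) // n
        return [seq[i:i + size] for i in range(0, len(seq), size)]

    if __name__ == "__main__":
        lines = open("input.dat").readlines()
        pool = multiprocessing.Pool(processes=4)
        pool.map(insert_lines, chunk(lines, 4))
        pool.close()
        pool.join()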


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Craig Ringer

On 28/05/2010 9:17 PM, Joachim Worringen wrote:


In the "import data" case, however, I really need concurrent processing
on the CPU in the first place, so you may be right on this one. I'll
check it.


If you run into GIL problems, you can always work around it by spawning 
multiple processes and communicating between them via pipes, signals, 
shm, etc. That way you get one GIL per Python instance.


BTW, if you need decent CPU performance with Python, make sure you're 
using the Psyco specializing compiler module for Python. Activating it 
requires literally four lines at the start of your main module and the 
speed improvement is mind blowing.
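
For reference, the usual activation is a short guarded import at the top of
the main module; this is just a sketch, and the try/except only keeps the
program running on interpreters where Psyco isn't installed:

    try:
        import psyco
        psyco.full()        # let Psyco compile everything it can handle
    except ImportError:
        pass                # no Psyco available, run unmodified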


--
Craig Ringer



Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Joachim Worringen

On 05/28/2010 02:55 PM, Craig Ringer wrote:

On 28/05/10 17:41, Joachim Worringen wrote:

Greetings,

my Python application (http://perfbase.tigris.org) repeatedly needs to
insert lots of data into an existing, non-empty, potentially large table.
Currently, the bottleneck is with the Python application, so I intend to
multi-thread it.


That may not be a great idea. For why, search for "Global Interpreter
Lock" (GIL).

It might help if Python's mostly blocked on network I/O, as the GIL is
released when Python blocks on the network, but still, your results may
not be great.


I verified that the threads actually execute queries concurrently. That 
implies that they are blocked on I/O while a query is running, and 
that query performance does in fact scale for this reason.
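
A sketch of that query-side pattern, assuming psycopg2 as the driver (which
releases the GIL around blocking libpq calls); the DSN and query are
hypothetical:

    import threading
    import psycopg2

    DSN = "dbname=test"

    def run_query():
        conn = psycopg2.connect(DSN)     # one connection per thread
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM target_table")
        result = cur.fetchone()          # threads overlap while waiting on the server
        conn.close()
        return result

    threads = [threading.Thread(target=run_query) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()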


In the "import data" case, however, I really need concurrent processing 
on the CPU in the first place, so you may be right on this one. I'll 
check it.



will I get a speedup? Or will table-locking serialize things on the
server side?


Concurrent inserts work *great* with PostgreSQL, it's Python I'd be
worried about.


That's the part of the answer I wanted to hear...

 thanks, Joachim





Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Craig Ringer

On 28/05/10 17:41, Joachim Worringen wrote:

Greetings,

my Python application (http://perfbase.tigris.org) repeatedly needs to
insert lots of data into an existing, non-empty, potentially large table.
Currently, the bottleneck is with the Python application, so I intend to
multi-thread it.


That may not be a great idea. For why, search for "Global Interpreter 
Lock" (GIL).


It might help if Python's mostly blocked on network I/O, as the GIL is 
released when Python blocks on the network, but still, your results may 
not be great.



will I get a speedup? Or will table-locking serialize things on the
server side?


Concurrent inserts work *great* with PostgreSQL, it's Python I'd be 
worried about.



--
Craig Ringer



Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Alban Hertroys
On 28 May 2010, at 12:14, Martin Gainty wrote:

> Good Afternoon Szymon!
>  
> Could you explain what a Python GIL is? and if there is any workaround to 
> Python GIL we can implement to achieve better performance..possibly at the 
> database level?

See here: http://en.wikipedia.org/wiki/Global_Interpreter_Lock

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.







Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Martin Gainty

Good Afternoon Szymon!

 

Could you explain what a Python GIL is? and if there is any workaround to 
Python GIL we can implement to achieve better performance..possibly at the 
database level?


Best regards, 
Martin Gainty 



 



Date: Fri, 28 May 2010 11:48:16 +0200
Subject: Re: [GENERAL] INSERTing lots of data
From: mabew...@gmail.com
To: joachim.worrin...@iathh.de
CC: pgsql-general@postgresql.org


2010/5/28 Joachim Worringen 

Greetings,

my Python application (http://perfbase.tigris.org) repeatedly needs to insert 
lots of data into an existing, non-empty, potentially large table. Currently, 
the bottleneck is with the Python application, so I intend to multi-thread it. 
Each thread should work on a part of the input file.

I already multi-threaded the query part of the application, which requires 
using one connection per thread - cursors are serialized via a single connection.

Provided that
- the threads use their own connection
- the threads perform all INSERTs within a single transaction
- the machine has enough resources

 will I get a speedup? Or will table-locking serialize things on the server 
side?

Suggestions for alternatives are welcome, but the data must go through the 
Python application via INSERTs (no bulk insert, COPY etc. possible)




Remember Python's GIL: in some Python implementations those threads 
could be serialized at the Python level.


It is possible that those inserts will be faster. The speed depends on the 
table structure, constraints and triggers, and even the database configuration. 
The best answer is: just check it with some test code - write a simple 
multithreaded application, do the inserts, and check that out.




regards
Szymon Guz

  

Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Joachim Worringen

On 05/28/2010 11:48 AM, Szymon Guz wrote:

Remember Python's GIL: in some Python implementations those
threads could be serialized at the Python level.


My multi-threaded queries scale nicely with Python 2.6 on Linux, so this 
is not an issue here. But the queries do not perform concurrent write 
accesses on the same table.



It is possible that those inserts will be faster. The speed depends on
the table structure, constraints and triggers, and even the database
configuration. The best answer is: just check it with some test code - write
a simple multithreaded application, do the inserts, and check that out.


Sure, testing always shows something, but I wonder if something general 
can be said about the execution of concurrent write transactions on the 
same table (no triggers, some non-NULL constraints, one index).


http://www.postgresql.org/docs/8.4/interactive/mvcc-intro.html says 
about MVCC:

"
The main advantage of using the MVCC model of concurrency control rather 
than locking is that in MVCC locks acquired for querying (reading) data 
do not conflict with locks acquired for writing data, and so reading 
never blocks writing and writing never blocks reading.

"

It does not mention whether writing may block writing, or if it always 
does. 
http://bytes.com/topic/python/answers/728130-parallel-insert-postgresql-thread 
indicates it should not block - can this be confirmed by some PostgreSQL 
guru?


thanks, Joachim





Re: [GENERAL] INSERTing lots of data

2010-05-28 Thread Szymon Guz
2010/5/28 Joachim Worringen 

> Greetings,
>
> my Python application (http://perfbase.tigris.org) repeatedly needs to
> insert lots of data into an existing, non-empty, potentially large table.
> Currently, the bottleneck is with the Python application, so I intend to
> multi-thread it. Each thread should work on a part of the input file.
>
> I already multi-threaded the query part of the application, which requires
> using one connection per thread - cursors are serialized via a single
> connection.
>
> Provided that
> - the threads use their own connection
> - the threads perform all INSERTs within a single transaction
> - the machine has enough resources
>
>  will I get a speedup? Or will table-locking serialize things on the server
> side?
>
> Suggestions for alternatives are welcome, but the data must go through the
> Python application via INSERTs (no bulk insert, COPY etc. possible)
>
>
Remember Python's GIL: in some Python implementations those threads
could be serialized at the Python level.

It is possible that those inserts will be faster. The speed depends on the
table structure, constraints and triggers, and even the database
configuration. The best answer is: just check it with some test code - write a
simple multithreaded application, do the inserts, and check that out.
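
A minimal sketch of such a test, assuming psycopg2; the DSN, table and row
counts are made up. It simply times the same insert workload at different
thread counts:

    import threading
    import time
    import psycopg2

    DSN = "dbname=test"
    ROWS_PER_THREAD = 100000

    def insert_rows(thread_id):
        conn = psycopg2.connect(DSN)     # one connection per thread
        cur = conn.cursor()
        for i in range(ROWS_PER_THREAD):
            cur.execute("INSERT INTO target_table (tid, val) VALUES (%s, %s)",
                        (thread_id, i))
        conn.commit()                    # one transaction per thread
        conn.close()

    for nthreads in (1, 2, 4):
        start = time.time()
        threads = [threading.Thread(target=insert_rows, args=(n,))
                   for n in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("%d thread(s): %.1f s" % (nthreads, time.time() - start))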


regards
Szymon Guz


[GENERAL] INSERTing lots of data

2010-05-28 Thread Joachim Worringen

Greetings,

my Python application (http://perfbase.tigris.org) repeatedly needs to 
insert lots of data into an existing, non-empty, potentially large table. 
Currently, the bottleneck is with the Python application, so I intend to 
multi-thread it. Each thread should work on a part of the input file.


I already multi-threaded the query part of the application, which 
requires using one connection per thread - cursors are serialized via a 
single connection.


Provided that
- the threads use their own connection
- the threads perform all INSERTs within a single transaction
- the machine has enough resources

 will I get a speedup? Or will table-locking serialize things on the 
server side?


Suggestions for alternatives are welcome, but the data must go through 
the Python application via INSERTs (no bulk insert, COPY etc. possible)
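
A minimal sketch of the setup described above, assuming psycopg2 as the
driver; the DSN, table, and shape of the input lines are hypothetical. Each
thread works on one part of the input file, uses its own connection, and
commits all of its INSERTs as a single transaction:

    import threading
    import psycopg2

    DSN = "dbname=test"
    NUM_THREADS = 4

    def insert_part(lines):
        conn = psycopg2.connect(DSN)          # own connection per thread
        cur = conn.cursor()
        for line in lines:
            cur.execute("INSERT INTO target_table (val) VALUES (%s)",
                        (line.rstrip("\n"),))
        conn.commit()                         # all INSERTs in one transaction
        conn.close()

    lines = open("input.dat").readlines()
    step = (len(lines) + NUM_THREADS - 1) // NUM_THREADS
    parts = [lines[i:i + step] for i in range(0, len(lines), step)]

    threads = [threading.Thread(target=insert_part, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()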


 thanks, Joachim

