On 05/11/2011 05:34 AM, Aren Cambre wrote:
> Using one thread, the app can do about 111 rows per second, and it's
> only exercising 1.5 of 8 CPU cores while doing this. 12,000,000 rows /
> 111 rows per second ~= 30 hours.
>
> I hoped to speed things up with some parallel processing.
>
> When the app is multithreaded, the app itself consumes about 3% CPU time
> waiting for Postgres, which is only hammering 1 core and barely
> exercising disk I/O (per two programs and HDD light).
OK, so before looking at parallelism, you might want to look at why you're
not getting much out of Pg and your app with even one thread. You should be
able to put a high load on the disk - or one CPU core - without needing to
split work out into multiple threads and parallel workers.

I suspect your app is doing lots of tiny single-row queries instead of
efficiently batching things. It'll be wasting huge amounts of time waiting
for results. Even if every query is individually incredibly fast, with the
number of them you seem to be doing you'll lose a LOT of time if you loop
over lots of little SELECTs.

The usual cause of the kind of slow performance you describe is an app that
"chats" with the database continuously, so its pattern is:

  loop:
      ask for row from database using SELECT
      retrieve result
      do a tiny bit of processing
      continue loop

This is incredibly inefficient, because Pg is always waiting for the app to
ask for something, or the app is waiting for Pg to return something. During
each switch there are delays and inefficiencies. It's actually:

  loop:
      ask for a single row from database using SELECT
      [twiddle thumbs while database plans and executes the query]
      retrieve result
      do a tiny bit of processing
      [Pg twiddles its thumbs]
      continue loop

What you want is your app and Pg working at the same time. Assuming that
CPU is the limitation rather than database speed and disk I/O, I'd use
something like this:

  Thread 1:
      get cursor for selecting all rows from database
      loop:
          get 100 rows from cursor
          add rows to processing queue
          if queue contains over 1000 rows:
              wait until queue contains less than 1000 rows

  Thread 2:
      until there are no more rows:
          ask Thread 1 for 100 rows
          for each row:
              do a tiny bit of processing

By using a producer/consumer model like that you can ensure that thread 1
is always talking to the database, keeping Pg busy, and thread 2 is always
working the CPUs. The two threads should share NOTHING except the queue, to
keep the whole thing simple and clean.
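The producer/consumer pattern above can be sketched in Python with the
stdlib threading and queue modules. Note the assumptions: sqlite3 stands in
for Postgres purely so the sketch runs anywhere (with Postgres you'd fetch
batches the same way via a server-side cursor), and the table, column names,
and "processing" step are all invented for illustration.

```python
import queue
import sqlite3
import threading

BATCH = 100       # rows fetched per round trip to the database
QUEUE_MAX = 1000  # backpressure: producer blocks once ~1000 rows are queued
SENTINEL = None   # tells the consumer there are no more rows

def make_demo_db():
    """Build a throwaway in-memory table so the sketch is self-contained."""
    # check_same_thread=False lets the producer thread use this connection;
    # only one thread touches it after setup, so that's safe here.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE rows (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.executemany("INSERT INTO rows (val) VALUES (?)",
                     [(i,) for i in range(1234)])
    conn.commit()
    return conn

def producer(conn, q):
    """Thread 1: keep the database busy, feeding batches into the queue."""
    cur = conn.cursor()
    cur.execute("SELECT id, val FROM rows")
    while True:
        batch = cur.fetchmany(BATCH)  # one round trip per 100 rows
        if not batch:
            break
        q.put(batch)  # blocks while the queue is full (backpressure)
    q.put(SENTINEL)

def consumer(q, results):
    """Thread 2: work on rows while the producer keeps fetching."""
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        for _id, val in batch:
            results.append(val * 2)  # stand-in for the real processing

conn = make_demo_db()
q = queue.Queue(maxsize=QUEUE_MAX // BATCH)  # ~1000 rows buffered, in batches
results = []
t1 = threading.Thread(target=producer, args=(conn, q))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # 1234
```

The bounded queue is what implements the "wait until queue contains less
than 1000 rows" step: q.put() simply blocks until the consumer drains it.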
You must make sure that the "get 100 rows" operation of the producer can
happen even while the producer is in the middle of getting some more rows
from Pg (though not necessarily in the middle of actually appending them to
the queue data structure), so you don't accidentally serialize on access to
the producer thread.

If the single producer thread can't keep up, try reading in bigger batches
or adding more producer threads with a shared queue. If the single consumer
thread can't keep up with the producer, add more consumers to use more CPU
cores.

  [producer 1] [producer 2]  [...]  [producer n]
       |            |                    |
       -------------------------------------
       |               queue               |
       -------------------------------------
       |            |                    |
  [worker 1]   [worker 2]   [...]   [worker n]

... or you can have each worker fetch its own chunks of rows (getting rid
of the producer/consumer split) using its own connection and just have lots
more workers to handle all the wasted idle time. A producer/consumer
approach will probably be faster, though.

If the consumer threads produce a result that must be sent back to the
database, you can either have each thread write it to the database using
its own connection when it's done, or you can have them delegate that work
to another thread that's dedicated to INSERTing the results. If the
INSERTer can't keep up, guess what, you spawn more of them working off a
shared queue.

If the consumer threads require additional information from the database to
do their work, make sure they avoid the:

  loop:
      fetch one row
      do work on row

pattern, instead fetching sets of rows from the database in batches. Use
joins if necessary, or the IN() criterion.

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
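The batched-lookup advice above (fetch sets of rows with IN() rather than
one SELECT per row) can be sketched as follows. Assumptions: sqlite3 stands
in for Postgres so the example is self-contained, and the "users" table and
its columns are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 501)])
conn.commit()

def fetch_names_one_by_one(ids):
    """The slow 'chatty' pattern: one round trip per row."""
    out = {}
    for i in ids:
        row = conn.execute("SELECT name FROM users WHERE id = ?",
                           (i,)).fetchone()
        out[i] = row[0]
    return out

def fetch_names_batched(ids):
    """One round trip for the whole batch, using an IN() list."""
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", ids)
    return dict(rows.fetchall())

ids = list(range(1, 101))
print(len(fetch_names_batched(ids)))  # 100
```

Both functions return the same mapping; the batched version just does it in
one query instead of a hundred, which is exactly the round-trip saving being
described.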