On 11/28/2014 05:53 AM, Albert-Jan Roskam wrote:
----- Original Message -----
From: Dave Angel <[email protected]>
To: [email protected]
Cc:
Sent: Thursday, November 27, 2014 11:55 PM
Subject: Re: [Tutor] multiprocessing question
On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
I made a comparison between multiprocessing and threading. In the code
below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more
than 100 (yes: one hundred) times slower than threading! That is
I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
something wrong? I can't believe the difference is so big.
The bulk of the time is spent marshalling the data to the dictionary
self.lookup. You can speed it up some by using a list there (it also
makes the code much simpler). But the real trick is to communicate less
often between the processes.
def mp_create_lookup(self):
    local_lookup = []
    record_start = 0
    for line in self.data:
        if not line:
            break
        local_lookup.append(record_start)
        if len(local_lookup) > 100:
            # flush the batch across the process boundary in one call
            self.lookup.extend(local_lookup)
            local_lookup = []
        record_start += len(line)
    print(len(local_lookup))  # debug: size of the final partial batch
    self.lookup.extend(local_lookup)
It's faster because it passes a larger list across the boundary every
100 records, instead of a single value every record.
Note that the return statement wasn't ever needed, and you don't need a
lino variable. Just use append.
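For concreteness, here's a minimal sketch of how that batched hand-off could be
wired up, assuming self.lookup is a multiprocessing.Manager().list() proxy (the
names and toy data below are invented, and it still uses the record_start
arithmetic that the next point complains about):

import multiprocessing as mp

def create_lookup(data, lookup):
    # Same batching idea: one .extend() call per batch of ~100 records
    # instead of one proxy call per record.
    local, record_start = [], 0
    for line in data:
        if not line:
            break
        local.append(record_start)
        if len(local) > 100:
            lookup.extend(local)
            local = []
        record_start += len(line)
    lookup.extend(local)

if __name__ == "__main__":
    manager = mp.Manager()
    lookup = manager.list()            # shared proxy standing in for self.lookup
    data = ["abc\n", "defgh\n", ""]    # toy stand-in for the file contents
    worker = mp.Process(target=create_lookup, args=(data, lookup))
    worker.start()
    worker.join()
    print(list(lookup))                # -> [0, 4]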
I still have to emphasize that accumulating record_start by hand is just
wrong. You must use tell() if you're planning to use seek() on a text-mode
file, because len(line) doesn't account for newline translation or for
encodings where characters take more than one byte.
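Something like this is what I mean; the function names are mine, but the point
is that tell() hands you the only offsets seek() is guaranteed to honor on a
text-mode file:

def build_index(path):
    # Collect the offset of each line with tell(); these opaque values are
    # the only ones seek() is guaranteed to accept for a text-mode file.
    offsets = []
    with open(path, "r", newline="") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets

def fetch_line(path, offsets, lineno):
    # Random access: jump straight to the recorded offset and read one line.
    with open(path, "r", newline="") as f:
        f.seek(offsets[lineno])
        return f.readline()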
You can also probably speed the process up a good deal by passing the
filename to the other process, rather than opening the file in the
original process. That will eliminate sharing self.data across the
process boundary.
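Roughly like this (the file name and the Queue plumbing are invented; the real
split depends on how your class is organized):

import multiprocessing as mp

def index_worker(path, queue):
    # The child opens the file itself: only the path goes in, and only the
    # finished offset list comes back out.
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    queue.put(offsets)

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=index_worker, args=("records.csv", q))
    p.start()
    lookup = q.get()   # fetch the result before join() so the queue can drain
    p.join()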
Hi Dave,
Thanks. I followed your advice and this indeed makes a huuuge difference.
Multiprocessing is now just 3 times slower than threading.
And I'd bet you could close most of that gap by opening the file in the
subprocess instead of marshalling the file I/O across the boundary.
Even so, threading is still the way to go (also because of the added complexity
of the mp_create_lookup function).
Threading/mp aside: I agree that a dict is not the right choice. I consider a
dict a mix between a Ferrari and a Mack truck: fast, but bulky. Would it make
sense to use array.array instead of list?
Sure. The first trick for performance is to pick a structure that's
just complex enough to solve your problem. Since your keys are
sequential integers, a list makes more sense than a dict. If all the offsets
you're storing are 4 gig or less, then an array.array makes sense. But each time you
make such a simplification, you are usually adding an assumption.
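For example, with the offsets packed into unsigned longs (the 'L' typecode
being the assumption about your sizes):

from array import array

# 'L' stores each offset packed into a C unsigned long (at least 32 bits),
# instead of one full Python int object per entry like a list would.
lookup = array("L")
lookup.extend([0, 47, 112, 180])   # made-up offsets
print(lookup[2])                   # indexed access just like a list -> 112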
I've been treating this as an academic exercise, to help expose some of
the tradeoffs. But as you've already pointed out, the real reason to
use threads is to simplify the code. The fact that it's faster is just
gravy. The main downside to threads is that it's way too easy to
accidentally use a global and not realize how the threads are interacting.
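A tiny, contrived illustration of that pitfall:

import threading

counter = 0   # module-level global, silently shared by every thread

def work():
    global counter
    for _ in range(100000):
        counter += 1   # read-modify-write is not atomic; updates can be lost

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # often less than 400000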
Optimizing is fun:
So are these csv files pretty stable? If so, you could prepare an index
file for each one, and only recalculate it if the timestamp changes. That
index could be anything you like, and it could be fixed length binary
data, so random access in it is trivial.
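Roughly like this (the paths and the packed binary format are just one way to
do it):

import os
from array import array

def load_or_build_index(csv_path, idx_path):
    # Reuse the saved index unless the csv has been modified since it was built.
    if (os.path.exists(idx_path)
            and os.path.getmtime(idx_path) >= os.path.getmtime(csv_path)):
        offsets = array("L")
        with open(idx_path, "rb") as f:
            offsets.frombytes(f.read())
        return offsets
    offsets = array("L")
    with open(csv_path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    with open(idx_path, "wb") as f:
        offsets.tofile(f)   # fixed-size elements, so the index file itself is random-access
    return offsets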
Are the individual lines always less than 255 bytes? If so, you could
index only every 100th line in a smaller array.array, and store the individual
line sizes in a bytearray. That saves roughly another factor of 4 in memory.
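Sketched out, under that under-255-bytes assumption (names are mine):

from array import array

def build_two_level_index(path):
    # One offset per 100th line, plus one byte per line for its length.
    # bytearray.append() raises if a line is longer than 255 bytes, which
    # enforces the assumption.
    anchors, line_sizes, pos = array("L"), bytearray(), 0
    with open(path, "rb") as f:
        for i, line in enumerate(f):
            if i % 100 == 0:
                anchors.append(pos)
            line_sizes.append(len(line))
            pos += len(line)
    return anchors, line_sizes

def offset_of(anchors, line_sizes, lineno):
    # Nearest anchor, plus the lengths of the lines between it and lineno.
    anchor = lineno // 100
    return anchors[anchor] + sum(line_sizes[anchor * 100:lineno])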
--
DaveA
_______________________________________________
Tutor maillist - [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor