On 11/28/2014 05:53 AM, Albert-Jan Roskam wrote:


----- Original Message -----

From: Dave Angel <[email protected]>
To: [email protected]
Cc:
Sent: Thursday, November 27, 2014 11:55 PM
Subject: Re: [Tutor] multiprocessing question

On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:

  I made a comparison between multiprocessing and threading.  In the code
below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more
than 100 (yes: one hundred) times slower than threading! That is
I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
something wrong? I can't believe the difference is so big.



The bulk of the time is spent marshalling the data to the dictionary
self.lookup.  You can speed it up some by using a list there (it also
makes the code much simpler).  But the real trick is to communicate less
often between the processes.

      def mp_create_lookup(self):
          # Collect offsets in a plain local list and push them to the
          # shared self.lookup in chunks, so the processes talk less often.
          local_lookup = []
          record_start = 0
          for line in self.data:
              if not line:
                  break
              local_lookup.append(record_start)
              if len(local_lookup) > 100:
                  self.lookup.extend(local_lookup)
                  local_lookup = []
              record_start += len(line)
          self.lookup.extend(local_lookup)   # flush the final partial chunk

It's faster because it passes one larger list across the process boundary
for every 100 records, instead of a single value for every record.

Note that the return statement wasn't ever needed, and you don't need a
lino variable.  Just use append.

I still have to emphasize that record_start is just wrong: summing len(line)
counts characters after newline translation, not bytes on disk. You must use
tell() if you're planning to use seek() on a file opened in text mode.
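To make that concrete, here's a minimal sketch (not the code from the thread;
build_index and read_record are made-up names) of recording offsets with tell()
and jumping back with seek(), which is the only combination guaranteed to work
on a text-mode file:

    def build_index(filename):
        """Return a list of byte offsets, one per line, using tell()."""
        offsets = []
        with open(filename) as f:
            while True:
                pos = f.tell()          # opaque, but valid to seek() to later
                if not f.readline():
                    break
                offsets.append(pos)
        return offsets

    def read_record(filename, offsets, lineno):
        """Jump straight to line number `lineno` using the saved offset."""
        with open(filename) as f:
            f.seek(offsets[lineno])
            return f.readline()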

You can also probably speed the process up a good deal by passing the
filename to the other process, rather than opening the file in the
original process. That will eliminate sharing self.data across the
process boundary.
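A hedged sketch of that idea (index_worker, data.csv and the Queue hand-off are
illustrative, not the code from the thread): only the filename goes in, and only
the finished offset list comes back:

    import multiprocessing as mp

    def index_worker(filename, queue):
        # The worker opens the file itself, so the file contents never
        # cross the process boundary; only the finished list does.
        offsets = []
        with open(filename) as f:
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                offsets.append(pos)
        queue.put(offsets)

    if __name__ == "__main__":
        q = mp.Queue()
        p = mp.Process(target=index_worker, args=("data.csv", q))
        p.start()
        lookup = q.get()     # fetch the result before join() to avoid blocking
        p.join()
        print(len(lookup), "records indexed")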


Hi Dave,

Thanks. I followed your advice and this indeed makes a huuuge difference. 
Multiprocessing is now just 3 times slower than threading.

And I'd bet you could close most of that gap by opening the file in the subprocess instead of marshalling the file I/O across the boundary.

Even so, threading is still the way to go (also because of the added complexity 
of the mp_create_lookup function).

Threading/mp aside: I agree that a dict is not the right choice. I consider a 
dict like a mix between a Ferrari and a Mack truck: fast, but bulky. Would it 
make sense to use array.array instead of a list?

Sure. The first trick for performance is to pick a structure that's just complex enough to solve your problem. Since your keys are sequential integers, a list makes more sense than a dict. If all the offsets you store are 4 GiB or less, then an array.array of 32-bit unsigned integers makes sense. But each time you make such a simplification, you are usually adding an assumption.
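For example (just a sketch, assuming every offset fits in an unsigned 32-bit
value, i.e. the file is under 4 GiB):

    from array import array

    offsets = array("I", [0, 42, 97, 150])   # "I" = unsigned int, typically 4 bytes
    offsets.append(231)
    print(offsets[2])          # indexed access, just like a list
    print(offsets.itemsize)    # bytes per entry; far less than a list of int objects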

I've been treating this as an academic exercise, to help expose some of the tradeoffs. But as you've already pointed out, the real reason to use threads is to simplify the code. The fact that it's faster is just gravy. The main downside to threads is that it's way too easy to accidentally use a global and not realize how the threads are interacting.

Optimizing is fun:

So are these csv files pretty stable? If so, you could prepare an index file for each one, and only recalculate it when the timestamp changes. That index could be anything you like, and it could be fixed-length binary data, so random access in it is trivial.
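Something along these lines, say (load_or_build_index and the ".idx" suffix are
made up for illustration; 8-byte offsets keep the index fixed-width):

    import os
    from array import array

    def load_or_build_index(csv_path, idx_path=None):
        idx_path = idx_path or csv_path + ".idx"
        # Reuse the cached index unless the csv changed since it was written.
        if (os.path.exists(idx_path)
                and os.path.getmtime(idx_path) >= os.path.getmtime(csv_path)):
            offsets = array("Q")          # fixed-width 8-byte unsigned offsets
            with open(idx_path, "rb") as f:
                offsets.frombytes(f.read())
            return offsets
        offsets = array("Q")
        with open(csv_path) as f:
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                offsets.append(pos)
        with open(idx_path, "wb") as f:
            f.write(offsets.tobytes())
        return offsets

Because every entry is 8 bytes, record n's offset sits at byte n * 8 in the
index file, so you could even seek within the index instead of loading it all.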

Are the individual lines always less than 255 bytes? If so, you could store the offset of every 100th line in a much smaller array.array, and record the individual line lengths in a bytearray. You've saved another factor of 4.
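A sketch of that two-level layout, under the stated assumption that no line
exceeds 255 bytes (build_compact_index and offset_of are illustrative names):

    from array import array

    def build_compact_index(filename, stride=100):
        anchors = array("Q")     # full offset of every 100th line
        lengths = bytearray()    # one byte per line: its length on disk
        with open(filename, "rb") as f:
            lineno = 0
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if lineno % stride == 0:
                    anchors.append(pos)
                lengths.append(len(line))   # ValueError if a line is >= 256 bytes
                lineno += 1
        return anchors, lengths

    def offset_of(anchors, lengths, lineno, stride=100):
        # Start at the nearest anchor and walk the per-line lengths from there.
        base = (lineno // stride) * stride
        return anchors[lineno // stride] + sum(lengths[base:lineno])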

--
DaveA
_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
