On 11/28/2014 05:53 AM, Albert-Jan Roskam wrote:


----- Original Message -----

From: Dave Angel <[email protected]>
To: [email protected]
Cc:
Sent: Thursday, November 27, 2014 11:55 PM
Subject: Re: [Tutor] multiprocessing question

On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:

  I made a comparison between multiprocessing and threading.  In the code
below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is more
than 100 (yes: one hundred) times slower than threading! That is
I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
something wrong? I can't believe the difference is so big.



The bulk of the time is spent marshalling the data to the dictionary
self.lookup.  You can speed it up some by using a list there (it also
makes the code much simpler).  But the real trick is to communicate less
often between the processes.

      def mp_create_lookup(self):
          # Collect offsets in a plain local list and push them to the
          # shared self.lookup in chunks, so the processes talk less often.
          local_lookup = []
          record_start = 0
          for line in self.data:
              if not line:
                  break
              local_lookup.append(record_start)
              if len(local_lookup) > 100:
                  self.lookup.extend(local_lookup)
                  local_lookup = []
              record_start += len(line)
          self.lookup.extend(local_lookup)   # flush the final partial chunk

It's faster because it passes one larger list across the process boundary
for every 100 records, instead of a single value for every record.

Note that the return statement wasn't ever needed, and you don't need a
lino variable.  Just use append.

I still have to emphasize that record_start is just wrong: summing len(line)
counts characters after newline translation, not bytes on disk. You must use
tell() if you're planning to use seek() on a file opened in text mode.
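To make that concrete, here's a minimal sketch (not the code from the thread;
build_index and read_record are made-up names) of recording offsets with tell()
and jumping back with seek(), which is the only combination guaranteed to work
on a text-mode file:

    def build_index(filename):
        """Return a list of byte offsets, one per line, using tell()."""
        offsets = []
        with open(filename) as f:
            while True:
                pos = f.tell()          # opaque, but valid to seek() to later
                if not f.readline():
                    break
                offsets.append(pos)
        return offsets

    def read_record(filename, offsets, lineno):
        """Jump straight to line number `lineno` using the saved offset."""
        with open(filename) as f:
            f.seek(offsets[lineno])
            return f.readline()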

You can also probably speed the process up a good deal by passing the
filename to the other process, rather than opening the file in the
original process. That will eliminate sharing self.data across the
process boundary.
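A hedged sketch of that idea (index_worker, data.csv and the Queue hand-off are
illustrative, not the code from the thread): only the filename goes in, and only
the finished offset list comes back:

    import multiprocessing as mp

    def index_worker(filename, queue):
        # The worker opens the file itself, so the file contents never
        # cross the process boundary; only the finished list does.
        offsets = []
        with open(filename) as f:
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                offsets.append(pos)
        queue.put(offsets)

    if __name__ == "__main__":
        q = mp.Queue()
        p = mp.Process(target=index_worker, args=("data.csv", q))
        p.start()
        lookup = q.get()     # fetch the result before join() to avoid blocking
        p.join()
        print(len(lookup), "records indexed")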


Hi Dave,

Thanks. I followed your advice and this indeed makes a huuuge difference. 
Multiprocessing is now just 3 times slower than threading.

And I'd bet you could close most of that gap by opening the file in the subprocess instead of marshalling the file I/O across the boundary.

Even so, threading is still the way to go (also because of the added complexity 
of the mp_create_lookup function).

Threading/mp aside: I agree that a dict is not the right choice. I consider a 
dict like a mix between a Ferrari and a Mack truck: fast, but bulky. Would it 
make sense to use array.array instead of a list?

Sure. The first trick for performance is to pick a structure that's just complex enough to solve your problem. Since your keys are sequential integers, a list makes more sense than a dict. If all the offsets you store are 4 GiB or less, then an array.array of 32-bit unsigned integers makes sense. But each time you make such a simplification, you are usually adding an assumption.
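For example (just a sketch, assuming every offset fits in an unsigned 32-bit
value, i.e. the file is under 4 GiB):

    from array import array

    offsets = array("I", [0, 42, 97, 150])   # "I" = unsigned int, typically 4 bytes
    offsets.append(231)
    print(offsets[2])          # indexed access, just like a list
    print(offsets.itemsize)    # bytes per entry; far less than a list of int objects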

I've been treating this as an academic exercise, to help expose some of the tradeoffs. But as you've already pointed out, the real reason to use threads is to simplify the code. The fact that it's faster is just gravy. The main downside to threads is that it's way too easy to accidentally use a global and not realize how the threads are interacting.

Optimizing is fun:

So are these csv files pretty stable? If so, you could prepare an index file for each one, and only recalculate it when the timestamp changes. That index could be anything you like, and it could be fixed-length binary data, so random access in it is trivial.
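Something along these lines, say (load_or_build_index and the ".idx" suffix are
made up for illustration; 8-byte offsets keep the index fixed-width):

    import os
    from array import array

    def load_or_build_index(csv_path, idx_path=None):
        idx_path = idx_path or csv_path + ".idx"
        # Reuse the cached index unless the csv changed since it was written.
        if (os.path.exists(idx_path)
                and os.path.getmtime(idx_path) >= os.path.getmtime(csv_path)):
            offsets = array("Q")          # fixed-width 8-byte unsigned offsets
            with open(idx_path, "rb") as f:
                offsets.frombytes(f.read())
            return offsets
        offsets = array("Q")
        with open(csv_path) as f:
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                offsets.append(pos)
        with open(idx_path, "wb") as f:
            f.write(offsets.tobytes())
        return offsets

Because every entry is 8 bytes, record n's offset sits at byte n * 8 in the
index file, so you could even seek within the index instead of loading it all.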

Are the individual lines always less than 255 bytes? If so, you could store the offset of every 100th line in a much smaller array.array, and record the individual line lengths in a bytearray. You've saved another factor of 4.
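A sketch of that two-level layout, under the stated assumption that no line
exceeds 255 bytes (build_compact_index and offset_of are illustrative names):

    from array import array

    def build_compact_index(filename, stride=100):
        anchors = array("Q")     # full offset of every 100th line
        lengths = bytearray()    # one byte per line: its length on disk
        with open(filename, "rb") as f:
            lineno = 0
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if lineno % stride == 0:
                    anchors.append(pos)
                lengths.append(len(line))   # ValueError if a line is >= 256 bytes
                lineno += 1
        return anchors, lengths

    def offset_of(anchors, lengths, lineno, stride=100):
        # Start at the nearest anchor and walk the per-line lengths from there.
        base = (lineno // stride) * stride
        return anchors[lineno // stride] + sum(lengths[base:lineno])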

--
DaveA
_______________________________________________
Tutor maillist  -  [email protected]
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
