----- Original Message -----
> From: Dave Angel <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, November 27, 2014 11:55 PM
> Subject: Re: [Tutor] multiprocessing question
>
> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>>
>> I made a comparison between multiprocessing and threading. In the code
>> below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing
>> is more than 100 (yes: one hundred) times slower than threading! That
>> is I-must-be-doing-something-wrong-ishly slow. Any idea whether I am
>> doing something wrong? I can't believe the difference is so big.
>
> The bulk of the time is spent marshalling the data to the dictionary
> self.lookup. You can speed it up some by using a list there (it also
> makes the code much simpler). But the real trick is to communicate
> less often between the processes.
>
>     def mp_create_lookup(self):
>         local_lookup = []
>         lino, record_start = 0, 0
>         for line in self.data:
>             if not line:
>                 break
>             local_lookup.append(record_start)
>             if len(local_lookup) > 100:
>                 self.lookup.extend(local_lookup)
>                 local_lookup = []
>             record_start += len(line)
>         print(len(local_lookup))
>         self.lookup.extend(local_lookup)
>
> It's faster because it passes a larger list across the boundary every
> 100 records, instead of a single value every record.
>
> Note that the return statement wasn't ever needed, and you don't need
> a lino variable. Just use append.
>
> I still have to emphasize that record_start is just wrong. You must
> use ftell() if you're planning to use fseek() on a text file.
>
> You can also probably speed the process up a good deal by passing the
> filename to the other process, rather than opening the file in the
> original process. That will eliminate sharing the self.data across
> the process boundary.

Hi Dave,

Thanks. I followed your advice and this indeed makes a huuuge
difference: multiprocessing is now just 3 times slower than threading.
Even so, threading is still the way to go (also because of the added
complexity of the mp_create_lookup function).

Threading/mp aside: I agree that a dict is not the right choice here. I
consider a dict a mix between a Ferrari and a Mack truck: fast, but
bulky. Would it make sense to use array.array instead of a list? (A
quick sketch of what I mean is in the P.S. below.) I also checked
numpy.array, but numpy.append is very inefficient (it reminded me of
str.__iadd__). This site suggests that array.array could make a huge
difference in terms of RAM use:
http://www.dotnetperls.com/array-python. "The array with 10 million
integers required 43.8 MB of memory. The list version required 710.9
MB." (Note that it is followed by a word of caution.)

Albert-Jan
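P.S. Here is a minimal sketch of the array.array idea, combined with
your ftell()/fseek() point. The filename and the helper name are made
up, and I open the file in binary mode so that tell() and the byte
offsets are unambiguous:

    from array import array

    def build_offset_index(filename):
        """Hypothetical helper: one byte offset per record, stored compactly."""
        offsets = array("q")      # 'q' = signed 64-bit ints, 8 bytes per entry
        with open(filename, "rb") as f:
            pos = f.tell()
            line = f.readline()
            while line:
                offsets.append(pos)
                pos = f.tell()    # trust tell(), not a hand-kept byte count
                line = f.readline()
        return offsets

    offsets = build_offset_index("records.txt")
    with open("records.txt", "rb") as f:
        f.seek(offsets[2])        # jump straight to the third record
        print(f.readline())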
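P.P.S. And a self-contained sketch of your "pass the filename"
suggestion, so that only one pickled array ever crosses the process
boundary. Again, names like index_worker and records.txt are just for
illustration:

    from array import array
    from multiprocessing import Process, Queue

    def index_worker(filename, out_queue):
        # The child opens the file itself: nothing big is sent *to* it.
        offsets = array("q")
        with open(filename, "rb") as f:
            pos = f.tell()
            line = f.readline()
            while line:
                offsets.append(pos)
                pos = f.tell()
                line = f.readline()
        out_queue.put(offsets)    # one transfer back, not one per record

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=index_worker, args=("records.txt", q))
        p.start()
        offsets = q.get()         # get() before join(), or a large result can deadlock
        p.join()
        print(len(offsets), "records indexed")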
