----- Original Message -----
> From: Dave Angel <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, November 27, 2014 11:55 PM
> Subject: Re: [Tutor] multiprocessing question
>
> On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
>>
>> I made a comparison between multiprocessing and threading. In the code
>> below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing
>> is more than 100 (yes: one hundred) times slower than threading! That
>> is I-must-be-doing-something-wrong-ishly slow. Any idea whether I am
>> doing something wrong? I can't believe the difference is so big.
>
> The bulk of the time is spent marshalling the data to the dictionary
> self.lookup. You can speed it up some by using a list there (it also
> makes the code much simpler). But the real trick is to communicate
> less often between the processes.
>
>     def mp_create_lookup(self):
>         local_lookup = []
>         lino, record_start = 0, 0
>         for line in self.data:
>             if not line:
>                 break
>             local_lookup.append(record_start)
>             if len(local_lookup) > 100:
>                 self.lookup.extend(local_lookup)
>                 local_lookup = []
>             record_start += len(line)
>         print(len(local_lookup))
>         self.lookup.extend(local_lookup)
>
> It's faster because it passes a larger list across the boundary every
> 100 records, instead of a single value every record.
>
> Note that the return statement wasn't ever needed, and you don't need
> a lino variable. Just use append.
>
> I still have to emphasize that record_start is just wrong. You must
> use ftell() if you're planning to use fseek() on a text file.
>
> You can also probably speed the process up a good deal by passing the
> filename to the other process, rather than opening the file in the
> original process. That will eliminate sharing the self.data across
> the process boundary.

Hi Dave,

Thanks. I followed your advice and this indeed makes a huuuge
difference: multiprocessing is now just 3 times slower than threading.
Even so, threading is still the way to go (also because of the added
complexity of the mp_create_lookup function).

Threading/mp aside: I agree that a dict is not the right choice here. I
consider a dict a mix between a Ferrari and a Mack truck: fast, but
bulky. Would it make sense to use array.array instead of a list? (A
quick sketch of what I mean is in the P.S. below.) I also checked
numpy.array, but numpy.append is very inefficient (it reminded me of
str.__iadd__). This site suggests that array.array could make a huge
difference in terms of RAM use:
http://www.dotnetperls.com/array-python. "The array with 10 million
integers required 43.8 MB of memory. The list version required 710.9
MB." (Note that it is followed by a word of caution.)

Albert-Jan
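P.S. Here is a minimal sketch of the array.array idea, combined with
your ftell()/fseek() point. The filename and the helper name are made
up, and I open the file in binary mode so that tell() and the byte
offsets are unambiguous:

    from array import array

    def build_offset_index(filename):
        """Hypothetical helper: one byte offset per record, stored compactly."""
        offsets = array("q")      # 'q' = signed 64-bit ints, 8 bytes per entry
        with open(filename, "rb") as f:
            pos = f.tell()
            line = f.readline()
            while line:
                offsets.append(pos)
                pos = f.tell()    # trust tell(), not a hand-kept byte count
                line = f.readline()
        return offsets

    offsets = build_offset_index("records.txt")
    with open("records.txt", "rb") as f:
        f.seek(offsets[2])        # jump straight to the third record
        print(f.readline())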
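P.P.S. And a self-contained sketch of your "pass the filename"
suggestion, so that only one pickled array ever crosses the process
boundary. Again, names like index_worker and records.txt are just for
illustration:

    from array import array
    from multiprocessing import Process, Queue

    def index_worker(filename, out_queue):
        # The child opens the file itself: nothing big is sent *to* it.
        offsets = array("q")
        with open(filename, "rb") as f:
            pos = f.tell()
            line = f.readline()
            while line:
                offsets.append(pos)
                pos = f.tell()
                line = f.readline()
        out_queue.put(offsets)    # one transfer back, not one per record

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=index_worker, args=("records.txt", q))
        p.start()
        offsets = q.get()         # get() before join(), or a large result can deadlock
        p.join()
        print(len(offsets), "records indexed")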
