On 11/27/2014 04:01 PM, Albert-Jan Roskam wrote:
> I made a comparison between multiprocessing and threading.  In the code
> below (it's also here: http://pastebin.com/BmbgHtVL), multiprocessing is
> more than 100 (yes: one hundred) times slower than threading! That is
> I-must-be-doing-something-wrong-ishly slow. Any idea whether I am doing
> something wrong? I can't believe the difference is so big.

The bulk of the time is spent marshalling the data into the dictionary self.lookup. You can speed it up somewhat by using a list there (it also makes the code much simpler), but the real trick is to communicate less often between the processes.

    def mp_create_lookup(self):
        # Collect offsets locally and push them to the shared list in
        # batches, instead of making one proxy call per record.
        local_lookup = []
        record_start = 0
        for line in self.data:
            if not line:
                break
            local_lookup.append(record_start)
            if len(local_lookup) > 100:
                self.lookup.extend(local_lookup)
                local_lookup = []
            record_start += len(line)
        # Flush whatever remains in the final partial batch.
        self.lookup.extend(local_lookup)

It's faster because it passes a larger list across the process boundary once every 100 records, instead of a single value for every record.

Note that the return statement was never needed, and you don't need a lino variable; just use append.

I still have to emphasize that record_start is just wrong. You must use tell() if you're planning to use seek() on a text file; summing len(line) counts characters, not the bytes (or opaque positions) that seek() expects.
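
Here's a minimal sketch of that approach (create_lookup and path are placeholder names; it assumes the file is opened in text mode). It uses readline() instead of iterating the file object, because Python disables tell() while a text file is being iterated:

    def create_lookup(path):
        # Record the tell() position of each line, so every offset is
        # guaranteed to be a valid seek() target later.
        offsets = []
        with open(path) as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                offsets.append(offset)
        return offsets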

You can also probably speed the process up a good deal by passing the filename to the other process, rather than opening the file in the original process. That eliminates sharing self.data across the process boundary.
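
For example, something like this (just a sketch: it assumes the shared list is a multiprocessing.Manager().list(), and worker() and the filename are made up). Only the filename string crosses the process boundary; the child opens the file itself, and it still batches its updates:

    import multiprocessing

    def worker(path, lookup):
        # The child opens the file itself; only the filename was sent.
        local = []
        with open(path) as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                local.append(offset)
                if len(local) >= 100:
                    lookup.extend(local)   # one proxy call per batch
                    local = []
        lookup.extend(local)               # flush the final partial batch

    if __name__ == '__main__':
        manager = multiprocessing.Manager()
        lookup = manager.list()
        p = multiprocessing.Process(target=worker,
                                    args=('data.txt', lookup))
        p.start()
        p.join()
        print(len(lookup))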



--
DaveA