Re: [Tutor] multiprocessing question

Cameron Simpson Sun, 23 Nov 2014 17:49:37 -0800

On 23Nov2014 22:30, Albert-Jan Roskam <[email protected]> wrote:

I created some code to get records from a potentially giant .csv file. This 
implements a __getitem__ method that gets records from a memory-mapped csv 
file. In order for this to work, I need to build a lookup table that maps line 
numbers to line starts/ends. This works, BUT building the lookup table could be 
time-consuming (and it freezes up the app). The (somewhat pruned) code is here: 
http://pastebin.com/0x6JKbfh. Now I would like to build the lookup table in a 
separate process. I used multiprocessing. In the crude example below, it 
appears to be doing what I have in mind. Is this the way to do it? I have never 
used multiprocessing/threading before, apart from playing around. One specific 
question: __getitem__ is supposed to throw an IndexError when needed. But how 
do I know when I should do this if I don't yet know the total number of 
records? Is there an uber-cheap way of getting this number?


First up, multiprocessing is not what you want. You want threading for this.

The reason is that your code builds an in-memory index. If you do this in a subprocess (mp.Process) then the index lives in a different process's memory, and is not accessible from the parent process.


Use a Thread. Your code will be very similar.
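
A minimal sketch of the threaded version, to show that the worker writes into the same object the main thread reads. The class name LazyIndex, its attributes and the sample data are made up for illustration; they are not taken from your pastebin code:

```python
import threading


class LazyIndex:
    """Hypothetical stand-in for the CSV reader class (names are illustrative)."""

    def __init__(self, lines):
        self.lines = lines
        self.lookup = []            # shared with the worker thread
        self.lookup_done = False
        self.worker = threading.Thread(target=self.create_lookup)
        self.worker.start()

    def create_lookup(self):
        # Runs in the worker thread, but appends to the same self.lookup
        # list that the main thread sees, because threads share memory.
        start = 0
        for line in self.lines:
            self.lookup.append((start, start + len(line)))
            start += len(line)
        self.lookup_done = True


idx = LazyIndex([b"a,b\n", b"cc,dd\n"])
idx.worker.join()                   # wait here only for the sake of the demo
print(idx.lookup)                   # [(0, 4), (4, 10)]
```

With mp.Process in place of threading.Thread, the child would fill in its own copy of self.lookup and the parent's list would stay empty.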

Next: your code is short enough to include inline instead of forcing people to go to pastebin; in particular, if I were reading your email offline (as I might do on a train) then I could not consult your code. Including it in the message is normally preferable.

Your approach of setting self.lookup_done to False and then later to True answers your question about "__getitem__ is supposed to throw an IndexError when needed. But how do I know when I should do this if I don't yet know the total number of records?" Make __getitem__ _block_ until self.lookup_done is True. At that point you should know how many records there are.

Regarding blocking, you want a Condition object or a Lock (a Lock is simpler, and a Condition is more general). Using a Lock, you would create the Lock and .acquire() it immediately, before starting the index scan. In create_lookup(), .release() the Lock at the end. In __getitem__ (or any other function dependent on completion of create_lookup), .acquire() and then .release() the Lock. That will cause it to block until the index scan is finished.
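
The Lock pattern might look roughly like this. It is only a sketch: the Records class, the lookup_lock attribute and the trivial "index" are invented for illustration, and negative indices are simply rejected to keep it short:

```python
import threading


class Records:
    """Hypothetical class; only the locking pattern matters here."""

    def __init__(self, rows):
        self.rows = rows
        self.lookup = []
        self.lookup_lock = threading.Lock()
        self.lookup_lock.acquire()      # held until the index scan finishes
        threading.Thread(target=self.create_lookup).start()

    def create_lookup(self):
        try:
            for i, _ in enumerate(self.rows):
                self.lookup.append(i)
        finally:
            # A plain Lock (unlike an RLock) may be released by a thread
            # other than the one that acquired it.
            self.lookup_lock.release()

    def __getitem__(self, n):
        # Blocks until create_lookup() has released the lock, then
        # releases it again at once so other callers can get through.
        with self.lookup_lock:
            pass
        if not 0 <= n < len(self.lookup):
            raise IndexError(n)
        return self.rows[self.lookup[n]]


r = Records(["first", "second"])
print(r[1])                             # second
```

The try/finally is there so that a failure during the scan does not leave every later __getitem__ call blocked forever.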


A remark about the create_lookup() function on pastebin: you go:

 record_start += len(line)

This presumes that a single text character on a line consumes a single byte of memory or file disc space. However, your data file is utf-8 encoded, and some characters may take more than one byte of storage. This means that your record_start values will not be useful because they are character counts, not byte counts, and you need byte counts to offset into a file if you are doing random access.
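
A quick illustration of the difference, using a made-up line (Python 3 syntax, where str is a sequence of characters):

```python
# The non-ASCII characters are written as escapes so the example does not
# depend on how this message itself is encoded.
line = "caf\u00e9,na\u00efve\n"
char_count = len(line)                   # counts characters
byte_count = len(line.encode("utf-8"))   # counts bytes as stored in the file
print(char_count, byte_count)            # 11 13
```

Each of the two accented characters takes two bytes in utf-8, so an offset built from character counts drifts by one byte for every such character that precedes it.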

Instead, note the value of unicode_csv_data.tell() before reading each line (you will need to modify your CSV reader somewhat to do this, and maybe return both the offset and line text). That is a byte offset to be used later.
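
One way to collect such offsets, sketched under some assumptions: the file is opened in binary mode and read with readline(), because mixing tell() with "for line in f" iteration is unreliable (read-ahead buffering in Python 2, an error on text files in Python 3). The function name build_offsets and the sample data are made up, and it assumes one record per physical line, i.e. no newlines inside quoted CSV fields:

```python
import io


def build_offsets(f):
    """Return a list of (byte_offset, decoded_line) for a binary file object."""
    offsets = []
    while True:
        start = f.tell()            # byte offset before reading the line
        raw = f.readline()
        if not raw:                 # empty bytes means end of file
            break
        offsets.append((start, raw.decode("utf-8")))
    return offsets


# io.BytesIO stands in for open(filename, "rb") so the sketch is self-contained.
data = "caf\u00e9,1\nna\u00efve,2\n".encode("utf-8")
f = io.BytesIO(data)
table = build_offsets(f)

f.seek(table[1][0])                 # jump straight to the second record
second = f.readline().decode("utf-8")
print(table[1][0], repr(second))
```

The same offsets can be used to slice a memory-mapped file, since mmap positions are byte positions as well.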


Cheers,
Cameron Simpson <[email protected]>

George, discussing a patent and prior art:
"Look, this  publication has a date, the patent has a priority date,
can't you just compare them?"
Paul Sutcliffe:
"Not unless you're a lawyer."
