On 23Nov2014 22:30, Albert-Jan Roskam <[email protected]> wrote:
I created some code to get records from a potentially giant .csv file. This 
implements a __getitem__ method that gets records from a memory-mapped csv 
file. In order for this to work, I need to build a lookup table that maps line 
numbers to line starts/ends. This works, BUT building the lookup table could be 
time-consuming (and it freezes up the app). The (somewhat pruned) code is here: 
http://pastebin.com/0x6JKbfh. Now I would like to build the lookup table in a 
separate process. I used multiprocessing. In the crude example below, it 
appears to be doing what I have in mind. Is this the way to do it? I have never 
used multiprocessing/threading before, apart from playing around. One specific 
question: __getitem__ is supposed to throw an IndexError when needed. But how 
do I know when I should do this if I don't yet know the total number of 
records? Is there an uber cheap way of getting this number?

First up, multiprocessing is not what you want. You want threading for this.

The reason is that your row index is built in memory. If you build it in a subprocess (mp.Process) then the in-memory index lives in a different process, and is not accessible from your main program.

Use a Thread. Your code will be very similar.
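
A minimal sketch of the Thread version, to show that the index ends up in the same process as the code that reads it. The class name LineIndex and the list of byte strings standing in for your file are made up for illustration; only create_lookup and lookup_done are names from your code.

```python
import threading

class LineIndex:
    """Hypothetical stand-in for the class on pastebin."""

    def __init__(self, lines):
        self.lines = lines
        self.lookup = []            # filled in by the worker thread
        self.lookup_done = False
        self.worker = threading.Thread(target=self.create_lookup)
        self.worker.daemon = True   # do not keep the program alive for the scan
        self.worker.start()

    def create_lookup(self):
        # Runs in the worker thread, but writes to self.lookup, which the
        # main thread can see because threads share memory.
        start = 0
        for line in self.lines:
            self.lookup.append((start, start + len(line)))
            start += len(line)
        self.lookup_done = True

index = LineIndex([b"a,b\n", b"1,2\n", b"3,4\n"])
index.worker.join()                 # only so this demo prints a complete index
print(index.lookup)
```

With mp.Process in place of threading.Thread, the child process would fill its own copy of self.lookup and the parent's list would stay empty.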

Next: your code is short enough to include inline instead of forcing people to go to pastebin; in particular, if I were reading your email offline (as I might do on a train) then I could not consult your code. Including it in the message is normally preferable.

Your approach of setting self.lookup_done to False and then later to True answers your question about "__getitem__ is supposed to throw an IndexError when needed. But how do I know when I should do this if I don't yet know the total number of records?" Make __getitem__ _block_ until self.lookup_done is True. At that point you should know how many records there are.

Regarding blocking, you want a Condition object or a Lock (a Lock is simpler, and Condition is more general). Using a Lock, you would create the Lock and .acquire it. In create_lookup(), release() the Lock at the end. In __getitem__ (or any other function dependent on completion of create_lookup), .acquire() and then .release() the Lock. That will cause it to block until the index scan is finished.
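
The Lock recipe might look something like the sketch below. The class name BlockingIndex, the attribute lookup_lock and the list of byte strings standing in for the file are assumptions for illustration, not your code.

```python
import threading

class BlockingIndex:
    """Hypothetical sketch: __getitem__ waits for the index scan."""

    def __init__(self, lines):
        self.lines = lines
        self.lookup = []
        self.lookup_lock = threading.Lock()
        self.lookup_lock.acquire()          # held until the scan completes
        worker = threading.Thread(target=self.create_lookup)
        worker.daemon = True
        worker.start()

    def create_lookup(self):
        try:
            start = 0
            for line in self.lines:
                self.lookup.append((start, start + len(line)))
                start += len(line)
        finally:
            # A plain Lock may be released by a thread other than the one
            # that acquired it (an RLock may not).
            self.lookup_lock.release()

    def __getitem__(self, i):
        # Acquire and immediately release: this blocks while the scan runs.
        with self.lookup_lock:
            pass
        # The index is complete now, so the list raises IndexError itself
        # for any out-of-range i.
        return self.lookup[i]

idx = BlockingIndex([b"a,b\n", b"1,2\n"])
print(idx[1])
```

Once the scan has finished, len(self.lookup) is the total number of records, so the IndexError question answers itself.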

A remark about the create_lookup() function on pastebin: you go:

 record_start += len(line)

This presumes that a single text character on a line consumes a single byte of memory or file disk space. However, your data file is utf-8 encoded, and some characters may take more than one byte of storage. This means that your record_start values will not be useful because they are character counts, not byte counts, and you need byte counts to offset into a file if you are doing random access.

Instead, note the value of unicode_csv_data.tell() before reading each line (you will need to modify your CSV reader somewhat to do this, and maybe return both the offset and line text). That is a byte offset to be used later.
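
A small sketch of the difference, assuming the file is read in binary mode (a BytesIO stands in for open(path, "rb") here, and the sample data is made up). It uses readline() rather than a for loop because, depending on the Python version and file mode, tell() can be unreliable or disabled while iterating over a file.

```python
import io

# "ï" is one character but two bytes in utf-8.
data = "naïve,1\nplain,2\n".encode("utf-8")
f = io.BytesIO(data)            # stands in for open(path, "rb")

offsets = []
while True:
    offset = f.tell()           # byte offset of the line about to be read
    line = f.readline()
    if not line:
        break
    offsets.append(offset)

print(offsets)                  # byte offsets: [0, 9]
print(len("naïve,1\n"))         # character count of line one: 8

# Random access to the second record using its byte offset.
f.seek(offsets[1])
record = f.readline().decode("utf-8")
print(repr(record))
```

Seeking to 8, the character count, would land one byte before the second record and return a stray newline instead of the record.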

Cheers,
Cameron Simpson <[email protected]>

George, discussing a patent and prior art:
"Look, this  publication has a date, the patent has a priority date,
can't you just compare them?"
Paul Sutcliffe:
"Not unless you're a lawyer."