Re: [Tutor] multiprocessing question

Albert-Jan Roskam Mon, 24 Nov 2014 05:00:56 -0800

 ----- Original Message -----
 > From: Cameron Simpson <[email protected]>
 > To: Python Mailing List <[email protected]>
 > Cc: 
 > Sent: Monday, November 24, 2014 2:20 AM
 > Subject: Re: [Tutor] multiprocessing question
 > 
> On 23Nov2014 22:30, Albert-Jan Roskam <[email protected]> wrote:
>> I created some code to get records from a potentially giant .csv file. This
>> implements a __getitem__ method that gets records from a memory-mapped csv
>> file. In order for this to work, I need to build a lookup table that maps
>> line numbers to line starts/ends. This works, BUT building the lookup table
>> could be time-consuming (and it freezes up the app). The (somewhat pruned)
>> code is here: http://pastebin.com/0x6JKbfh. Now I would like to build the
>> lookup table in a separate process. I used multiprocessing. In the crude
>> example below, it appears to be doing what I have in mind. Is this the way
>> to do it? I have never used multiprocessing/threading before, apart from
>> playing around. One specific question: __getitem__ is supposed to throw an
>> IndexError when needed. But how do I know when I should do this if I don't
>> yet know the total number of records? Is there an uber-cheap way of getting
>> this number?
> 
> First up, multiprocessing is not what you want. You want threading for this.
> 
> The reason is that your row index makes an in-memory index. If you do this
> in a subprocess (mp.Process) then the in-memory index is in a different
> process, and not accessible.
Hi Cameron,

Thanks for helping me. I read this page before I decided to go for
multiprocessing:
http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python.
I never *really* understood why CPython (with GIL) could have threading anyway.
I am confused: I thought the idea of multiprocessing.Manager was to share
information.
>>> help(mp.Manager)
Help on function Manager in module multiprocessing:

Manager()
    Returns a manager associated with a running server process

    The managers methods such as `Lock()`, `Condition()` and `Queue()`
    can be used to create shared objects.

>>> help(mp.managers.SyncManager)
Help on class SyncManager in module multiprocessing.managers:

class SyncManager(BaseManager)
 |  Subclass of `BaseManager` which supports a number of shared object types.
 |
 |  The types registered are those intended for the synchronization
 |  of threads, plus `dict`, `list` and `Namespace`.
 |
 |  The `multiprocessing.Manager()` function creates started instances of
 |  this class.
 .......

>>> help(mp.Manager().dict)
Help on method dict in module multiprocessing.managers:

dict(self, *args, **kwds) method of multiprocessing.managers.SyncManager instance

> Use a Thread. Your code will be very similar.
Ok, I will try that.
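
For reference, a minimal sketch of the Thread variant, assuming Python 3. The
Reader class and the sample lines are hypothetical stand-ins for the pastebin
code; only create_lookup and lookup_done are names from this thread. Because a
thread shares the memory of the main program, the list it fills is the same
list the caller sees. With mp.Process in place of threading.Thread, the child
would fill its own copy and the parent's list would stay empty, unless the
data went through a Manager proxy, which forwards every access to a separate
server process.

```python
import threading


class Reader:
    """Hypothetical stand-in for the class that owns the lookup table."""

    def __init__(self, lines):
        self.lines = lines          # byte strings, one per record
        self.lookup = []            # start offset of each record
        self.lookup_done = False

    def create_lookup(self):
        offset = 0
        for line in self.lines:
            self.lookup.append(offset)
            offset += len(line)     # len() of bytes is a byte count
        self.lookup_done = True


reader = Reader([b"a,b\n", b"1,2\n", b"3,4\n"])

# Build the table in a background thread; the object is shared, not copied.
worker = threading.Thread(target=reader.create_lookup)
worker.start()
worker.join()

print(reader.lookup_done, reader.lookup)
```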
> Next: your code is short enough to include inline instead of forcing people
> to go to pastebin; in particular if I were reading your email offline (as I
> might do on a train) then I could not consult your code. Including it in the
> message is preferable, normally.

Sorry about that. I did not want to burden people with too many lines of
code. The pastebin code was meant as the problem context.
 
> Your approach of setting self.lookup_done to False and then later to True
> answers your question about "__getitem__ is supposed to throw an IndexError
> when needed. But how do I know when I should do this if I don't yet know the
> total number of records?" Make __getitem__ _block_ until self.lookup_done
> is True. At that point you should know how many records there are.

:-) Nice. I added those lines while editing the mail.
> 
> Regarding blocking, you want a Condition object or a Lock (a Lock is simpler,
> and Condition is more general). Using a Lock, you would create the Lock and
> .acquire() it. In create_lookup(), .release() the Lock at the end. In
> __getitem__ (or any other function dependent on completion of
> create_lookup), .acquire() and then .release() the Lock. That will cause it
> to block until the index scan is finished.

So __getitem__ cannot be called while the lookup table is being created? But
wouldn't that defeat the purpose? My PyQt program around it initially shows
the first 25 records. On many occasions that's all that's needed.
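
One possible way to keep early records available while the scan is still
running is the more general Condition mentioned above. The following is only a
sketch under assumptions: Python 3 (Condition.wait_for needs 3.2 or later), a
hypothetical LazyIndex class, an in-memory list of byte strings instead of the
memory-mapped file, and non-negative indices only. __getitem__ waits just
until the requested record has been indexed, and raises IndexError once the
scan has finished without reaching it.

```python
import threading


class LazyIndex:
    """Hypothetical index whose entries become readable as they are built."""

    def __init__(self, lines):
        self._lines = lines
        self._offsets = []                  # byte offset of each record start
        self._done = False
        self._cond = threading.Condition()
        worker = threading.Thread(target=self._build, daemon=True)
        worker.start()

    def _build(self):
        offset = 0
        for line in self._lines:
            with self._cond:
                self._offsets.append(offset)
                self._cond.notify_all()     # wake readers waiting for this record
            offset += len(line)
        with self._cond:
            self._done = True               # total number of records is now known
            self._cond.notify_all()

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("negative indices are not supported in this sketch")
        with self._cond:
            # Block only until record i exists or the scan is complete.
            self._cond.wait_for(lambda: i < len(self._offsets) or self._done)
            if i >= len(self._offsets):
                raise IndexError(i)
            return self._offsets[i]


records = [b"id,name\n", b"1,caf\xc3\xa9\n", b"2,x\n"]
index = LazyIndex(records)
first = index[0]    # available as soon as the first record is indexed
third = index[2]    # waits, if needed, until the third record is indexed
```

Asking for index[3] here blocks until the scan ends and then raises
IndexError, which is the behaviour __getitem__ is supposed to have.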
 
> A remark about the create_lookup() function on pastebin: you go:
>
>   record_start += len(line)
>
> This presumes that a single text character on a line consumes a single byte
> of memory or file disc space. However, your data file is utf-8 encoded, and
> some characters may be more than one byte of storage. This means that your
> record_start values will not be useful because they are character counts,
> not byte counts, and you need byte counts to offset into a file if you are
> doing random access.

THANKS!! How could I not think of this.. I initially started with open(), which
returns bytestrings. I could convert it to bytes and then take the len().

> Instead, note the value of unicode_csv_data.tell() before reading each line
> (you will need to modify your CSV reader somewhat to do this, and maybe
> return both the offset and line text). That is a byte offset to be used
> later.
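
A small sketch of the tell()-based approach, assuming Python 3 and a file
opened in binary mode. build_lookup and the temporary sample file are
hypothetical, and the sketch treats every physical line as one record, which
would not hold for a CSV with newlines inside quoted fields.

```python
import os
import tempfile


def build_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    with open(path, "rb") as f:             # binary mode: tell() counts bytes
        while True:
            start = f.tell()                # offset before reading the line
            line = f.readline()
            if not line:
                break
            lookup.append((start, f.tell()))
    return lookup


# Sample data with two non-ASCII characters (2 bytes each in utf-8).
text = "id,name\n1,caf\u00e9\n2,na\u00efve\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

lookup = build_lookup(path)

# Random access to the third line using the recorded byte offsets.
with open(path, "rb") as f:
    start, end = lookup[2]
    f.seek(start)
    third_line = f.read(end - start).decode("utf-8")

os.remove(path)
print(lookup, repr(third_line))
```

The sample text is 23 characters long but 25 bytes on disk, so offsets
accumulated from character counts would already be off by the third line.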
> 
> Cheers,
> Cameron Simpson <[email protected]>
> 
> George, discussing a patent and prior art:
> "Look, this  publication has a date, the patent has a priority date,
> can't you just compare them?"
> Paul Sutcliffe:
> "Not unless you're a lawyer."
> _______________________________________________
> Tutor maillist  -  [email protected]
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>

