On 24Nov2014 12:56, Albert-Jan Roskam <[email protected]> wrote:
> From: Cameron Simpson <[email protected]>
On 23Nov2014 22:30, Albert-Jan Roskam <[email protected]> wrote:
I created some code to get records from a potentially giant .csv file. This
implements a __getitem__ method that gets records from a memory-mapped csv file.
In order for this to work, I need to build a lookup table that maps line numbers
to line starts/ends. This works, BUT building the lookup table could be
time-consuming (and it freezes up the app). [...]

First up, multiprocessing is not what you want. You want threading for this.

The reason is that your row scan builds an in-memory index. If you do this in a
subprocess (mp.Process) then that index lives in a different process, and is
not accessible.

Hi Cameron, thanks for helping me. I read this page before I decided to go for
multiprocessing:
http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python.
I never *really* understood why cPython (with the GIL) could have threading anyway.
I am confused: I thought the idea of multiprocessing.Manager was to share
information.

Regarding the GIL, it will prevent the raw python interpreter from using more than one CPU: no two python opcodes run concurrently. However, any calls to C libraries or the OS which may block release the GIL (broadly speaking). So while the OS is off reading data from a hard drive or opening a network connection or something, the Python interpreter is free to run opcodes for other python threads. It is timesharing at the python opcode level. And if the OS or a C library is off doing work with the GIL released then you get true multithreading.

Most real code is not compute bound at the Python level, most of the time. Whenever you block for I/O or delegate work to a library or another process, your current Python Thread is stalled, allowing other Threads to run.

For myself, I use threads when algorithms naturally fall into parallel expression or for situations like yours where some lengthy process must run but I want the main body of code to commence work before it finishes. As it happens, one of my common use cases for the latter is reading a CSV file :-)

Anywhere you want to do things in parallel, ideally I/O bound, a Thread is a reasonable thing to consider. It lets you write the separate task in a nice linear fashion.

With a Thread (coding errors aside) you know where you stand: the data structures it works on are the very same ones used by the main program. (Of course, therein lie the hazards as well.)

With multiprocessing the subprocess works on distinct data sets and (from my reading) any shared data is managed by proxy objects that communicate between the processes. That gets you data isolation for the subprocess, but also higher latency in data access between the processes and of course the task of arranging those proxy objects.

For your task I would go with a Thread.
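
For example, an untested sketch of kicking off the index build in a Thread (assuming a class wrapping your create_lookup() from the pastebin; the class and attribute names here are placeholders, not your actual code):

  from threading import Thread

  class CSVStore(object):
      def __init__(self, filename):
          self.filename = filename
          self.lookup = []            # (start, end) offsets, filled in by the scanner
          self.lookup_done = False
          # build the index in the background; the caller gets the object
          # back immediately and only blocks when it asks for a record
          # that has not been indexed yet
          scanner = Thread(target=self.create_lookup)
          scanner.daemon = True
          scanner.start()

The main program constructs the object and carries on; only __getitem__ needs to care whether the scan has caught up.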

[...] when needed. But how do I know when I should do this if I don't yet know the
total number of records?

Make __getitem__ _block_ until self.lookup_done is True. At that point you
should know how many records there are.

Regarding blocking, you want a Condition object or a Lock (a Lock is simpler,
and Condition is more general). Using a Lock, you would create the Lock and
.acquire it. In create_lookup(), release() the Lock at the end. In __getitem__
(or any other function dependent on completion of create_lookup), .acquire()
and then .release() the Lock. That will cause it to block until the index scan
is finished.
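
Untested sketch of that Lock arrangement (same made-up CSVStore class as above):

  from threading import Lock, Thread

  class CSVStore(object):
      def __init__(self, filename):
          self.filename = filename
          self.index_ready = Lock()
          self.index_ready.acquire()      # held until the scan completes
          scanner = Thread(target=self.create_lookup)
          scanner.daemon = True
          scanner.start()

      def create_lookup(self):
          # ... scan the file and build the lookup table ...
          self.index_ready.release()      # unblock anyone waiting below

      def __getitem__(self, index):
          self.index_ready.acquire()      # blocks until create_lookup() is done
          self.index_ready.release()
          # ... fetch record number `index` from the now complete lookup table ...

Note that a plain Lock (unlike an RLock) may be released by a thread other than the one that acquired it, which is what makes this arrangement work.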

So __getitem__ cannot be called while the lookup is being created? But wouldn't that
defeat the purpose? My PyQt program around it initially shows the first 25
records. On many occasions that's all that's needed.

That depends on the CSV and how you're using it. If __getitem__ is just "give me row number N", then all it really needs to do is check against the current count of rows read. Keep such a counter, updated by the scanning/indexing thread. If the requested row number is less than the counter, fetch it and return it. Otherwise block/wait until the counter becomes big enough. (Or throw some exception if the calling code can cope with the notion of "data not ready yet".)

If you want __getitem__ to block, you will need to arrange a way to do that. Stupid programs busy wait:

 while counter < index_value:
   pass

Horrendous; it causes the CPU to max out _and_ gets in the way of other work, slowing everything down. The simple approach is a poll:

 while counter < index_value:
   sleep(0.1)

This polls 10 times a second. Tuning the sleep time is a subjective call: too frequent will consume resources, too infrequent will make __getitem__ too slow to respond when the counter finally catches up.

A more elaborate but truly blocking scheme is to have some kind of request queue, where __getitem__ makes (for example) a Condition variable and queues a request for "when the counter reaches this number". When the indexer reaches that number (or finishes indexing) it wakes up the condition and __getitem__ gets on with its task. This requires extra code in your indexer to (a) keep a PriorityQueue of requests and (b) check for the lowest one when it increments its record count. When the record count reaches the lowest request, wake up every request at that count, and then record the next request (if any) as the next "wake up" number. That is a sketch: there are complications, such as when a new request comes in lower than the current "lowest" request, and so forth.
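
If you do go down that road, a simpler variant (a single Condition rather than the PriorityQueue scheme sketched above, but to the same effect) has the indexer notify as the row count grows and each waiter re-check its own target. Untested, with the same made-up attribute names:

  from threading import Condition

  # in __init__:
  #   self.row_count = 0
  #   self.count_cond = Condition()

  def note_row(self):
      # called by the indexing thread for each record it indexes
      with self.count_cond:
          self.row_count += 1
          self.count_cond.notify_all()

  def note_done(self):
      # called by the indexing thread when the whole scan completes
      with self.count_cond:
          self.lookup_done = True
          self.count_cond.notify_all()

  def wait_for_row(self, index):
      # called from __getitem__: returns once row `index` has been indexed,
      # or the whole scan has finished (the caller then re-checks the count)
      with self.count_cond:
          while self.row_count <= index and not self.lookup_done:
              self.count_cond.wait()

notify_all() on every single row is a bit chatty; notifying every few hundred rows, plus once more at the end of the scan, would cut the overhead.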

I'd go with the 0.1s poll loop myself. It is simple and easy and will work. Use a better scheme later if needed.
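
With the poll, __getitem__ might look something like this (again with made-up attribute names: self.row_count updated by the indexing thread, self.lookup holding the offsets):

  import time

  def __getitem__(self, index):
      # wait until the indexer has reached this row, or has finished entirely
      while self.row_count <= index and not self.lookup_done:
          time.sleep(0.1)
      if index >= self.row_count:
          raise IndexError(index)
      start, end = self.lookup[index]
      # ... fetch and decode the bytes from start to end ...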

A remark about the create_lookup() function on pastebin: you go:

  record_start += len(line)

This presumes that a single text character on a line consumes a single byte of
memory or file disc space. However, your data file is utf-8 encoded, and some
characters may occupy more than one byte of storage. This means that your
record_start values will not be useful because they are character counts, not
byte counts, and you need byte counts to offset into a file if you are doing
random access.

Instead, note the value of unicode_csv_data.tell() before reading each line
(you will need to modify your CSV reader somewhat to do this, and maybe return
both the offset and line text). That is a byte offset to be used later.
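
Untested sketch of an index built that way; note that it opens the file in binary mode so that tell() and len() are both byte counts, and a line is only decoded from utf-8 when a record is actually fetched:

  def create_lookup(self):
      lookup = []
      with open(self.filename, 'rb') as f:
          while True:
              start = f.tell()        # byte offset of the line about to be read
              line = f.readline()
              if not line:
                  break
              lookup.append((start, start + len(line)))
              # ... bump the row counter / wake any waiters here ...
      self.lookup = lookup
      self.lookup_done = True

Because the offsets are byte positions they can be used directly with your mmap (or with seek()), no matter how many bytes each utf-8 character occupies.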

THANKS!! How could I not think of this... I initially started with open(), which
returns bytestrings. I could convert it to bytes and then take the len().

Converting to bytes relies on that conversion being symmetric and requires you to know the conversion required. Simply noting the .tell() value before the line is read avoids all that: where am I? Read line. Return line and start position. Simple and direct.

Cheers,
Cameron Simpson <[email protected]>