----- Original Message -----
> From: Cameron Simpson <c...@zip.com.au>
> To: Python Mailing List <tutor@python.org>
> Cc:
> Sent: Monday, November 24, 2014 2:20 AM
> Subject: Re: [Tutor] multiprocessing question
>
> On 23Nov2014 22:30, Albert-Jan Roskam <fo...@yahoo.com.dmarc.invalid> wrote:
>> I created some code to get records from a potentially giant .csv file. This
>> implements a __getitem__ method that gets records from a memory-mapped csv
>> file. In order for this to work, I need to build a lookup table that maps
>> line numbers to line starts/ends. This works, BUT building the lookup table
>> could be time-consuming (and it freezes up the app). The (somewhat pruned)
>> code is here: http://pastebin.com/0x6JKbfh. Now I would like to build the
>> lookup table in a separate process. I used multiprocessing. In the crude
>> example below, it appears to be doing what I have in mind. Is this the way
>> to do it? I have never used multiprocessing/threading before, apart from
>> playing around. One specific question: __getitem__ is supposed to throw an
>> IndexError when needed. But how do I know when I should do this if I don't
>> yet know the total number of records? Is there an uber-cheap way of getting
>> this number?
>
> First up, multiprocessing is not what you want. You want threading for this.
>
> The reason is that your row index makes an in-memory index. If you do this in
> a subprocess (mp.Process) then the in-memory index is in a different process,
> and not accessible.

Hi Cameron,

Thanks for helping me. I read this page before I decided to go for
multiprocessing:
http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python.
I never *really* understood why CPython (with GIL) could have threading
anyway. I am confused: I thought the idea of multiprocessing.Manager was to
share information.

>>> help(mp.Manager)
Help on function Manager in module multiprocessing:

Manager()
    Returns a manager associated with a running server process

    The managers methods such as `Lock()`, `Condition()` and `Queue()`
    can be used to create shared objects.

>>> help(mp.managers.SyncManager)
Help on class SyncManager in module multiprocessing.managers:

class SyncManager(BaseManager)
 |  Subclass of `BaseManager` which supports a number of shared object types.
 |
 |  The types registered are those intended for the synchronization
 |  of threads, plus `dict`, `list` and `Namespace`.
 |
 |  The `multiprocessing.Manager()` function creates started instances of
 |  this class.
.......
>>> help(mp.Manager().dict)
Help on method dict in module multiprocessing.managers:

dict(self, *args, **kwds) method of multiprocessing.managers.SyncManager instance

> Use a Thread. Your code will be very similar.

Ok, I will try that.

> Next: your code is short enough to include inline instead of forcing people
> to go to pastebin; in particular if I were reading your email offline (as I
> might do on a train) then I could not consult your code. Including it in the
> message is preferable, normally.

Sorry about that. I did not want to burden people with too many lines of
code. The pastebin code was meant as the problem context.

> Your approach of setting self.lookup_done to False and then later to True
> answers your question about "__getitem__ is supposed to throw an IndexError

:-) Nice. I added those lines while editing the mail.

> when needed. But how do I know when I should do this if I don't yet know the
> total number of records?" Make __getitem__ _block_ until self.lookup_done is
> True. At that point you should know how many records there are.
>
> Regarding blocking, you want a Condition object or a Lock (a Lock is simpler,
> and Condition is more general). Using a Lock, you would create the Lock and
> .acquire it. In create_lookup(), release() the Lock at the end.
> In __getitem__ (or any other function dependent on completion of
> create_lookup), .acquire() and then .release() the Lock. That will cause it
> to block until the index scan is finished.

So __getitem__ cannot be called while it is being created? But wouldn't that
defeat the purpose? My PyQt program around it initially shows the first 25
records. On many occasions that's all that's needed.

> A remark about the create_lookup() function on pastebin: you go:
>
>   record_start += len(line)

THANKS!! How could I not think of this.. I initially started with open(),
which returns bytestrings. I could convert it to bytes and then take the len()

> This presumes that a single text character on a line consumes a single byte
> of memory or file disc space. However, your data file is utf-8 encoded, and
> some characters may be more than one byte of storage. This means that your
> record_start values will not be useful because they are character counts, not
> byte counts, and you need byte counts to offset into a file if you are doing
> random access.
>
> Instead, note the value of unicode_csv_data.tell() before reading each line
> (you will need to modify your CSV reader somewhat to do this, and maybe
> return both the offset and line text). That is a byte offset to be used
> later.
>
> Cheers,
> Cameron Simpson <c...@zip.com.au>
>
> George, discussing a patent and prior art:
>   "Look, this publication has a date, the patent has a priority date,
>   can't you just compare them?"
> Paul Sutcliffe:
>   "Not unless you're a lawyer."
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
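
For reference, here is a minimal sketch of how the two suggestions in this
thread (a background Thread plus blocking, and byte offsets instead of
character counts) might fit together. It is not the pastebin code: the class
name LazyLineIndex and the attributes offsets, done and cond are made up for
illustration. It is written for Python 3 and assumes one record per physical
line (no newlines embedded in quoted CSV fields). It uses a Condition rather
than a plain Lock, so that __getitem__ only waits until the requested record
has been indexed; that way the first 25 records can be shown while the rest
of the file is still being scanned.

```python
import os
import tempfile
import threading


class LazyLineIndex:
    """Hypothetical sketch: index line starts (byte offsets) in a thread."""

    def __init__(self, path):
        self.path = path
        self.offsets = []      # byte offset of the start of each line
        self.done = False      # becomes True once the whole file is scanned
        self.cond = threading.Condition()
        self.thread = threading.Thread(target=self._build, daemon=True)
        self.thread.start()

    def _build(self):
        # Binary mode: len(line) counts bytes, so the offsets suit seek().
        try:
            with open(self.path, "rb") as f:
                pos = 0
                for line in f:
                    with self.cond:
                        self.offsets.append(pos)
                        self.cond.notify_all()
                    pos += len(line)
        finally:
            with self.cond:
                self.done = True
                self.cond.notify_all()

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("negative indexes are not handled in this sketch")
        with self.cond:
            # Wait until record i is indexed, or until the scan has finished.
            self.cond.wait_for(lambda: len(self.offsets) > i or self.done)
            if i >= len(self.offsets):
                raise IndexError(i)
            start = self.offsets[i]
        with open(self.path, "rb") as f:
            f.seek(start)
            return f.readline().decode("utf-8").rstrip("\r\n")

    def __len__(self):
        # The total is only known once the scan is complete, so block for it.
        with self.cond:
            self.cond.wait_for(lambda: self.done)
            return len(self.offsets)


if __name__ == "__main__":
    # Small demo file with multi-byte UTF-8 characters.
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write("id,name\n1,Zoë\n2,Søren\n".encode("utf-8"))
    records = LazyLineIndex(tmp.name)
    print(records[2])
    print(len(records))
    os.unlink(tmp.name)
```

With this shape an out-of-range index raises IndexError only after the scan
has finished, which is the earliest point at which the total number of
records is actually known.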