Re: [Tutor] using multiprocessing efficiently to process large data file
On Sun, Sep 2, 2012 at 2:41 AM, Alan Gauld wrote:
>
>> if __name__ == '__main__':  # <-- required for Windows
>
> Why?
> What difference does that make in Windows?

It's a hack to get around the fact that Win32 doesn't fork(). Windows calls CreateProcess(), which loads a fresh interpreter, and multiprocessing then loads the module under a different name (i.e. not '__main__'). Without the guard, each child would create another Pool on import, and so on.

This is also why you can't share global data in Windows. A forked process on Linux uses copy-on-write, so you can load a large block of data before calling fork() and share it. In Windows the module is executed separately for each process, so each has its own copy. To share data in Windows, I think the fastest option is to use a ctypes shared Array. The example I wrote just uses the default Pool setup, which serializes (pickle) over pipes.

FYI, the Win32 API imposes the requirement to use CreateProcess(). The native NT kernel has no problem forking (e.g. for the POSIX subsystem). I haven't looked closely enough to know why they didn't implement fork() in Win32.
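For anyone curious what the ctypes shared Array option looks like in practice, here is a minimal sketch (I have not tested it on Windows); init_worker, task and the array contents are placeholders made up for illustration:

from multiprocessing import Pool, Array
import ctypes

def init_worker(arr):
    # Runs once in each worker process; stash the shared array in a
    # module-level global so tasks can read it without re-pickling it.
    global shared
    shared = arr

def task(i):
    # Each task just reads one slot of the shared memory.
    return shared[i] * 2

if __name__ == '__main__':
    # A ctypes double array living in shared memory; lock=False because
    # the workers only read from it.
    arr = Array(ctypes.c_double, range(10), lock=False)
    pool = Pool(initializer=init_worker, initargs=(arr,))
    print pool.map(task, range(10))
    pool.close()
    pool.join()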
Re: [Tutor] using multiprocessing efficiently to process large data file
On 02/09/12 06:48, eryksun wrote:

> from multiprocessing import Pool, cpu_count
> from itertools import izip_longest, imap
>
> FILE_IN = '...'
> FILE_OUT = '...'
>
> NLINES = 100  # estimate this for a good chunk_size
> BATCH_SIZE = 8
>
> def func(batch):
>     """ test func """
>     import os, time
>     time.sleep(0.001)
>     return "%d: %s\n" % (os.getpid(), repr(batch))
>
> if __name__ == '__main__':  # <-- required for Windows

Why?
What difference does that make in Windows?

>     file_in, file_out = open(FILE_IN), open(FILE_OUT, 'w')
>     nworkers = cpu_count() - 1
>
>     with file_in, file_out:
>         batches = izip_longest(*[file_in] * BATCH_SIZE)
>         if nworkers > 0:
>             pool = Pool(nworkers)
>             chunk_size = NLINES // BATCH_SIZE // nworkers
>             result = pool.imap(func, batches, chunk_size)
>         else:
>             result = imap(func, batches)
>         file_out.writelines(result)

just curious.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
Re: [Tutor] using multiprocessing efficiently to process large data file
On Sat, Sep 1, 2012 at 9:14 AM, Wayne Werner wrote:
>
> with open('inputfile') as f:
>     for line1, line2, line3, line4 in zip(f, f, f, f):
>         # do your processing here

Use itertools.izip_longest (zip_longest in 3.x) for this. Items in the final batch are set to fillvalue (defaults to None) if the iterator has reached the end of the file. Below I've included a template that uses a multiprocessing.Pool, but only if there are cores available. On a single-core system it falls back to itertools.imap (use the built-in map in 3.x).

from multiprocessing import Pool, cpu_count
from itertools import izip_longest, imap

FILE_IN = '...'
FILE_OUT = '...'

NLINES = 100  # estimate this for a good chunk_size
BATCH_SIZE = 8

def func(batch):
    """ test func """
    import os, time
    time.sleep(0.001)
    return "%d: %s\n" % (os.getpid(), repr(batch))

if __name__ == '__main__':  # <-- required for Windows

    file_in, file_out = open(FILE_IN), open(FILE_OUT, 'w')
    nworkers = cpu_count() - 1

    with file_in, file_out:
        batches = izip_longest(*[file_in] * BATCH_SIZE)
        if nworkers > 0:
            pool = Pool(nworkers)
            chunk_size = NLINES // BATCH_SIZE // nworkers
            result = pool.imap(func, batches, chunk_size)
        else:
            result = imap(func, batches)
        file_out.writelines(result)
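To make the fillvalue behaviour concrete: when the line count isn't a multiple of BATCH_SIZE, the last batch is padded with None, so func should skip the padding before doing real work. A tiny sketch of that, with StringIO standing in for the real file (the five-line content is made up):

from itertools import izip_longest
from StringIO import StringIO  # Python 2

f = StringIO("1\n2\n3\n4\n5\n")      # 5 lines, grouped in batches of 4
for batch in izip_longest(*[f] * 4):
    # the final batch is ('5\n', None, None, None); drop the padding
    lines = [line for line in batch if line is not None]
    print lines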
Re: [Tutor] using multiprocessing efficiently to process large data file
On Thu, 30 Aug 2012, Abhishek Pratap wrote:

> Hi Guys
>
> I have a file with a few million lines. I want to process each block of
> 8 lines, and from my estimate my job is not IO bound. In other words it
> takes a lot more time to do the computation than it would take to
> simply read the file.
>
> I am wondering how I can go about reading data from this file at a
> faster pace and then farm out the jobs to worker functions using the
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
>    primarily because I don't know how to read a file in parallel
>    efficiently.
> 2. keep reading the file sequentially into a buffer of some size and
>    farm out chunks of the data through multiprocessing.

As other folks have mentioned, having at least your general algorithm available would make things a lot better. But here's another way you could iterate over the file, if you know exactly how many lines you have (or at least a number it's divisible by):

with open('inputfile') as f:
    for line1, line2, line3, line4 in zip(f, f, f, f):
        # do your processing here

The caveat is that if your line count isn't evenly divisible by 4, you'll lose the last count % 4 lines.

The reason this can work is that zip() combines several sequences and returns a new iterator. In this case it's combining the file handles f, which are themselves iterators, so each pass through the for loop calls next() on f four more times.

The problem, of course, is when you reach the end of the file - say you're on the last pass and you've only got one line left. When zip's iterator calls next() on the first f, that returns the last line. But since f is now at the end of the file, calling next() on it again raises StopIteration, which ends your loop without actually processing anything on the inside!

So, this probably isn't the best way to handle your issue, but maybe it is!

HTH,
Wayne
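A quick way to see that caveat in action without a real file, using StringIO to fake a five-line file (the contents are made up purely for illustration):

from StringIO import StringIO  # Python 2

f = StringIO("a\nb\nc\nd\ne\n")    # 5 lines; 5 % 4 == 1 line will be dropped
for line1, line2, line3, line4 in zip(f, f, f, f):
    print line1.strip(), line2.strip(), line3.strip(), line4.strip()
# prints "a b c d" once; the trailing line "e" is silently lost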
Re: [Tutor] using multiprocessing efficiently to process large data file
Please always respond to the list. And avoid top posting.

> -----Original Message-----
> From: Abhishek Pratap [mailto:abhishek@gmail.com]
> Sent: Thursday, August 30, 2012 5:47 PM
> To: Prasad, Ramit
> Subject: Re: [Tutor] using multiprocessing efficiently to process large data file
>
> Hi Ramit
>
> Thanks for your quick reply. Unfortunately, given the size of the file
> I can't afford to load it all into memory in one go.
> I could read, say, the first 1 million lines, process them in parallel,
> and so on. I am looking for some example which does something similar.
>
> -Abhi

The same logic should work: just process your batch after checking its size, and iterate over the file directly instead of reading it into memory.

with open(file, 'r') as f:
    iterdata = iter(f)
    grouped_data = []
    for d in iterdata:
        l = [d, next(iterdata)]  # make this list 8 elements instead
        grouped_data.append(l)
        if len(grouped_data) > 1000000 / 8:  # one million lines
            # process batch
            grouped_data = []
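If you would rather group 8 lines at a time without writing out eight next() calls, itertools.islice can do the slicing. A rough sketch along the same lines; 'inputfile' and process_batch are placeholder names, and the batch size is only an example:

from itertools import islice

def process_batch(groups):
    # stand-in for the real computation over a list of 8-line groups
    pass

BLOCK = 8
GROUPS_PER_BATCH = 1000000 // BLOCK   # roughly one million lines per batch

with open('inputfile') as f:
    grouped_data = []
    while True:
        block = list(islice(f, BLOCK))   # next 8 lines (shorter at EOF)
        if not block:
            break
        grouped_data.append(block)
        if len(grouped_data) >= GROUPS_PER_BATCH:
            process_batch(grouped_data)
            grouped_data = []
    if grouped_data:                      # leftover partial batch
        process_batch(grouped_data)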
Re: [Tutor] using multiprocessing efficiently to process large data file
On 30/08/12 23:19, Abhishek Pratap wrote:
> I am wondering how I can go about reading data from this file at a
> faster pace and then farm out the jobs to worker functions using the
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
>    primarily because I don't know how to read a file in parallel
>    efficiently.

Can you show us what you tried? It's always easier to give an answer to a concrete example than to a hypothetical scenario.

> 2. keep reading the file sequentially into a buffer of some size and
>    farm out chunks of the data through multiprocessing.

This is the model I've used. In pseudocode, roughly:

chunk = []
for line, data in enumerate(file):
    chunk.append(data)
    if (line + 1) % chunksize == 0:
        launch_subprocess(chunk)
        chunk = []

I'd tend to go for big chunks - if you have a million lines in your file I'd pick a chunksize of around 10,000-100,000 lines. If you go too small, the overhead of starting the subprocess will swamp any gains you get.

Also remember the constraints of how many actual CPUs/cores you have. Too many tasks spread over too few CPUs will just cause more swapping. Anything less than 4 cores is probably not worth the effort - just maximise the efficiency of your algorithm, which is probably worth doing first anyway.

HTH,

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
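Filling that pseudocode in with multiprocessing.Pool gives something like the sketch below; process_chunk and 'inputfile' are names invented for illustration, and the chunk size is just the kind of value suggested above:

from multiprocessing import Pool

def process_chunk(chunk):
    # stand-in for the real computation on a list of lines
    return len(chunk)

if __name__ == '__main__':   # required on Windows (see the other posts in this thread)
    CHUNKSIZE = 10000        # big chunks, per the advice above
    pool = Pool()            # defaults to one worker per core
    results = []
    chunk = []
    with open('inputfile') as f:
        for data in f:
            chunk.append(data)
            if len(chunk) == CHUNKSIZE:
                results.append(pool.apply_async(process_chunk, (chunk,)))
                chunk = []
        if chunk:            # don't forget the final partial chunk
            results.append(pool.apply_async(process_chunk, (chunk,)))
    pool.close()
    pool.join()
    print [r.get() for r in results]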
Re: [Tutor] using multiprocessing efficiently to process large data file
> I have a file with a few million lines. I want to process each block of
> 8 lines, and from my estimate my job is not IO bound. In other words it
> takes a lot more time to do the computation than it would take to
> simply read the file.
>
> I am wondering how I can go about reading data from this file at a
> faster pace and then farm out the jobs to worker functions using the
> multiprocessing module.
>
> I can think of two ways.
>
> 1. split the file and read it in parallel (didn't work well for me),
>    primarily because I don't know how to read a file in parallel
>    efficiently.
> 2. keep reading the file sequentially into a buffer of some size and
>    farm out chunks of the data through multiprocessing.
>
> Any example would be of great help.

The general logic should work, but I did not test it with a real file.

with open(file, 'r') as f:
    data = f.readlines()
    iterdata = iter(data)
    grouped_data = []
    for d in iterdata:
        l = [d, next(iterdata)]  # make this list 8 elements instead
        grouped_data.append(l)
    # batch_process on grouped_data

Theoretically you might be able to call next() directly on the file without doing readlines().

Ramit
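For what it's worth, calling next() directly on the file object does work, since files are their own iterators; the only wrinkle is that with an odd number of lines the extra next() raises StopIteration on the last group. A hedged sketch ('inputfile' is a made-up name, and the pair would grow to 8 elements for 8-line blocks):

with open('inputfile') as f:
    grouped_data = []
    for d in f:
        try:
            group = [d, next(f)]     # extend with more next(f) calls for 8 lines
        except StopIteration:
            group = [d]              # odd trailing line
        grouped_data.append(group)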
[Tutor] using multiprocessing efficiently to process large data file
Hi Guys

I have a file with a few million lines. I want to process each block of 8 lines, and from my estimate my job is not IO bound. In other words, it takes a lot more time to do the computation than it would take to simply read the file.

I am wondering how I can go about reading data from this file at a faster pace and then farm out the jobs to worker functions using the multiprocessing module.

I can think of two ways.

1. split the file and read it in parallel (didn't work well for me), primarily because I don't know how to read a file in parallel efficiently.
2. keep reading the file sequentially into a buffer of some size and farm out chunks of the data through multiprocessing.

Any example would be of great help.

Thanks!
-Abhi