On Mar 26, 3:56 pm, Abhishek Pratap <abhishek....@gmail.com> wrote:
> Hi Guys
>
> I am forwarding this question from the python tutor list in the hope
> of reaching more people experienced in concurrent disk access in
> Python.
>
> I am trying to see if there are ways in which I can read a big file
> concurrently on a multi-core server, process the data, and write the
> output to a single file as the data is processed.
>
> For example, if I have a 50GB file, I would like to read it in
> parallel with 10 processes/threads, each working on a 5GB chunk,
> perform the same data-parallel computation on each chunk, and collate
> the output into a single file.
>
> I will appreciate your feedback. I did find some threads about this
> on stackoverflow but it was not clear to me what would be a good way
> to go about implementing this.
Have you written a single-core solution to your problem? If so, can
you post the code here?

If CPU isn't your primary bottleneck, then you need to be careful not
to overcomplicate your solution by getting multiple cores involved.
All the coordination might make your program slower and more buggy. If
CPU is the primary bottleneck, then you might want to consider an
approach where a single thread reads records from the file, 10 at a
time, dispatches the calculations out to different threads, and then
writes the results back to disk.

My approach would be something like this:

1) Take a small sample of your dataset so that you can process it
within 10 seconds or so using a simple, single-core program.

2) Figure out whether you're CPU-bound. A simple way to do this is to
comment out the actual computation or replace it with a trivial stub.
If you're CPU-bound, the program will run much faster. If you're
IO-bound, the program won't run much faster (since all the work is
actually just reading from disk). A timing sketch for this experiment
appears below.

3) Figure out how to read 10 records at a time and farm the records
out to threads. Hopefully, your program will take significantly less
time. At this point, don't obsess over collating the output. It might
not be 10 times as fast, but it should be enough faster to be worth
your while. (A sketch of this pattern also appears below.)

4) If the threaded approach shows promise, make sure that you can
still generate correct output with that approach (in other words,
figure out the synchronization and collating).

At the end of that experiment, you should have a better feel for where
to go next.

What is the nature of your computation? Maybe it would be easier to
tune the algorithm than to figure out the multi-core optimization.
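
One quick way to run the experiment in step 2: time one full pass over
the sample with the real computation, and one with a stub that does no
work. This is a minimal sketch, assuming line-oriented records; the
file name and process_record are placeholders for your actual setup:

import time

def process_record(line):
    # Placeholder for your real per-record computation.
    return line.upper()

def stub(line):
    # Trivial stand-in: same reading pattern, no real work.
    return line

def timed_pass(path, func):
    start = time.time()
    with open(path) as f:
        for line in f:
            func(line)
    return time.time() - start

real = timed_pass('sample.txt', process_record)
fake = timed_pass('sample.txt', stub)
print('with computation: %.2fs, stub only: %.2fs' % (real, fake))

If the two times come out close, you're IO-bound, and adding cores
won't buy you much.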
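
And for step 3, here is a minimal sketch of the single-reader /
worker-pool pattern using the standard multiprocessing module. I'm
using processes rather than threads here because CPython's GIL keeps
CPU-bound threads from using more than one core; the file names and
process_record are again placeholders:

import multiprocessing

def process_record(line):
    # Placeholder for the real per-record computation.
    return line.upper()

def main():
    pool = multiprocessing.Pool(processes=10)
    with open('input.txt') as infile, open('output.txt', 'w') as outfile:
        # The main process is the single reader; imap farms lines out
        # to the workers in batches of 10 (chunksize=10) and yields
        # results in input order, so the single writer below collates
        # the output without any extra synchronization.
        for result in pool.imap(process_record, infile, chunksize=10):
            outfile.write(result)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Because imap preserves input order, the collating in step 4 comes for
free; if order doesn't matter, imap_unordered may give you a bit more
throughput.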