> I have a file with a few million lines. I want to process each block of 8
> lines, and from my estimate my job is not IO bound. In other words, it
> takes a lot more time to do the computation than it would take to
> simply read the file.
>
> I am wondering how I can go about reading data from this file at a faster
> pace and then farm out the jobs to a worker function using the
> multiprocessing module.
>
> I can think of two ways.
>
> 1. Split the file and read it in parallel (didn't work well for me),
>    primarily because I don't know how to read a file in parallel
>    efficiently.
> 2. Keep reading the file sequentially into a buffer of some size and
>    farm out chunks of the data through multiprocessing.
>
> Any example would be of great help.
>
The general logic should work, but I did not test it with a real file.

    from itertools import islice

    # file: path to the input file
    with open(file, 'r') as f:
        data = f.readlines()

    iterdata = iter(data)
    grouped_data = []
    while True:
        # Take the next 8 lines; the final block may be shorter.
        block = list(islice(iterdata, 8))
        if not block:
            break
        grouped_data.append(block)

    # batch_process on grouped_data

Theoretically you might be able to call next() directly on the file
without doing readlines(), since a file object is its own iterator.

Ramit

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
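
For the "farm out" half of the question, here is a minimal, untested-on-real-data sketch of how blocks of 8 lines might be handed to a multiprocessing pool. The names blocks, process_block and run are hypothetical, process_block is only a placeholder for the real computation, and the chunksize of 100 is an arbitrary starting point rather than a tuned value.

```python
import multiprocessing
from itertools import islice


def blocks(lines, size=8):
    """Yield successive lists of `size` lines; the last one may be shorter."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def process_block(block):
    # Hypothetical placeholder for the real CPU-heavy work on one block.
    return sum(len(line) for line in block)


def run(path, workers=None, chunksize=100):
    """Read `path` sequentially and process its 8-line blocks in a pool."""
    with open(path, 'r') as f, multiprocessing.Pool(workers) as pool:
        # imap returns results in the same order as the input blocks.
        # chunksize batches several blocks per task to cut IPC overhead.
        return list(pool.imap(process_block, blocks(f), chunksize=chunksize))
```

The file is iterated directly here instead of calling readlines(). On platforms that start workers by spawning a fresh interpreter (Windows, for example), run() should be called from under an `if __name__ == "__main__":` guard, and process_block must be defined at module level so it can be pickled.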