Re: [Tutor] using multiprocessing efficiently to process large data file

Prasad, Ramit Thu, 30 Aug 2012 15:40:24 -0700

> I have a file with a few million lines. I want to process each block of 8
> lines and from my estimate my job is not IO bound. In other words it
> takes a lot more time to do the computation than it would take for
> simply reading the file.
> 
> I am wondering how can I go about reading data from this at a faster
> pace and then farm out the jobs to worker function using
> multiprocessing module.
> 
> I can think of two ways.
> 
> 1. split the file and read it in parallel (didn't work well for me)
> primarily because I don't know how to read a file in parallel
> efficiently.
> 2. keep reading the file sequentially into a buffer of some size and
> farm out chunks of the data through multiprocessing.
> 
> Any example would be of great help.
>


The general logic should work, but I did not test it with a real file.

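from itertools import islice

filename = 'data.txt'  # placeholder; use the path to your file
with open(filename, 'r') as f:
    data = f.readlines()
iterdata = iter(data)
grouped_data = []
for d in iterdata:
    # d plus the next 7 lines form one block of 8;
    # the last block is shorter if the line count is not a multiple of 8
    l = [d] + list(islice(iterdata, 7))
    grouped_data.append(l)

# batch_process on grouped data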
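
For the batch_process step, a minimal sketch using multiprocessing.Pool
might look like the following. The names process_block and batch_process,
and the workers and chunksize values, are placeholders for illustration
only; the real per-block computation goes where the sum is.

```python
from multiprocessing import Pool


def process_block(block):
    # Hypothetical stand-in for the real (CPU-heavy) work on one block
    # of lines; here it just counts characters.
    return sum(len(line) for line in block)


def batch_process(grouped_data, workers=2, chunksize=16):
    # chunksize sends several blocks to a worker per task, which cuts
    # down on inter-process communication overhead.
    with Pool(processes=workers) as pool:
        return pool.map(process_block, grouped_data, chunksize=chunksize)


if __name__ == '__main__':
    # Two tiny made-up blocks of 8 lines each, just to exercise the code.
    demo = [['a\n'] * 8, ['bb\n'] * 8]
    print(batch_process(demo))
```

The worker function has to live at module level so it can be pickled,
and the __main__ guard keeps child processes from re-running the pool
setup on platforms that start workers by importing the script.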

You should also be able to call next() directly on the file
object, since it iterates over its own lines, which avoids
reading the whole file into memory with readlines().
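
A rough sketch of that idea, pulling blocks straight from the file
iterator instead of a list. The helper name read_blocks and the path
'data.txt' are made up for the example; I have only tried this kind of
thing on small inputs.

```python
from itertools import islice


def read_blocks(lines, size=8):
    # lines can be any iterable of lines, e.g. an open file object.
    lines = iter(lines)
    while True:
        # Take up to `size` lines; an empty list means end of input.
        block = list(islice(lines, size))
        if not block:
            return
        yield block


if __name__ == '__main__':
    import os
    path = 'data.txt'  # placeholder path
    if os.path.exists(path):
        with open(path, 'r') as f:
            for block in read_blocks(f):
                print(len(block))
```

Since read_blocks is a generator, it can be handed to something like
Pool.imap in place of a prebuilt list, so grouping and processing can
overlap.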



Ramit


