Re: [Tutor] using multiprocessing efficiently to process large data file

2012-09-02 Thread Alan Gauld

On 02/09/12 06:48, eryksun wrote:



 from multiprocessing import Pool, cpu_count
 from itertools import izip_longest, imap

 FILE_IN = '...'
 FILE_OUT = '...'

 NLINES = 100 # estimate this for a good chunk_size
 BATCH_SIZE = 8

 def func(batch):
     """ test func """
     import os, time
     time.sleep(0.001)
     return "%d: %s\n" % (os.getpid(), repr(batch))

 if __name__ == '__main__': # <-- required for Windows


Why?
What difference does that make in Windows?


     file_in, file_out = open(FILE_IN), open(FILE_OUT, 'w')
     nworkers = cpu_count() - 1

     with file_in, file_out:
         batches = izip_longest(*[file_in] * BATCH_SIZE)
         if nworkers > 0:
             pool = Pool(nworkers)
             chunk_size = NLINES // BATCH_SIZE // nworkers
             result = pool.imap(func, batches, chunk_size)
         else:
             result = imap(func, batches)
         file_out.writelines(result)


just curious.

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/



Re: [Tutor] using multiprocessing efficiently to process large data file

2012-09-02 Thread eryksun
On Sun, Sep 2, 2012 at 2:41 AM, Alan Gauld alan.ga...@btinternet.com wrote:

  if __name__ == '__main__': # -- required for Windows

 Why?
 What difference does that make in Windows?

It's a hack to get around the fact that Win32 doesn't fork(). Windows
calls CreateProcess(), which loads a fresh interpreter.
multiprocessing then loads the module under a different name (i.e. not
'__main__'). Without the guard, the module-level code would run again in
each child and create another Pool, and so on.

This is also why you can't share global data in Windows. A forked
process in Linux uses copy on write, so you can load a large block of
data before calling fork() and share it. In Windows the module is
executed separately for each process, so each has its own copy. To
share data in Windows, I think the fastest option is to use a ctypes
shared Array. The example I wrote is just using the default Pool setup
that serializes (pickle) over pipes.
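
A minimal sketch of that shared-Array idea (the data values, the worker
function and the lock=False read-only setting are illustrative
placeholders, not part of the example above):

from multiprocessing import Pool, Array

def init_worker(shared):
    # runs once in each child; stash the shared array in a module global
    global DATA
    DATA = shared

def work(i):
    # workers read from shared memory instead of a pickled copy
    return DATA[i] * 2

if __name__ == '__main__':
    data = Array('d', [0.5, 1.5, 2.5, 3.5], lock=False)  # read-only, so no lock
    pool = Pool(2, initializer=init_worker, initargs=(data,))
    print pool.map(work, range(len(data)))  # [1.0, 3.0, 5.0, 7.0]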

FYI, the Win32 API imposes the requirement to use CreateProcess(). The
native NT kernel has no problem forking (e.g. for the POSIX
subsystem). I haven't looked closely enough to know why they didn't
implement fork() in Win32.


Re: [Tutor] using multiprocessing efficiently to process large data file

2012-09-01 Thread Wayne Werner

On Thu, 30 Aug 2012, Abhishek Pratap wrote:


Hi Guys

I have a file with a few million lines. I want to process each block of 8
lines, and from my estimate my job is not IO bound. In other words it
takes a lot more time to do the computation than it would take to
simply read the file.

I am wondering how I can go about reading data from this file at a faster
pace and then farm out the jobs to a worker function using the
multiprocessing module.

I can think of two ways.

1. split the file and read it in parallel (didn't work well for me),
primarily because I don't know how to read a file in parallel
efficiently.
2. keep reading the file sequentially into a buffer of some size and
farm out chunks of the data through multiprocessing.


As other folks have mentioned, having at least your general algorithm 
available would make it a lot easier to help.


But here's another way that you could iterate over the file, if you know 
exactly how many lines you have (or at least a number the line count is 
divisible by):


with open('inputfile') as f:
for line1, line2, line3, line4 in zip(f,f,f,f):
# do your processing here

The caveat is that if your line count isn't evenly divisible by 4, then 
you'll lose the last count % 4 lines.


The reason this can work is that zip() combines several sequences and 
returns a new iterator. In this case it's combining the file handle f 
with itself four times, and a file handle is itself an iterator. So on 
each pass through the for loop, next() is called on f four more times. 
The problem, of course, is what happens when you reach the end of the 
file. Say your last pass starts with only one line left: when zip's 
iterator calls next() on the first f, that returns the last line, but f 
is now at the end of the file, so the next call raises StopIteration, 
which ends your loop without actually processing anything from that 
final, partial batch!
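
A quick sketch of both behaviours, using a hypothetical in-memory "file" 
of five lines (StringIO stands in for a real file here):

from StringIO import StringIO
from itertools import izip_longest

f = StringIO("1\n2\n3\n4\n5\n")       # pretend file with 5 lines
print zip(f, f, f, f)                  # [('1\n', '2\n', '3\n', '4\n')] - line 5 is dropped
f.seek(0)
print list(izip_longest(f, f, f, f))   # last batch is padded with None instead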


So, this probably isn't the best way to handle your issue, but maybe it 
is!


HTH,
Wayne


Re: [Tutor] using multiprocessing efficiently to process large data file

2012-09-01 Thread eryksun
On Sat, Sep 1, 2012 at 9:14 AM, Wayne Werner wa...@waynewerner.com wrote:

 with open('inputfile') as f:
 for line1, line2, line3, line4 in zip(f,f,f,f):
 # do your processing here

Use itertools.izip_longest (zip_longest in 3.x) for this. Items in the
final batch are set to fillvalue (defaults to None) if the iterator
has reached the end of the file.

Below I've included a template that uses a multiprocessing.Pool, but
only if there are cores available. On a single-core system it falls
back to using itertools.imap (use built-in map in 3.x).

from multiprocessing import Pool, cpu_count
from itertools import izip_longest, imap

FILE_IN = '...'
FILE_OUT = '...'

NLINES = 100 # estimate this for a good chunk_size
BATCH_SIZE = 8

def func(batch):
    """ test func """
    import os, time
    time.sleep(0.001)
    return "%d: %s\n" % (os.getpid(), repr(batch))

if __name__ == '__main__': # <-- required for Windows

    file_in, file_out = open(FILE_IN), open(FILE_OUT, 'w')
    nworkers = cpu_count() - 1

    with file_in, file_out:
        batches = izip_longest(*[file_in] * BATCH_SIZE)
        if nworkers > 0:
            pool = Pool(nworkers)
            chunk_size = NLINES // BATCH_SIZE // nworkers
            result = pool.imap(func, batches, chunk_size)
        else:
            result = imap(func, batches)
        file_out.writelines(result)


Re: [Tutor] using multiprocessing efficiently to process large data file

2012-08-31 Thread Prasad, Ramit
Please always respond to the list. And avoid top posting.

 -Original Message-
 From: Abhishek Pratap [mailto:abhishek@gmail.com]
 Sent: Thursday, August 30, 2012 5:47 PM
 To: Prasad, Ramit
 Subject: Re: [Tutor] using multiprocessing efficiently to process large data
 file
 
 Hi Ramit
 
 Thanks for your quick reply. Unfortunately, given the size of the file,
 I can't afford to load it all into memory in one go.
 I could read, let's say, the first 1 million lines, process them in
 parallel, and so on. I am looking for an example which does something similar.
 
 -Abhi
 

The same logic should work; just process your batch after checking its
size, and iterate over the file directly instead of reading it all into memory.

with open( file, 'r' ) as f:
    iterdata = iter(f)
    grouped_data = []
    for d in iterdata:
        # pad with '' at EOF; make this list 8 elements instead of 2
        l = [d, next(iterdata, '')]
        grouped_data.append( l )
        if len(grouped_data) > 1000000/8: # one million lines
            # process batch
            grouped_data = []
    # remember to process any leftover items in grouped_data here
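
For the record, the standard "grouper" recipe from the itertools docs
does the same batching a bit more directly (a sketch only; the file name
and the process_batch() call are placeholders):

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    # collect data into fixed-length batches (recipe from the itertools docs)
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

with open('inputfile') as f:
    for batch in grouper(f, 8):      # tuples of 8 lines, padded with None at EOF
        process_batch(batch)         # hypothetical per-batch worker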




[Tutor] using multiprocessing efficiently to process large data file

2012-08-30 Thread Abhishek Pratap
Hi Guys

I have a file with a few million lines. I want to process each block of 8
lines, and from my estimate my job is not IO bound. In other words it
takes a lot more time to do the computation than it would take to
simply read the file.

I am wondering how I can go about reading data from this file at a faster
pace and then farm out the jobs to a worker function using the
multiprocessing module.

I can think of two ways.

1. split the file and read it in parallel (didn't work well for me),
primarily because I don't know how to read a file in parallel
efficiently.
2. keep reading the file sequentially into a buffer of some size and
farm out chunks of the data through multiprocessing.

Any example would be of great help.

Thanks!
-Abhi


Re: [Tutor] using multiprocessing efficiently to process large data file

2012-08-30 Thread Prasad, Ramit
 I have a file with a few million lines. I want to process each block of 8
 lines, and from my estimate my job is not IO bound. In other words it
 takes a lot more time to do the computation than it would take to
 simply read the file.

 I am wondering how I can go about reading data from this file at a faster
 pace and then farm out the jobs to a worker function using the
 multiprocessing module.

 I can think of two ways.

 1. split the file and read it in parallel (didn't work well for me),
 primarily because I don't know how to read a file in parallel
 efficiently.
 2. keep reading the file sequentially into a buffer of some size and
 farm out chunks of the data through multiprocessing.
 
 Any example would be of great help.


The general logic should work, but I did not test it with a real file.

with open( file, 'r' ) as f:
    data = f.readlines()
    iterdata = iter(data)
    grouped_data = []
    for d in iterdata:
        # pad with '' at EOF; make this list 8 elements instead of 2
        l = [d, next(iterdata, '')]
        grouped_data.append( l )

# batch_process on grouped_data

Theoretically you might be able to call next() directly on
the file without doing readlines().
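
Indeed - a file object is its own iterator, so a sketch of that variant
(batch size of 8, '' padding at end of file; the file name and
process_batch() are placeholders) could look like:

with open('somefile') as f:
    for first in f:
        # the for loop supplies one line; next() grabs 7 more, padding with ''
        batch = [first] + [next(f, '') for _ in range(7)]
        process_batch(batch)     # hypothetical per-batch worker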



Ramit




Re: [Tutor] using multiprocessing efficiently to process large data file

2012-08-30 Thread Alan Gauld

On 30/08/12 23:19, Abhishek Pratap wrote:


I am wondering how I can go about reading data from this file at a faster
pace and then farm out the jobs to a worker function using the
multiprocessing module.

I can think of two ways.

1. split the file and read it in parallel (didn't work well for me),
primarily because I don't know how to read a file in parallel
efficiently.


Can you show us what you tried? It's always easier to give an answer to 
a concrete example than to a hypothetical scenario.



2. keep reading the file sequentially into a buffer of some size and
farm out a chunks of the data through multiprocessing.


This is the model I've used. In pseudo code:

chunk = []
for line, data in enumerate(file, 1):
    chunk.append(data)
    if line % chunksize == 0:
        launch_subprocess(chunk)
        chunk = []
# (launch any final partial chunk here too)

I'd tend to go for big chunks - if you have a million lines in your file 
I'd pick a chunksize of around 10,000-100,000 lines. If you go too small 
the overhead of starting the subprocess will swamp any gains
you get. Also remember the constraints of how many actual CPUs/Cores you 
have. Too many tasks spread over too few CPUs will just cause more 
swapping. Any less than 4 cores is probably not worth the effort. Just 
maximise the efficiency of your algorithm - which is probably worth 
doing first anyway.
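
A rough sketch of that chunk-and-dispatch model with a multiprocessing 
Pool (the chunk size, file name and process_chunk() function are 
placeholders, untested against a real workload):

from multiprocessing import Pool

CHUNKSIZE = 10000   # lines per task; tune as discussed above

def process_chunk(chunk):
    # hypothetical CPU-bound work on a list of lines
    return len(chunk)

def read_chunks(path, size):
    # read the file sequentially and yield lists of `size` lines
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk     # final, possibly short, chunk

if __name__ == '__main__':
    pool = Pool()       # defaults to cpu_count() workers
    results = pool.map(process_chunk, read_chunks('bigfile.txt', CHUNKSIZE))
    print sum(results)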


HTH,
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
