Re: help with multiprocessing pool

Philip Semanchuk Thu, 27 Jan 2011 07:06:51 -0800

On Jan 25, 2011, at 8:19 PM, Craig Yoshioka wrote:

> Hi all, 
> 
> I could really use some help with a problem I'm having.



Hiya Craig,
I don't know if I can help, but it's really difficult to do without a full 
working example. 

Also, your code has several OS X-isms in it so I guess that's the platform 
you're on. But in case you're on Windows, note that that platform requires some 
extra care when using multiprocessing:
http://docs.python.org/library/multiprocessing.html#windows


Good luck
Philip


> I wrote a function that can take a pattern of actions and it apply it to the 
> filesystem.
> It takes a list of starting paths, and a pattern like this:
> 
> pattern = {
>            InGlob('Test/**'):{
>               MatchRemove('DS_Store'):[],
>                NoMatchAdd('(alhpaID_)|(DS_Store)','warnings'):[],
>                MatchAdd('alphaID_','alpha_found'):[],
>               InDir('alphaID_'):{
>                    NoMatchAdd('(betaID_)|(DS_Store)','warnings'):[],
>                    InDir('betaID_'):{
>                        NoMatchAdd('(gammaID_)|(DS_Store)','warnings'):[],
>                        MatchAdd('gammaID_','gamma_found'):[] }}}}
> 
> so if you run evalFSPattern(['Volumes/**'],pattern) it'll return a dictionary 
> where:
> 
> dict['gamma_found'] = [list of paths that matched] (i.e. 
> '/Volumes/HD1/Test/alphaID_3382/betaID_38824/gammaID_848384')
> dict['warning'] = [list of paths that failed to match] (ie. 
> '/Volumes/HD1/Test/alphaID_3382/gammaID_47383') 
> 
> Since some of these volumes are on network shares I also wanted to 
> parallelize this so that it would not block on IO.  I started the 
> parallelization by using multiprocessing.Pool and got it to work if I ran the 
> fsparser from the interpreter.  It ran in *much* less time and produced 
> correct output that matched the non-parallelized version.  The problem begins 
> if I then try to use the parallelized function from within the code.
> 
> For example I wrote a class whose instances are created around valid FS 
> paths, that are cached to reduce expensive FS lookups.
> 
> class Experiment(object):
>       
>       SlidePaths = None
> 
>       @classmethod
>        def getSlidePaths(cls):
>              if cls.SlidePaths == None:
>                   cls.SlidePaths = fsparser(['/Volumes/**'],pattern)
>             return cls.SlidePaths
>       
>       @classmethod
>       def lookupPathWithGammaID(cls,id):
>                paths = cls.getSlidePaths()
>                ...
>                return paths[selected]
> 
>       @classmethod
>        def fromGamaID(cls,id):
>               path = cls.lookupPathWithGammaID(id)
>                return cls(path)
>       
>       def __init__(self,path)
>               self.Path = path
>               ...
>       
>       ...     
> 
> If I do the following from the interpreter it works:
> 
>>>> from experiment import Experiment
>>>> expt = Experiment.fromGammaID(10102)
> 
> but if I write a script called test.py:
> 
> from experiment import Experiment
> expt1 = Experiment.fromGammaID(10102)
> expt2 = Experiment.fromGammaID(10103)
> comparison = expt1.compareTo(expt2)
> 
> it fails, if I try to import it or run it from bash prompt:
> 
>>>> from test import comparison (hangs forever)
> $ python test.py (hangs forever)
> 
> I would really like some help trying to figure this out...  I thought it 
> should work easily since all the spawned processes don't share data or state 
> (their results are merged in the main thread).  The classes used in the 
> pattern are also simple python objects (use python primitives).
> 
> 
> These are the main functions:
> 
> def mapAction(pool,paths,action):
>    merge = {'next':[]}
>    for result in pool.map(action,paths):
>        if result == None:
>            continue
>        merge = mergeDicts(merge,result)
>    return merge
> 
> 
> def mergeDicts(d1,d2):
>    for key in d2:
>        if key not in d1:
>            d1[key] = d2[key]
>        else:
>            d1[key] += d2[key]
>    return d1
> 
> 
> def evalFSPattern(paths,pattern):
>    pool = Pool(10)
>    results = {}
>    for action in pattern:
>        tomerge1 = mapAction(pool,paths,action)
>        tomerge2 = evalFSPattern(tomerge1['next'],pattern[action])
>        del tomerge1['next']
>        results = mergeDicts(results,tomerge1)
>        results = mergeDicts(results,tomerge2)
>    return results
> 
> the classes used in the pattern (InGlob,NoMatchAdd,etc.) are callable classes 
> that take a single parameter (a path) and return a dict result or None which 
> makes them trivial to adapt to Pool.map.
> 
> Note if I change the mapAction function to:
> 
> def mapAction(pool,paths,action):
>    merge = {'next':[]}
>    for path in paths:
>         result = action(path)
>         if result == None:
>            continue
>        merge = mergeDicts(merge,result)
>    return merge
> 
> everything works just fine.
> 
> 
> Thanks.
> 
> 
> -- 
> http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: help with multiprocessing pool

Reply via email to