Folks,

I am writing a fairly extensive piece of scientific software. Its workflow is easy to explain: the tool performs a series of operations on watersheds (such as mapping data onto them, geostatistics and more). There are thousands of independent watersheds of different sizes, and the size determines the computing time spent on each of them.

Say I have the operations A, B, C and D. B and C are completely independent, but they need A to be run first; D needs both B and C, and so forth. Eventually the whole chain A, B, C, D will run just once, but of course development is an iterative process and I rerun individual operations many times.
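In other words, for a single watershed the logical order is simply (pseudo-code; as shown below, the real functions take the watershed directory as their only argument):

    A(watershed_dir)   # must run first
    B(watershed_dir)   # needs A
    C(watershed_dir)   # needs A, independent of B
    D(watershed_dir)   # needs both B and C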

Currently my workflow is defined as follows:

Define a unique ID and file directory for each watershed, and define A and B:

import os
import pickle

def A(watershed_dir):
    # read some external data
    # do stuff
    # store the results in a Watershed object
    watershed = Watershed()  # schematic: in the real code this gets filled with data
    # save it
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)

def B(watershed_dir):
    # load the watershed saved by A
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    with open(f_pickle, 'rb') as f:
        watershed = pickle.load(f)
    # do new stuff
    # store it in watershed and save again
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)
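
The Watershed class itself is currently little more than an empty container which A, B and C fill with attributes; roughly (the attribute names here are made up):

    class Watershed(object):
        # data container: attributes are created and filled by A, B, C and D
        def __init__(self, wid=None):
            self.wid = wid  # unique watershed ID
            # A later sets e.g. self.raw_data, B sets self.statistics, etc.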

So the watershed object is a data container which grows in content. The pickle that stores the info can reach a few Mb in size. I chose this strategy because A, B, C and D are independent, but they can share their results through the pickle. The functions take a single argument (the path to the working directory), which means that when I run the thousands of catchments I can use a multiprocessing pool:

        import multiprocessing as mp
        poolargs = [list of directories]
        pool = mp.Pool()
        poolout = pool.map(A, poolargs, chunksize=1)
        poolout = pool.map(B, poolargs, chunksize=1)
        etc.

I can easily choose to rerun just B without rerunning A. Reading and writing the pickles takes little time compared with the rest of the work (running B or C on a single catchment can take seconds, for example).
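
Concretely, rerunning only a subset just means mapping only the functions I need. A small driver along these lines would do it (run_steps is just a sketch, not in the real code):

    def run_steps(steps, directories):
        # map each requested operation over all watershed directories,
        # in an order that respects the dependencies
        pool = mp.Pool()
        for step in steps:
            pool.map(step, directories, chunksize=1)
        pool.close()
        pool.join()

    # after changing only B:
    # run_steps([B], poolargs)
    # full chain:
    # run_steps([A, B, C, D], poolargs)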

Now, to my questions:
1. Does that seem reasonable?
2. Should Watershed be an object, or should it be a simple dictionary (see the sketch after this list)? I thought that an object could be nice, because it could take care of some operations such as plotting and logging. Currently I have defined a class Watershed, but its attributes are defined and filled by A, B and C (this seems a bit wrong to me). I could give more responsibilities to this class, but it might become way too big: since the whole purpose of the tool is to work on watersheds, a Watershed class that does everything actually sounds like a code smell (http://en.wikipedia.org/wiki/God_object).
3. The operation A opens an external file, reads data out of it and writes it into the Watershed object. Is it a bad idea to multiprocess this? (I guess it is, since the file might be read twice at the same time.)
4. Other comments you might have?
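
For question 2, the plain-dictionary alternative would look roughly like this (the keys are made up, just for illustration):

    watershed = {'id': watershed_id}
    # A, B and C would then simply add keys instead of attributes, e.g.:
    # watershed['raw_data'] = ...      # filled by A
    # watershed['statistics'] = ...    # filled by B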

Sorry for the lengthy mail, but thanks for any tips.

Fabien
