Folks,

I am writing a fairly extensive piece of scientific software. Its workflow is easy to explain: the tool performs a series of operations on watersheds (such as mapping data onto them, geostatistics and more). There are thousands of independent watersheds of different sizes, and the size determines the computing time spent on each of them.

Say I have the operations A, B, C and D. B and C are completely independent, but they need A to be run first; D needs both B and C, and so forth. Eventually the whole chain A, B, C, D will run just once, but of course development is an iterative process and I rerun individual operations many times.
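In other words, for a single watershed the logical order is simply (pseudo-code; as shown below, the real functions take the watershed directory as their only argument):

    A(watershed_dir)   # must run first
    B(watershed_dir)   # needs A
    C(watershed_dir)   # needs A, independent of B
    D(watershed_dir)   # needs both B and C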

Currently my workflow is defined as follows:

Define a unique ID and file directory for each watershed, and define A and B:

import os
import pickle

def A(watershed_dir):
    # read some external data
    # do stuff
    # store the results in a Watershed object
    watershed = Watershed()  # schematic: in the real code this gets filled with data
    # save it
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)

def B(watershed_dir):
    # load the watershed saved by A
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    with open(f_pickle, 'rb') as f:
        watershed = pickle.load(f)
    # do new stuff
    # store it in watershed and save again
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)
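
The Watershed class itself is currently little more than an empty container which A, B and C fill with attributes; roughly (the attribute names here are made up):

    class Watershed(object):
        # data container: attributes are created and filled by A, B, C and D
        def __init__(self, wid=None):
            self.wid = wid  # unique watershed ID
            # A later sets e.g. self.raw_data, B sets self.statistics, etc.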

So the watershed object is a data container which grows in content. The pickle that stores the info can reach a few Mb in size. I chose this strategy because A, B, C and D are independent, but they can share their results through the pickle. The functions take a single argument (the path to the working directory), which means that when I run the thousands of catchments I can use a multiprocessing pool:

        import multiprocessing as mp
        poolargs = [list of directories]
        pool = mp.Pool()
        poolout = pool.map(A, poolargs, chunksize=1)
        poolout = pool.map(B, poolargs, chunksize=1)
        etc.

I can easily choose to rerun just B without rerunning A. Reading and writing the pickles takes little time compared with the rest of the work (running B or C on a single catchment can take seconds, for example).
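
Concretely, rerunning only a subset just means mapping only the functions I need. A small driver along these lines would do it (run_steps is just a sketch, not in the real code):

    def run_steps(steps, directories):
        # map each requested operation over all watershed directories,
        # in an order that respects the dependencies
        pool = mp.Pool()
        for step in steps:
            pool.map(step, directories, chunksize=1)
        pool.close()
        pool.join()

    # after changing only B:
    # run_steps([B], poolargs)
    # full chain:
    # run_steps([A, B, C, D], poolargs)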

Now, to my questions:
1. Does that seem reasonable?
2. Should Watershed be an object, or should it be a simple dictionary (see the sketch after this list)? I thought that an object could be nice, because it could take care of some operations such as plotting and logging. Currently I have defined a class Watershed, but its attributes are defined and filled by A, B and C (this seems a bit wrong to me). I could give more responsibilities to this class, but it might become way too big: since the whole purpose of the tool is to work on watersheds, a Watershed class that does everything actually sounds like a code smell (http://en.wikipedia.org/wiki/God_object).
3. The operation A opens an external file, reads data out of it and writes it into the Watershed object. Is it a bad idea to multiprocess this? (I guess it is, since the file might be read twice at the same time.)
4. Other comments you might have?
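
For question 2, the plain-dictionary alternative would look roughly like this (the keys are made up, just for illustration):

    watershed = {'id': watershed_id}
    # A, B and C would then simply add keys instead of attributes, e.g.:
    # watershed['raw_data'] = ...      # filled by A
    # watershed['statistics'] = ...    # filled by B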

Sorry for the lengthy mail, but thanks for any tips.

Fabien
