On 13.04.2015 18:25, Dave Angel wrote:
On 04/13/2015 10:58 AM, Fabien wrote:
Folks,


A comment.  Pickle is a method of creating persistent data, most
commonly used to preserve data between runs.  A database is another
method.  Although either one can also be used with multiprocessing, you
seem to be worrying more about the mechanism, and not enough about the
problem.

I am writing a quite extensive piece of scientific software. Its
workflow is quite easy to explain: the tool performs a series of
operations on watersheds (such as mapping data onto them, geostatistics
and more). There are thousands of independent watersheds of different
sizes, and the size determines the computing time spent on each of them.

First question:  what is the name or "identity" of a watershed?
Apparently it's named by a directory.  But you mention ID as well.  You
write a function A() that takes only a directory name. Is that the name
of the watershed?  One per directory?  And you can derive the ID from
the directory name?

Second question, is there any communication between watersheds, or are
they totally independent?

Third:  this "external data", is it dynamic, do you have to fetch it in
a particular order, is it separated by watershed id, or what?

Fourth:  when the program starts, are the directories all empty, so the
presence of a pickle file tells you that A() has run?  Or is there some
other meaning for those files?


Say I have the operations A, B, C and D. B and C are completely
independent, but they both need A to be run first; D needs B and C, and
so forth. Eventually the whole set of operations A, B, C and D will run
once for all,

For all what?

but of course the whole development is an iterative process and I
rerun all operations many times.

Based on what?  Is the external data changing, and you have to rerun
functions to update what you've already stored about them?  Or do you
just mean you call the A() function on every possible watershed?



(I suddenly have to go out, so I can't comment on the rest, except that
choosing to pickle, or to marshal, or to use a database, or to
custom-serialize seems a bit premature.  You may have it all clear in
your head, but I can't see what the interplay between all these calls to
one-letter-named functions is intended to be.)


Thanks Dave for your interest. Here is an example:

external files:
- watershed outlines (single file)
- global topography (single file)
- climate data (single file)

Each watershed has an ID. Each watershed is completely independent.

So function A, for example, will take one ID as argument, open the watershed file and extract that watershed's outlines, make a local map, open the topography file and extract the relevant part of it, then make a watershed object and store the watershed's local data in it.
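
In (very simplified) code it would look something like the sketch below; the helper functions and the Watershed class are placeholders, not my actual code:

import os
import pickle

def task_a(wid):
    # Placeholder helpers: extract this watershed's outlines, then crop
    # the global topography to the region of interest.
    outlines = read_outlines('outlines.shp', wid)
    local_topo = crop_to_region(read_topography('topography.nc'), outlines)

    ws = Watershed(wid)            # container for all per-watershed data
    ws.outlines = outlines
    ws.topography = local_topo

    with open(os.path.join(wid, 'watershed.p'), 'wb') as f:
        pickle.dump(ws, f)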

Function B will open the watershed pickle, take the local information it needs (like the local topography, already cropped to the region of interest) and map climate data onto it.
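
Again as a sketch (map_climate is a placeholder for the real interpolation routine):

import os
import pickle

def task_b(wid):
    fpath = os.path.join(wid, 'watershed.p')
    with open(fpath, 'rb') as f:
        ws = pickle.load(f)

    # map_climate is a placeholder for the real mapping routine
    ws.climate = map_climate('climate.nc', ws.topography)

    with open(fpath, 'wb') as f:   # store the enriched object again
        pickle.dump(ws, f)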

And so forth, so that each function A, B, C, ... builds upon the information from the others and adds its own "service" in terms of data.
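
Since the watersheds are completely independent, the natural unit for multiprocessing would be one watershed, with the tasks run in dependency order inside each worker. A minimal sketch (task names as above, hypothetical):

from multiprocessing import Pool

def process_one(wid):
    # dependency order: B and C both need only A, D needs B and C
    task_a(wid)
    task_b(wid)
    task_c(wid)
    task_d(wid)

if __name__ == '__main__':
    ids = ['0128', '0129']         # thousands of IDs in reality
    with Pool() as pool:
        pool.map(process_one, ids)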

Currently, all data (mostly numpy arrays and vector objects) are stored as object attributes, which I guess is bad practice. It's a kind of "database for dummies": reading the topography of watershed ID 0128 means:
- open watershed.p in the '0128' directory
- read the watershed.topography attribute
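
In code, roughly:

import os
import pickle

with open(os.path.join('0128', 'watershed.p'), 'rb') as f:
    ws = pickle.load(f)
topo = ws.topography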

I think I like Peter's idea to follow a file-based workflow instead, and forget about my watershed object for now.
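
That would mean each task writes its own product file into the watershed directory, and a task is skipped when its output already exists. A sketch of what I have in mind (file names hypothetical):

import os

def run_if_needed(task, outname, wid):
    # run task(wid) only if its output file is not there yet
    if not os.path.exists(os.path.join(wid, outname)):
        task(wid)

run_if_needed(task_a, 'topography.npz', '0128')
run_if_needed(task_b, 'climate.npz', '0128')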

But I'd still be interested in your comments if you find time for it.

Fabien
