#3166: Parallelization with tiling for grass.script
--------------------------+------------------------------
  Reporter:  wenzeslaus   |      Owner:  grass-dev@…
      Type:  enhancement  |     Status:  new
  Priority:  normal       |  Milestone:  7.4.0
 Component:  Python       |    Version:  unspecified
Resolution:               |   Keywords:  script, parallel
       CPU:  Unspecified  |   Platform:  Unspecified
--------------------------+------------------------------

Comment (by wenzeslaus):

 Yes, I would like to reconcile the two APIs or implementations (or both). At this point, I still see too many differences.

 Replying to [comment:4 huhabla]:
 > IMHO, the for-loop to setup the processing commands for the TiledWorkflow can be avoided when using the PyGRASS Module and MultiModule approach.

 The API with the for-loop is actually based on the case where the user wants a for loop like this one:

 {{{
 #!python
 for i in range(0, 5):
     gs.run_command('r.module', num=i)
     gs.mapcalc(expr, num=i)
 }}}

 I had code like this and I wanted to parallelize the individual loop runs, which are independent. So I came up with the following API, which does not change much in the main part of the code:

 {{{
 #!python
 workflow = SeriesWorkflow()  # currently called ModuleCallList
 for i in range(0, 5):
     workflow.run_command('r.module', num=i)
     workflow.mapcalc(expr, num=i)
 workflow.execute()
 }}}

 The Python functions I used in the background have some problems with interrupting and with failed subprocesses, but they handle a pool of subprocesses well, so that the given number of processes is always running (there can be one really slow process while the others just keep running in the meantime).

 Then I had a different case, where I didn't have any loop but I needed the tiling. The following API emerged from that:

 {{{
 #!python
 for namer, workflow in TiledWorkflow(width=100, height=100):
     name = namer.name('rast', i)
     workflow.run_command('r.module', num=name)
     workflow.mapcalc(expr, num=name)
     workflow.execute()
 }}}

 This was of course before r69507, but the reasons for a similar API are still there because the non-tiled workflow just has the loop anyway (if desired). One argument against the current `TiledWorkflow` would actually be that we want its API to be different from the case where the loop is actually desired by the user.

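 The queue-then-execute pattern described above can be sketched roughly as follows. This is not the parallel.py code. `QueuedWorkflow` and `_run_one` are hypothetical names, the queued commands are plain Python subprocesses rather than GRASS modules, and the sketch uses `concurrent.futures` threads instead of the multiprocessing functions mentioned above. It also treats every queued command as independent, so it does not model the ordering of the two commands within one loop run.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


class QueuedWorkflow:
    """Hypothetical stand-in for SeriesWorkflow: record calls, run them later."""

    def __init__(self, nprocs=2):
        self.nprocs = nprocs
        self._calls = []  # queued command lines, in the order they were added

    def run_command(self, *args):
        # Nothing is executed here; the command line is only remembered.
        self._calls.append(list(args))

    @staticmethod
    def _run_one(args):
        # Run one queued command and report its return code and output.
        proc = subprocess.run(args, capture_output=True, text=True)
        return proc.returncode, proc.stdout.strip()

    def execute(self):
        # Each worker thread blocks on one subprocess, so at most `nprocs`
        # subprocesses run at once and a slow one does not stall the queue.
        with ThreadPoolExecutor(max_workers=self.nprocs) as pool:
            results = list(pool.map(self._run_one, self._calls))
        self._calls = []
        return results  # same order as the commands were queued


# The main part of the script keeps its loop; only the object changes.
workflow = QueuedWorkflow(nprocs=2)
for i in range(0, 5):
    workflow.run_command(sys.executable, "-c", f"print({i} * {i})")
results = workflow.execute()
```

 Returning the return codes from `execute()` is one simple way to let the caller notice failed subprocesses; handling interrupts would need more than this sketch shows.
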
 > The PyGRASS Module objects allows to alter the input and output settings before running, so that the TiledWorkflow class could take care of the tile names, altering the user pre-configured Module objects. The user simply initiates the Modules that should be used with the original raster names.

 The user (at least me) uses variables anyway. In the `SeriesWorkflow` case, the user names the outputs as needed because all of them are preserved. With `TiledWorkflow`, the variables need to be assigned with the help of the `TiledWorkflow`, so some work is required, but not that much.

 > The PyGRASS Module allows deep copy operation to clone the existing Module objects, hence the TiledWorkflow can create any number of copies and replacing the raster names with tile names.

 I don't think it is as simple as replacing the names, which is of course possible only with PyGRASS, not grass.script. The naming step in `TiledWorkflow` simply adds maps for patching. This has the potential to handle r.mapcalc expressions as well as ''some'' basename usages like the one in r.texture. I don't have this implemented, but the user could also exclude some outputs from patching and mark them for removal instead.

 > > The implementation is now 300 lines. MultiModule alone has 200
 > >
 > > Well it is not much "Code". The doctests and the description of MultiModule are more than 100 lines. ;)

 Right. I guess my point is that parallel.py mostly relies on higher-level functions from Python multiprocessing and on grass.script, which is itself simple. Furthermore, parallel.py is more than just `TiledWorkflow`, although that's the longest and most complicated part. The design of parallel.py is to cover as many cases as possible with minimal code, and the cost is that the user needs to do something special from time to time, like the naming step for `TiledWorkflow` or the use of wrapper-like functions instead of the real ones (this applies to both `SeriesWorkflow` and `TiledWorkflow`).
 However, I think that `MultiModule` and the others are much more robust at this point. The only hope for parallel.py to be robust is that it is simple enough to become robust one day.

 I hope this clarifies a little bit more where I'm coming from. I know I was not specific in that private email a week ago.

--
Ticket URL: <https://trac.osgeo.org/grass/ticket/3166#comment:5>
GRASS GIS <https://grass.osgeo.org>

_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev