Hi Simone,

I think that if you are limited in RAM per node then using MPI is a
mistake; instead you should go for a threading-based solution so that all
the threads can pool the available RAM. Alternatively you may be running
Open MPI in hybrid mode, but that seems unlikely from your comments.
PyTables with NumExpr as the backend is the obvious solution here. Threads
do not usually scale as nicely as processes, but their setup/teardown cost
is minimal and they share RAM. If your cluster nodes are SMP (shared-memory
multiprocessing) machines, then they are well suited to this approach.
PyTables is also just generally a better HDF5 interface than h5py, in my
opinion.

http://www.pytables.org/

In particular, pay attention to the 'evaluate' functionality, which uses
this library:

https://github.com/pydata/numexpr
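
To give a flavor of what I mean, here is a rough sketch of evaluating an
expression in-memory with NumExpr and out-of-core with PyTables' Expr
(which uses NumExpr under the hood). The file and array names are made up
for illustration:

    import numpy as np
    import numexpr as ne
    import tables

    # In-memory: numexpr evaluates the expression in cache-sized blocks
    # across several threads, avoiding the large temporaries that plain
    # NumPy would allocate.
    a = np.random.rand(10**7)
    b = np.random.rand(10**7)
    c = ne.evaluate("0.5*a + b**2")

    # Out-of-core: tables.Expr runs the same kind of expression over
    # arrays stored on disk in HDF5 (file/array names are hypothetical).
    with tables.open_file("stacks.h5", mode="r+") as h5:
        x = h5.root.stack_a
        y = h5.root.stack_b
        out = h5.create_carray(h5.root, "result",
                               atom=tables.Float32Atom(), shape=x.shape)
        expr = tables.Expr("0.5*x + y**2")
        expr.set_output(out)   # stream results straight back to disk
        expr.eval()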

If those 16-core machines are dual-socket (two CPUs), then NumExpr
probably will not scale well past 8 cores, but 8 cores is still better
than 3. It's very difficult to beat NumExpr for image processing on the
CPU in the Python landscape, because such workloads are usually limited by
the data flowing between the CPUs and memory, which NumExpr minimizes by
evaluating expressions blockwise. I've used pyFFTW and NumExpr in
combination to keep up with competitors who program in Fortran 90.
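
Roughly what that combination looks like (the array sizes, thread counts
and the filter mask below are just placeholders):

    import numpy as np
    import numexpr as ne
    import pyfftw

    # Pin NumExpr to the cores of one socket; beyond that, memory
    # bandwidth rather than core count is usually the limit.
    ne.set_num_threads(8)

    # pyFFTW: plan once, then reuse the plan for every frame in the stack.
    frame = pyfftw.empty_aligned((2048, 2048), dtype="complex64")
    fft2 = pyfftw.builders.fft2(frame, threads=8)

    frame[:] = np.random.rand(2048, 2048)
    spectrum = fft2(frame)            # executes the pre-planned FFT

    # Post-process with NumExpr, e.g. apply a (placeholder) filter mask.
    mask = np.ones_like(spectrum)
    filtered = ne.evaluate("spectrum * mask")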

If your data is compressible you could also look at zarr, which uses Blosc
to compress NumPy arrays in chunks and decompresses them on the fly for
processing:

https://github.com/alimanfoo/zarr

PyTables also includes blosc support now.
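
Something along these lines (the exact API depends on the zarr version;
the chunk shape, codec settings and store path are just guesses for an
image stack):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    # Store the stack chunked one frame at a time; each chunk is
    # compressed with Blosc and decompressed only when indexed.
    compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE)
    stack = zarr.open("stack.zarr", mode="w",
                      shape=(1000, 2048, 2048), chunks=(1, 2048, 2048),
                      dtype="uint16", compressor=compressor)

    stack[0] = np.random.randint(0, 2**16, size=(2048, 2048),
                                 dtype="uint16")
    frame = stack[0]    # reads and decompresses just that one chunk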

Tools like Dask and Hadoop are aimed at parallelizing algorithms in
machine learning and similar fields. If you just want to do matrix
algebra, they're probably sub-optimal. It comes down to: how big is each
of your stacks? You said chunking isn't practical; why is that?

Robert


On Wed, Dec 28, 2016 at 11:46 PM, <scikit-image@googlegroups.com> wrote:

> Image analysis pipeline improvement suggestions
> <http://groups.google.com/group/scikit-image/t/5cef988037886d66?utm_source=digest&utm_medium=email>
> Simone Codeluppi <sim...@codeluppi.org>: Dec 28 09:58AM -0800
>
> Hi all!
>
> I would like to pick your brains for some suggestions on how to modify my
> image analysis pipeline.
>
> I am analyzing terabytes of image stacks generated using a microscope.
> The current code I generated relies heavily on scikit-image, numpy and
> scipy. In order to speed up the analysis, the code runs on an HPC computer
> (https://www.nsc.liu.se/systems/triolith/) with MPI (mpi4py) for
> parallelization and HDF5 (h5py) for file storage. The development cycle of
> the code has been pretty painful, mainly due to my unfamiliarity with MPI
> and problems compiling parallel HDF5 (with many open/closing bugs).
> However, the big drawback is that each core has only 2 GB of RAM (no
> shared RAM across nodes), and in order to run some of the processing steps
> I ended up reserving one node (16 cores) but running only 3 cores in order
> to have enough RAM (image chunking won’t work in this case). As you can
> imagine this is extremely inefficient, and I end up getting low priority
> in the queue system.
>
>
> Our lab recently bought a new 4-node server with shared RAM running
> Hadoop. My goal is to move the parallelization of the processing to Dask.
> I tested it before on another system and it works great. The drawback is
> that, if I understood correctly, parallel HDF5 works only with MPI
> (driver=’mpio’). HDF5 gave me quite a bit of headache, but it works well
> for keeping the data well structured, and I can save everything as NumPy
> arrays….very handy.
>
>
> If I move to Hadoop/Dask, what do you think would be a good solution for
> data storage? Do you have any additional suggestions that could improve
> the layout of the pipeline? Any help will be greatly appreciated.



-- 
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (C-CINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
robert.mcl...@unibas.ch
robert.mcl...@bsse.ethz.ch
robbmcl...@gmail.com
