Hello, the requirements for PHDF5 are here: <https://support.hdfgroup.org/HDF5/PHDF5/>. It may be a good idea to check whether there is a speed-up from a parallel FS + PHDF5 setup.
In my interpretation, PHDF5 pays off when you have a fully parallel system backed by a parallel file system that is capable of handling I/O in parallel: large supercomputing (batch) environments are like that. At the other end of the spectrum you can have a single-computer, single-drive system with multiple cores; AWS EC2 instances without a local HDD are like that. In the latter case, using PHDF5 pulls you into extra lines of code and some restrictions (no filters, ...), because, as you see, at some choke point there must be a mechanism to serialise all the READ/WRITE operations anyway. If you have that latter setup, a separate writer process plus a reliable software fabric (i.e. ZeroMQ + protocol buffers, or a similar queue) gets you the result.

There is also another approach: write into separate files on a local or in-memory file system, and then either 1) copy all files into one single HDF5 container, or 2) use a separate HDF5 file that links the external files into a single image. The copy/collect version works on batch processors if your 'collector' script is scheduled after the MPI job. Of course, if you do have a true parallel environment, you should indeed benefit from parallel I/O. Rough sketches of the three variants are appended below the quoted message.

best,
steve

On Mon, Feb 19, 2018 at 3:10 AM, Stefano Salvadè <[email protected]> wrote:

> Good morning everyone,
>
> I've recently started using parallel HDF5 for my company, as we wish to save analysed data on multiple files at a time. It would be an N:N case, with an output stream for each file.
>
> The main program itself is written in C#, but we already have an API that allows us to make calls to HDF5 and MPI in C and C++. It retrieves data from an external device, executes some analysis and then saves the data, and parallelizing these three parts would speed up the process. However, I'm not quite sure how to implement such parallelization for the third part.
>
> So far I've seen that parallelization is usually implemented right off the bat: the program is started with mpiexec (I'm on Windows) with a specified number of processes (like "mpiexec -n x Program.exe"). Unfortunately, running multiple instances of the whole program in parallel would be problematic, but I've seen that one should be able to spawn processes later during runtime with MPI_Comm_spawn(), indicating an executable as a target (provided that the "main" process, the program itself, has been started with "mpiexec -n 1 Program.exe", for example).
>
> This second method could work for us, but I was wondering if there is a more elegant way to achieve parallel output writing, like calling a function from my own program instead of an executable.
>
> Bonus question, just to make sure I've got the basics of PHDF5 right in the first place: do I need a process for each action that I want to perform in parallel, be it writing N streams to N files or writing N streams to a single file?
>
> Thank you in advance
>
> Stefano
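P.S. For the true-parallel case, here is a minimal sketch of a collective PHDF5 write, assuming an MPI-enabled HDF5 build (compiled with h5pcc) and a parallel file system underneath; the file name, dataset name and the one-row-per-rank layout are just placeholders:

/* minimal collective PHDF5 write: every rank writes its own
 * hyperslab of one shared dataset in one shared file */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* open one shared file with the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one row per rank */
    hsize_t dims[2] = { (hsize_t)nprocs, 1024 };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* each rank selects its own row in the file */
    hsize_t start[2] = { (hsize_t)rank, 0 };
    hsize_t count[2] = { 1, 1024 };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    double buf[1024];
    for (int i = 0; i < 1024; i++) buf[i] = rank + i * 0.001;

    /* collective transfer: all ranks take part in the same write call */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}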
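For the single-machine case, this is a rough sketch of the separate-writer-process idea with ZeroMQ and a plain serial HDF5 build; the endpoint, the fixed block of 1024 doubles and the dataset name are my assumptions (a real setup would frame the payload with protocol buffers or similar), and here the writer simply appends each received block to one extendible, chunked dataset:

/* single writer process: pulls blocks from a ZeroMQ queue and is the
 * only process that ever touches the HDF5 file */
#include <zmq.h>
#include <hdf5.h>

#define BLOCK 1024

int main(void)
{
    void *ctx  = zmq_ctx_new();
    void *pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "tcp://*:5555");      /* analysis workers connect here */

    /* one chunked, extendible 1-D dataset that the writer appends to */
    hid_t file = H5Fcreate("collected.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = BLOCK;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    hid_t dset  = H5Dcreate2(file, "stream", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    double buf[BLOCK];
    for (;;) {
        /* workers are assumed to send whole blocks; an empty message
         * can serve as a shutdown sentinel */
        int n = zmq_recv(pull, buf, sizeof(buf), 0);
        if (n <= 0) break;

        /* grow the dataset by one block and write into the new tail */
        hsize_t offset = dims;
        dims += BLOCK;
        H5Dset_extent(dset, &dims);
        hid_t fspace = H5Dget_space(dset);
        hsize_t count = BLOCK;
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
        hid_t mspace = H5Screate_simple(1, &count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace);
        H5Sclose(fspace);
    }

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    zmq_close(pull); zmq_ctx_destroy(ctx);
    return 0;
}

The analysis processes then only need a PUSH socket connected to the same endpoint; they never open the HDF5 file themselves, so the serial HDF5 library and all its filters stay available.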
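And for variant 2) of the separate-files approach, external links are one way to stitch the per-process files into a single logical image; the per-rank file names, their count and the /data path below are assumptions:

/* build one "master" file whose groups resolve into the per-rank files */
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    const int nfiles = 4;   /* assumed number of per-process files */
    hid_t master = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    for (int i = 0; i < nfiles; i++) {
        char target[64], link[64];
        snprintf(target, sizeof(target), "rank-%03d.h5", i);
        snprintf(link, sizeof(link), "rank_%03d", i);
        /* master.h5:/rank_NNN resolves to rank-NNN.h5:/data on access */
        H5Lcreate_external(target, "/data", master, link,
                           H5P_DEFAULT, H5P_DEFAULT);
    }

    H5Fclose(master);
    return 0;
}

Variant 1), the copy/collect script, can do much the same with H5Ocopy() (or the h5copy command-line tool) on each per-rank file once the MPI job has finished, leaving you with one self-contained container instead of a file that merely points at the others.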
