Hello, the requirements for PHDF5 are here: <https://support.hdfgroup.org/HDF5/PHDF5/>. It may be a good idea to check whether there is a speed-up from a parallel FS + PHDF5 setup.
In my interpretation, PHDF5 pays off when you have a fully parallel system backed by a parallel file system that is capable of handling I/O in parallel: large supercomputing (batch) environments are like that. At the other end of the spectrum you can have a single-computer, single-drive system with multiple cores; AWS EC2 instances without a local HDD are like that. In the latter case, using PHDF5 pulls you into extra lines of code and some restrictions (no filters, ...), because, as you see, at some choke point there must be a mechanism to serialise all the READ/WRITE operations anyway. If you have that latter setup, a separate writer process plus a reliable software fabric (i.e. ZeroMQ + protocol buffers, or a similar queue) gets you the result.

There is also another approach: write into separate files on a local or in-memory file system, and then either 1) copy all files into one single HDF5 container, or 2) use a separate HDF5 file that links the external files into a single image. The copy/collect version works on batch processors if your 'collector' script is scheduled after the MPI job. Of course, if you do have a true parallel environment, you should indeed benefit from parallel I/O. Rough sketches of the three variants are appended below the quoted message.

best,
steve

On Mon, Feb 19, 2018 at 3:10 AM, Stefano Salvadè <[email protected]> wrote:

> Good morning everyone,
>
> I've recently started using parallel HDF5 for my company, as we wish to save analysed data on multiple files at a time. It would be an N:N case, with an output stream for each file.
>
> The main program itself is written in C#, but we already have an API that allows us to make calls to HDF5 and MPI in C and C++. It retrieves data from an external device, executes some analysis and then saves the data, and parallelizing these three parts would speed up the process. However, I'm not quite sure how to implement such parallelization for the third part.
>
> So far I've seen that parallelization is usually implemented right off the bat: the program is started with mpiexec (I'm on Windows) with a specified number of processes (like "mpiexec -n x Program.exe"). Unfortunately, running multiple instances of the whole program in parallel would be problematic, but I've seen that one should be able to spawn processes later during runtime with MPI_Comm_spawn(), indicating an executable as a target (provided that the "main" process, the program itself, has been started with "mpiexec -n 1 Program.exe", for example).
>
> This second method could work for us, but I was wondering if there is a more elegant way to achieve parallel output writing, like calling a function from my own program instead of an executable.
>
> Bonus question, just to make sure I've got the basics of PHDF5 right in the first place: do I need a process for each action that I want to perform in parallel, be it writing N streams to N files or writing N streams to a single file?
>
> Thank you in advance
>
> Stefano
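P.S. For the true-parallel case, here is a minimal sketch of a collective PHDF5 write, assuming an MPI-enabled HDF5 build (compiled with h5pcc) and a parallel file system underneath; the file name, dataset name and the one-row-per-rank layout are just placeholders:

/* minimal collective PHDF5 write: every rank writes its own
 * hyperslab of one shared dataset in one shared file */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* open one shared file with the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one row per rank */
    hsize_t dims[2] = { (hsize_t)nprocs, 1024 };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* each rank selects its own row in the file */
    hsize_t start[2] = { (hsize_t)rank, 0 };
    hsize_t count[2] = { 1, 1024 };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    double buf[1024];
    for (int i = 0; i < 1024; i++) buf[i] = rank + i * 0.001;

    /* collective transfer: all ranks take part in the same write call */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}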
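For the single-machine case, this is a rough sketch of the separate-writer-process idea with ZeroMQ and a plain serial HDF5 build; the endpoint, the fixed block of 1024 doubles and the dataset name are my assumptions (a real setup would frame the payload with protocol buffers or similar), and here the writer simply appends each received block to one extendible, chunked dataset:

/* single writer process: pulls blocks from a ZeroMQ queue and is the
 * only process that ever touches the HDF5 file */
#include <zmq.h>
#include <hdf5.h>

#define BLOCK 1024

int main(void)
{
    void *ctx  = zmq_ctx_new();
    void *pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "tcp://*:5555");      /* analysis workers connect here */

    /* one chunked, extendible 1-D dataset that the writer appends to */
    hid_t file = H5Fcreate("collected.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = BLOCK;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    hid_t dset  = H5Dcreate2(file, "stream", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    double buf[BLOCK];
    for (;;) {
        /* workers are assumed to send whole blocks; an empty message
         * can serve as a shutdown sentinel */
        int n = zmq_recv(pull, buf, sizeof(buf), 0);
        if (n <= 0) break;

        /* grow the dataset by one block and write into the new tail */
        hsize_t offset = dims;
        dims += BLOCK;
        H5Dset_extent(dset, &dims);
        hid_t fspace = H5Dget_space(dset);
        hsize_t count = BLOCK;
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
        hid_t mspace = H5Screate_simple(1, &count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace);
        H5Sclose(fspace);
    }

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    zmq_close(pull); zmq_ctx_destroy(ctx);
    return 0;
}

The analysis processes then only need a PUSH socket connected to the same endpoint; they never open the HDF5 file themselves, so the serial HDF5 library and all its filters stay available.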
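And for variant 2) of the separate-files approach, external links are one way to stitch the per-process files into a single logical image; the per-rank file names, their count and the /data path below are assumptions:

/* build one "master" file whose groups resolve into the per-rank files */
#include <hdf5.h>
#include <stdio.h>

int main(void)
{
    const int nfiles = 4;   /* assumed number of per-process files */
    hid_t master = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    for (int i = 0; i < nfiles; i++) {
        char target[64], link[64];
        snprintf(target, sizeof(target), "rank-%03d.h5", i);
        snprintf(link, sizeof(link), "rank_%03d", i);
        /* master.h5:/rank_NNN resolves to rank-NNN.h5:/data on access */
        H5Lcreate_external(target, "/data", master, link,
                           H5P_DEFAULT, H5P_DEFAULT);
    }

    H5Fclose(master);
    return 0;
}

Variant 1), the copy/collect script, can do much the same with H5Ocopy() (or the h5copy command-line tool) on each per-rank file once the MPI job has finished, leaving you with one self-contained container instead of a file that merely points at the others.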
