Re: [OMPI users] mca_sharedfp_lockfile issues
What file system are you running your code on? And is the same directory shared across all nodes? I have seen this error when users try to use a non-shared directory for MPI I/O operations (e.g. /tmp, which is a different drive/folder on each node).

Thanks
Edgar

-Original Message-
From: users On Behalf Of bend linux4ms.net via users
Sent: Tuesday, November 2, 2021 3:33 PM
To: Open MPI Open MPI
Cc: bend linux4ms.net
Subject: [OMPI users] mca_sharedfp_lockfile issues

Ok, I got more issues. Maybe someone on the list can help me:

Open MPI version: 4.1.1, downloaded from github source
Compiled on CentOS 8.4 using GCC 8.4.1

Configured as:

./configure --enable-shared --enable-static \
  --without-tm \
  --enable-mpi-cxx \
  --enable-wrapper-runpath \
  --enable-mpirun-prefix-by-default \
  --enable-mpi-thread-multiple \
  --enable-mpi-fortran=yes \
  --prefix=/p/app/compilers/mpi/openmpi/4.1.1 2>&1 \
  | tee config.log

Intel HPC system, 850 nodes, trying to launch the IOR benchmark. Top portion of the mpi command:

-
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"

mpirun -machinefile ${hostlist} \
  --mca opal_common_ucx_opal_mem_hooks 1 \
  -np ${NP} \
  --map-by node \
  -N ${rpn} \
  -vv \
-

I am getting the message

[##] mca_sharedfp_lockedfile_file_open : Error during file open

on all the nodes. I've tried it with --mca sharedfp lockedfile and without, and I still get the errors.

What have I done wrong?

Thanks ..

Ben Duncan -
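A minimal sketch along the lines of Edgar's question: first verify on every node that the output directory is on the same shared mount, then, if the application never uses shared file pointers, exclude the lockedfile component while debugging. The variable `OUTDIR` and the component-exclusion step are assumptions for illustration, not the diagnosed fix.

```shell
# Hypothetical output directory; substitute the path your IOR run writes to.
OUTDIR=${OUTDIR:-$PWD}

# Run this on every node: all nodes should report the same shared mount
# (nfs, lustre, gpfs, ...), not a node-local one like /tmp.
df -T "$OUTDIR" | tail -1

# If the code does not use shared file pointer operations at all,
# the lockedfile sharedfp component can be excluded while investigating:
export OMPI_MCA_sharedfp="^lockedfile"
echo "sharedfp selection: $OMPI_MCA_sharedfp"
```

The `^` prefix is standard MCA component-exclusion syntax; the same effect can be had with `mpirun --mca sharedfp ^lockedfile ...`.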
Re: [OMPI users] Status of pNFS, CephFS and MPI I/O
let me amend my last email by making clear that I do not recommend using NFS for parallel I/O. But if you have to, make sure your code does not do things like read-after-write, or have multiple processes writing data that ends up in the same file system block (this can often be avoided by using collective I/O, for example).

-Original Message-
From: users On Behalf Of Gabriel, Edgar via users
Sent: Thursday, September 23, 2021 5:31 PM
To: Eric Chamberland ; Open MPI Users
Cc: Gabriel, Edgar ; Louis Poirel ; Vivien Clauzon
Subject: Re: [OMPI users] Status of pNFS, CephFS and MPI I/O

-Original Message-
From: Eric Chamberland

Thanks for your answer Edgar! In fact, we are able to use NFS and certainly any POSIX file system on a single-node basis. I should have been asking: what are the supported file systems for *multiple node* read/write access to files?

-> We have tested it on BeeGFS, GPFS, Lustre, PVFS2/OrangeFS, and NFS, but again, if a parallel file system has POSIX functions we would expect it to work (and yes, I am aware that strict POSIX semantics are not necessarily available in parallel file systems. Internally, we are using open, close, (p)readv, (p)writev, lock, unlock, seek).

For NFS, MPI I/O is known to *not* work when using multiple nodes ... except for NFS v3 with the "noac" mount option (we are about to test with the "actimeo=0" option to see if it works).

-> well, it depends on how you define that. I would suspect that our largest user base is actually using NFS v3/v4 with the noac option + an nfslock server. Most I/O patterns will probably work in this scenario, and in fact we are passing our entire testsuite on a multi-node NFS setup (which does some nasty things). However, it is true that there could be corner cases that fail. In addition, parallel I/O on multi-node NFS can be outrageously slow, since we lock the *entire* file before every operation (in contrast to ROMIO, which only locks the file range that is currently being accessed).
Btw, does OpenMPI MPI I/O have some "hidden" (mca?) options to make a multi-node NFS cluster work?

-> OMPIO recognizes the NFS file system automatically, without requiring an mca parameter. I usually recommend that users try relaxing the locking options and check whether they still produce correct data, in order to improve the performance of their code, since most I/O patterns do not require this super-strict locking behavior. This is the fs_ufs_lock_algorithm parameter.

Thanks
Edgar

Thanks, Eric

On 2021-09-23 1:57 p.m., Gabriel, Edgar wrote:
> Eric,
>
> generally speaking, ompio should be able to operate correctly on all file systems that have support for POSIX functions. The generic ufs component is for example being used on BeeGFS parallel file systems without problems; we are using that on a daily basis. For GPFS, the only reason we handle that file system separately is because of some custom info objects that can be used to configure the file during file_open. If one did not use these info objects, the generic ufs component would be as good as the GPFS-specific component.
>
> Note, the generic ufs component is also being used for NFS; it has logic built in to recognize an NFS file system and handle some operations slightly differently (but still relying on POSIX functions). The one big exception is Lustre: due to its different file locking strategy we are required to use a different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs would work on Lustre, too, but it would be horribly slow.
>
> I cannot comment on CephFS and pNFS since I do not have access to those file systems; it would come down to testing them.
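A sketch of trying Edgar's suggestion: relax ompio's NFS locking via fs_ufs_lock_algorithm and then re-check that the output is still correct. The values used here (1 = disable locking, 3 = lock only the accessed ranges, similar to ROMIO) come from Edgar's description elsewhere in this thread; `./my_io_test` is a placeholder application name.

```shell
# Try the range-locking middle ground first; 1 disables locking entirely
# and will likely be fastest, but is also the riskiest for correctness.
export OMPI_MCA_fs_ufs_lock_algorithm=3
echo "fs_ufs_lock_algorithm=$OMPI_MCA_fs_ufs_lock_algorithm"

# Equivalent command-line form (shown only, not run here):
# mpirun -np 16 --mca fs_ufs_lock_algorithm 3 ./my_io_test
```

After each run with a relaxed setting, the output files should be compared against a known-good reference before adopting the setting in production.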
>
> Thanks
> Edgar
>
> -Original Message-
> From: users On Behalf Of Eric Chamberland via users
> Sent: Thursday, September 23, 2021 9:28 AM
> To: Open MPI Users
> Cc: Eric Chamberland ; Vivien Clauzon
> Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O
>
> Hi,
>
> I am looking around for information about parallel filesystems supported for MPI I/O.
>
> Clearly, GPFS and Lustre are fully supported, but what about others?
>
> - CephFS
> - pNFS
> - Other?
>
> when I "grep" for "pnfs\|cephfs" in the ompi source code, I find nothing...
>
> Otherwise I found this in ompi/mca/common/ompio/common_ompio.h :
>
> enum ompio_fs_type
> {
>     NONE = 0,
>     UFS = 1,
>     PVFS2 = 2,
>     LUSTRE = 3,
>     PLFS = 4,
>     IME = 5,
>     GPFS = 6
> };
>
> Does that mean that other fs types (pNFS, CephFS) do not need special treatment, or are not supported or not optimally supported?
>
> Thanks,
>
> Eric
>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
>

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
Re: [OMPI users] Status of pNFS, CephFS and MPI I/O
-Original Message-
From: Eric Chamberland

Thanks for your answer Edgar! In fact, we are able to use NFS and certainly any POSIX file system on a single-node basis. I should have been asking: what are the supported file systems for *multiple node* read/write access to files?

-> We have tested it on BeeGFS, GPFS, Lustre, PVFS2/OrangeFS, and NFS, but again, if a parallel file system has POSIX functions we would expect it to work (and yes, I am aware that strict POSIX semantics are not necessarily available in parallel file systems. Internally, we are using open, close, (p)readv, (p)writev, lock, unlock, seek).

For NFS, MPI I/O is known to *not* work when using multiple nodes ... except for NFS v3 with the "noac" mount option (we are about to test with the "actimeo=0" option to see if it works).

-> well, it depends on how you define that. I would suspect that our largest user base is actually using NFS v3/v4 with the noac option + an nfslock server. Most I/O patterns will probably work in this scenario, and in fact we are passing our entire testsuite on a multi-node NFS setup (which does some nasty things). However, it is true that there could be corner cases that fail. In addition, parallel I/O on multi-node NFS can be outrageously slow, since we lock the *entire* file before every operation (in contrast to ROMIO, which only locks the file range that is currently being accessed).

Btw, does OpenMPI MPI I/O have some "hidden" (mca?) options to make a multi-node NFS cluster work?

-> OMPIO recognizes the NFS file system automatically, without requiring an mca parameter. I usually recommend that users try relaxing the locking options and check whether they still produce correct data, in order to improve the performance of their code, since most I/O patterns do not require this super-strict locking behavior. This is the fs_ufs_lock_algorithm parameter.
Thanks
Edgar

Thanks, Eric

On 2021-09-23 1:57 p.m., Gabriel, Edgar wrote:
> Eric,
>
> generally speaking, ompio should be able to operate correctly on all file systems that have support for POSIX functions. The generic ufs component is for example being used on BeeGFS parallel file systems without problems; we are using that on a daily basis. For GPFS, the only reason we handle that file system separately is because of some custom info objects that can be used to configure the file during file_open. If one did not use these info objects, the generic ufs component would be as good as the GPFS-specific component.
>
> Note, the generic ufs component is also being used for NFS; it has logic built in to recognize an NFS file system and handle some operations slightly differently (but still relying on POSIX functions). The one big exception is Lustre: due to its different file locking strategy we are required to use a different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs would work on Lustre, too, but it would be horribly slow.
>
> I cannot comment on CephFS and pNFS since I do not have access to those file systems; it would come down to testing them.
>
> Thanks
> Edgar
>
> -Original Message-
> From: users On Behalf Of Eric Chamberland via users
> Sent: Thursday, September 23, 2021 9:28 AM
> To: Open MPI Users
> Cc: Eric Chamberland ; Vivien Clauzon
> Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O
>
> Hi,
>
> I am looking around for information about parallel filesystems supported for MPI I/O.
>
> Clearly, GPFS and Lustre are fully supported, but what about others?
>
> - CephFS
> - pNFS
> - Other?
>
> when I "grep" for "pnfs\|cephfs" in the ompi source code, I find nothing...
>
> Otherwise I found this in ompi/mca/common/ompio/common_ompio.h :
>
> enum ompio_fs_type
> {
>     NONE = 0,
>     UFS = 1,
>     PVFS2 = 2,
>     LUSTRE = 3,
>     PLFS = 4,
>     IME = 5,
>     GPFS = 6
> };
>
> Does that mean that other fs types (pNFS, CephFS) do not need special treatment, or are not supported or not optimally supported?
>
> Thanks,
>
> Eric
>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
>

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
Re: [OMPI users] Status of pNFS, CephFS and MPI I/O
Eric,

generally speaking, ompio should be able to operate correctly on all file systems that have support for POSIX functions. The generic ufs component is for example being used on BeeGFS parallel file systems without problems; we are using that on a daily basis. For GPFS, the only reason we handle that file system separately is because of some custom info objects that can be used to configure the file during file_open. If one did not use these info objects, the generic ufs component would be as good as the GPFS-specific component.

Note, the generic ufs component is also being used for NFS; it has logic built in to recognize an NFS file system and handle some operations slightly differently (but still relying on POSIX functions). The one big exception is Lustre: due to its different file locking strategy we are required to use a different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs would work on Lustre, too, but it would be horribly slow.

I cannot comment on CephFS and pNFS since I do not have access to those file systems; it would come down to testing them.

Thanks
Edgar

-Original Message-
From: users On Behalf Of Eric Chamberland via users
Sent: Thursday, September 23, 2021 9:28 AM
To: Open MPI Users
Cc: Eric Chamberland ; Vivien Clauzon
Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O

Hi,

I am looking around for information about parallel filesystems supported for MPI I/O.

Clearly, GPFS and Lustre are fully supported, but what about others?

- CephFS
- pNFS
- Other?

when I "grep" for "pnfs\|cephfs" in the ompi source code, I find nothing...

Otherwise I found this in ompi/mca/common/ompio/common_ompio.h :

enum ompio_fs_type
{
    NONE = 0,
    UFS = 1,
    PVFS2 = 2,
    LUSTRE = 3,
    PLFS = 4,
    IME = 5,
    GPFS = 6
};

Does that mean that other fs types (pNFS, CephFS) do not need special treatment, or are not supported or not optimally supported?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
Re: [OMPI users] 4.1 mpi-io test failures on lustre
ok, so what I get from this conversation is the following todo list:

1. check out the tests in src/mpi/romio/test
2. revisit the atomicity issue. You are right that there are scenarios where it might be required; the fact that we were not able to hit the issues in our tests is no evidence.
3. will work on an update of the FAQ section.

-Original Message-
From: users On Behalf Of Dave Love via users
Sent: Monday, January 18, 2021 11:14 AM
To: Gabriel, Edgar via users
Cc: Dave Love
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

"Gabriel, Edgar via users" writes:

>> How should we know that's expected to fail? It at least shouldn't fail like that; set_atomicity doesn't return an error (which the test is prepared for on a filesystem like pvfs2).
>> I assume doing nothing, but appearing to, can lead to corrupt data, and I'm surprised that isn't being seen already.
>> HDF5 requires atomicity -- at least to pass its tests -- so presumably anyone like us who needs it should use something mpich-based with recent or old romio, and that sounds like most general HPC systems.
>> Am I missing something?
>> With the current romio everything I tried worked, but we don't get that option with openmpi.
>
> First of all, it is mentioned on the FAQ sites of Open MPI, although
> admittedly it is not entirely up to date (it lists external32 support also
> as missing, which is however now available since 4.1).

Yes, the FAQ was full of confusing obsolete material when I last looked. Anyway, users can't be expected to check whether any particular operation is expected to fail silently. I should have said that MPI_File_set_atomicity(3) explicitly says the default is true for multiple nodes, and doesn't say the call is a no-op with the default implementation. I don't know whether the MPI spec allows not implementing it, but I would at least expect an error return if it doesn't.
As far as I remember, that's what romio does on a filesystem like pvfs2 (or lustre when people know better than the implementers and insist on noflock); I mis-remembered from before, thinking that ompio would be changed to do the same. From that thread, I did think atomicity was on its way. Presumably an application requests atomicity for good reason, and can take appropriate action if the status indicates it's not available on that filesystem.

> You don't need atomicity for the HDF5 tests, we are passing all of them to the best of my knowledge, and this is one of the testsuites that we do run regularly as part of our standard testing process.

I guess we're just better at breaking things.

> I am aware that they have an atomicity test - which we pass for whatever reason. This also highlights, btw, the issue(s) that I am having with the atomicity option in MPI I/O.

I don't know what the application is of atomicity in HDF5. Maybe it isn't required for typical operations, but I assume it's not used blithely. However, I'd have thought HDF5 should be prepared for something like pvfs2, and at least not abort the test at that stage. I've learned to be wary of declaring concurrent systems working after a few tests. In fact, the phdf5 test failed for me like this when I tried across four lustre client nodes with 4.1's defaults. (I'm confused about the striping involved, because I thought I set it to four, and now it shows as one on that directory.)

...
Testing -- dataset atomic updates (atomicity)
Proc 9: *** Parallel ERRProc 54: *** Parallel ERROR ***
VRFY (H5Sset_hyperslab succeeded) failed at line 4293 in t_dset.c aborting MPI proceProc 53: *** Parallel ERROR ***

Unfortunately I hadn't turned on backtracing, and I wouldn't get another job through for a while.
> The entire infrastructure to enforce atomicity is actually in place in ompio, and I can give you the option on how to enforce strict atomic behavior for all files in ompio (just not on a per-file basis), just be aware that the performance will nose-dive. This is not just the case with ompio, but also with romio; you can read up on various discussion boards on that topic, look at NFS-related posts (where you need the atomicity for correctness in basically all scenarios).

I'm fairly sure I accidentally ran tests successfully on NFS4, at least single-node. I never found a good discussion of the topic, and what I have seen about "NFS" was probably specific to NFS3 and non-POSIX compliance, though I don't actually care about parallel i/o on NFS. The information we got about lustre was direct from Rob Latham, as nothing showed up online. I don't like fast-but-wrong, so I think there should be the option of correctness, especially as it's the documented default.

> Just as another data point, in the 8+ years that ompio has been available, there was not one issue reported related to correctness due to missing the atomicity option.
Re: [OMPI users] 4.1 mpi-io test failures on lustre
I would like to correct one of my statements:

-Original Message-
From: users On Behalf Of Gabriel, Edgar via users
Sent: Friday, January 15, 2021 7:58 AM
To: Open MPI Users
Cc: Gabriel, Edgar
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

> The entire infrastructure to enforce atomicity is actually in place in ompio, and I can give you the option on how to enforce strict atomic behavior for all files in ompio (just not on a per-file basis), just be aware that the performance will nose-dive.

I realized that this statement is not entirely true; we are missing one aspect for being able to provide full atomicity.

Thanks
Edgar
Re: [OMPI users] 4.1 mpi-io test failures on lustre
-Original Message-
From: users On Behalf Of Dave Love via users
Sent: Friday, January 15, 2021 4:48 AM
To: Gabriel, Edgar via users
Cc: Dave Love
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

> How should we know that's expected to fail? It at least shouldn't fail like that; set_atomicity doesn't return an error (which the test is prepared for on a filesystem like pvfs2).
> I assume doing nothing, but appearing to, can lead to corrupt data, and I'm surprised that isn't being seen already.
> HDF5 requires atomicity -- at least to pass its tests -- so presumably anyone like us who needs it should use something mpich-based with recent or old romio, and that sounds like most general HPC systems.
> Am I missing something?
> With the current romio everything I tried worked, but we don't get that option with openmpi.

First of all, it is mentioned on the FAQ pages of Open MPI, although admittedly they are not entirely up to date (they list external32 support as missing, which is however now available since 4.1).

You don't need atomicity for the HDF5 tests; we are passing all of them to the best of my knowledge, and this is one of the testsuites that we run regularly as part of our standard testing process. I am aware that they have an atomicity test - which we pass, for whatever reason. This also highlights, btw, the issue(s) that I am having with the atomicity option in MPI I/O.

The entire infrastructure to enforce atomicity is actually in place in ompio, and I can give you the option to enforce strict atomic behavior for all files in ompio (just not on a per-file basis); just be aware that the performance will nose-dive. This is not just the case with ompio, but also with romio; you can read up on various discussion boards on that topic, look at NFS-related posts (where you need the atomicity for correctness in basically all scenarios).
Just as another data point: in the 8+ years that ompio has been available, there was not one issue reported related to correctness due to missing the atomicity option. That being said, if you feel more comfortable using romio, it is completely up to you. Open MPI offers this option, and it is incredibly easy to set the default parameters on a platform for all users such that romio is being used. With our limited resources we are doing the best we can, and while ompio is by no means perfect, we try to be responsive to issues reported by users and value constructive feedback and discussion.

Thanks
Edgar
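A sketch of what Edgar describes as "incredibly easy": selecting ROMIO as the platform-wide default MPI-IO component via the MCA parameter file. The file path used here is a local placeholder, and the component name `romio321` (as shipped in the Open MPI 4.x series) is an assumption to verify against `ompi_info` on your installation.

```shell
# Normally this file lives at $prefix/etc/openmpi-mca-params.conf;
# a local path is used here purely for illustration.
PARAMFILE=${PARAMFILE:-./openmpi-mca-params.conf}

# Make ROMIO the default io component for every user of this installation.
echo "io = romio321" >> "$PARAMFILE"
grep "^io" "$PARAMFILE"
```

Individual users can still override the site default per run with `mpirun --mca io ompio ...` or the `OMPI_MCA_io` environment variable.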
Re: [OMPI users] 4.1 mpi-io test failures on lustre
I will have a look at those tests. The recent fixes were not correctness but performance fixes. Nevertheless, we used to pass the mpich tests, but I admit that it is not a testsuite that we run regularly; I will have a look at them. The atomicity tests are expected to fail, since this is the one chapter of MPI I/O that is not implemented in ompio.

Thanks
Edgar

-Original Message-
From: users On Behalf Of Dave Love via users
Sent: Thursday, January 14, 2021 5:46 AM
To: users@lists.open-mpi.org
Cc: Dave Love
Subject: [OMPI users] 4.1 mpi-io test failures on lustre

I tried mpi-io tests from mpich 3.4 with openmpi 4.1 on the ac922 system that I understand was used to fix ompio problems on lustre. I'm puzzled that I still see failures. I don't know why there are disjoint sets in mpich's test/mpi/io and src/mpi/romio/test, but I ran all the non-Fortran ones with MCA io defaults across two nodes.

In src/mpi/romio/test, atomicity failed (ignoring error and syshints); in test/mpi/io, the failures were setviewcur, tst_fileview, external32_derived_dtype, i_bigtype, and i_setviewcur. tst_fileview was probably killed by the 100s timeout. It may be that some are only appropriate for romio, but no-one said so before, and they presumably shouldn't segv or report libc errors.

I built against ucx 1.9 with cuda support. I realize that has problems on ppc64le, with no action on the issue, but there's a limit to what I can do. cuda looks relevant since one test crashes while apparently trying to register cuda memory; that's presumably not ompio's fault, but we need cuda.
Re: [OMPI users] Parallel HDF5 low performance
the reasons for potential performance issues on NFS are very different from Lustre. Basically, depending on your use-case and the NFS configuration, you have to enforce different locking policies to ensure correct output files. The default value chosen for ompio is the most conservative setting, since this was the only setting that we found that would result in a correct output file for all of our tests. You can change the settings to see whether other options work for you. The parameter that you need to work with is fs_ufs_lock_algorithm. Setting it to 1 will completely disable locking (and most likely lead to the best performance); setting it to 3 is a middle ground (lock specific ranges) and similar to what ROMIO does. So e.g.

mpiexec -n 16 --mca fs_ufs_lock_algorithm 1 ./mytests

That being said, if you google NFS + MPI I/O, you will find a ton of documents and reasons for potential problems, so using MPI I/O on top of NFS (whether OMPIO or ROMIO) is always at your own risk.

Thanks
Edgar

-Original Message-
From: users On Behalf Of Gilles Gouaillardet via users
Sent: Thursday, December 3, 2020 4:46 AM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] Parallel HDF5 low performance

Patrick,

glad to hear you will upgrade Open MPI thanks to this workaround!

ompio has known performance issues on Lustre (this is why ROMIO is still the default on this filesystem), but I do not remember such performance issues having been reported on an NFS filesystem.

Sharing a reproducer would be very much appreciated in order to improve ompio.

Cheers,

Gilles

On Thu, Dec 3, 2020 at 6:05 PM Patrick Bégou via users wrote:
>
> Thanks Gilles,
>
> this is the solution.
> I will set OMPI_MCA_io=^ompio automatically when loading the parallel hdf5 module on the cluster.
>
> I was tracking this problem for several weeks but not looking in the right direction (testing NFS server I/O, network bandwidth...)
>
> I think we will now move definitively to modern OpenMPI implementations.
>
> Patrick
>
> On 03/12/2020 at 09:06, Gilles Gouaillardet via users wrote:
> > Patrick,
> >
> > In recent Open MPI releases, the default component for MPI-IO is ompio (and no longer romio), unless the file is on a Lustre filesystem.
> >
> > You can force romio with
> >
> > mpirun --mca io ^ompio ...
> >
> > Cheers,
> >
> > Gilles
> >
> > On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
> >> Hi,
> >>
> >> I'm using an old (but required by the codes) version of hdf5 (1.8.12) in parallel mode in 2 fortran applications. It relies on MPI/IO. The storage is NFS mounted on the nodes of a small cluster.
> >>
> >> With OpenMPI 1.7 it runs fine, but using modern OpenMPI 3.1 or 4.0.5 the I/Os are 10x to 100x slower. Are there fundamental changes in MPI/IO for these new releases of OpenMPI, and a solution to get back to the I/O performance with this parallel HDF5 release?
> >>
> >> Thanks for your advice
> >>
> >> Patrick
> >>
>
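The two equivalent ways of forcing ROMIO that come up in this thread can be sketched as follows; `./my_hdf5_app` is a placeholder application name.

```shell
# Environment-variable form: handy inside a module file, as Patrick plans
# to do when loading the parallel hdf5 module.
export OMPI_MCA_io=^ompio      # exclude ompio, so romio is selected
echo "io selection: $OMPI_MCA_io"

# Command-line form (shown only, not run here):
# mpirun --mca io ^ompio -np 4 ./my_hdf5_app
```

Both forms use the same MCA exclusion syntax; the environment variable has the advantage of applying to every mpirun the user issues while the module is loaded.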
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
I will have a look at the t_bigio tests on Lustre with ompio. We had some reports from collaborators about performance problems similar to the one that you mention here (which was the reason we were hesitant to make ompio the default on Lustre), but part of the problem is that we were not able to reproduce it reliably on the systems that we had access to, which makes debugging and fixing the issue very difficult. Lustre is a very unforgiving file system: if you get something wrong with the settings, the performance is not just a bit off, but often orders of magnitude (as in your measurements).

Thanks!
Edgar

-Original Message-
From: users On Behalf Of Mark Dixon via users
Sent: Thursday, November 26, 2020 9:38 AM
To: Dave Love via users
Cc: Mark Dixon ; Dave Love
Subject: Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

On Wed, 25 Nov 2020, Dave Love via users wrote:

>> The perf test says romio performs a bit better. Also -- from overall time -- it's faster on IMB-IO (which I haven't looked at in detail, and ran with suboptimal striping).
>
> I take that back. I can't reproduce a significant difference for total IMB-IO runtime, with both run in parallel on 16 ranks, using either the system default of a single 1MB stripe or using eight stripes. I haven't teased out figures for different operations yet. That must have been done elsewhere, but I've never seen figures.

But remember that IMB-IO doesn't cover everything. For example, hdf5's t_bigio parallel test appears to be a pathological case, and OMPIO is 2 orders of magnitude slower on a Lustre filesystem:

- OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
- OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

End users seem to have the choice of:

- use openmpi 4.x and have some things broken (romio)
- use openmpi 4.x and have some things slow (ompio)
- use openmpi 3.x and everything works

My concern is that openmpi 3.x is near, or at, end of life.
Mark

t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7, Lustre 2.12.5:

[login testpar]$ time mpirun -np 6 ./t_bigio
Testing Dataset1 write by ROW
Testing Dataset2 write by COL
Testing Dataset3 write select ALL proc 0, NONE others
Testing Dataset4 write point selection
Read Testing Dataset1 by COL
Read Testing Dataset2 by ROW
Read Testing Dataset3 read select ALL proc 0, NONE others
Read Testing Dataset4 with Point selection
***Express test mode on. Several tests are skipped

real    0m21.141s
user    2m0.318s
sys     0m3.289s

[login testpar]$ export OMPI_MCA_io=ompio
[login testpar]$ time mpirun -np 6 ./t_bigio
Testing Dataset1 write by ROW
Testing Dataset2 write by COL
Testing Dataset3 write select ALL proc 0, NONE others
Testing Dataset4 write point selection
Read Testing Dataset1 by COL
Read Testing Dataset2 by ROW
Read Testing Dataset3 read select ALL proc 0, NONE others
Read Testing Dataset4 with Point selection
***Express test mode on. Several tests are skipped

real    42m34.103s
user    213m22.925s
sys     8m6.742s
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
hm, I think this sounds like a different issue; somebody who is more invested in the ROMIO Open MPI work should probably have a look.

Regarding compiling Open MPI with Lustre support for ROMIO, I cannot test it right now for various reasons, but if I recall correctly the trick was to provide the --with-lustre option twice, once inside of the "--with-io-romio-flags=" (along with the option that you provided), and once outside (for ompio).

Thanks
Edgar

-Original Message-
From: Mark Dixon
Sent: Monday, November 16, 2020 8:19 AM
To: Gabriel, Edgar via users
Cc: Gabriel, Edgar
Subject: Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

Hi Edgar,

Thanks for this - good to know that ompio is an option, despite the reference to potential performance issues.

I'm using openmpi 4.0.5 with ucx 1.9.0 and see the hdf5 1.10.7 test "testphdf5" timeout (with the timeout set to an hour) using romio. Is it a known issue there, please?

When it times out, the last few lines to be printed are these:

Testing -- multi-chunk collective chunk io (cchunk3)
Testing -- multi-chunk collective chunk io (cchunk3)
Testing -- multi-chunk collective chunk io (cchunk3)
Testing -- multi-chunk collective chunk io (cchunk3)
Testing -- multi-chunk collective chunk io (cchunk3)
Testing -- multi-chunk collective chunk io (cchunk3)

The other thing I note is that openmpi doesn't configure romio's lustre driver, even when given "--with-lustre". Regardless, I see the same result whether or not I add "--with-io-romio-flags=--with-file-system=lustre+ufs"

Cheers,

Mark

On Mon, 16 Nov 2020, Gabriel, Edgar via users wrote:

> this is in theory still correct, the default MPI I/O library used by Open MPI on Lustre file systems is ROMIO in all release versions. That being said, ompio does have support for Lustre as well starting from the 2.1 series, so you can use that as well. The main reason that we did not switch to ompio for Lustre as the default MPI I/O library is a performance issue that can arise under certain circumstances.
>
> Which version of Open MPI are you using? There was a bug fix in the Open MPI to ROMIO integration layer sometime in the 4.0 series that fixed a datatype problem, which caused some problems in the HDF5 tests. You might be hitting that problem.
>
> Thanks
> Edgar
>
> -Original Message-
> From: users On Behalf Of Mark Dixon via users
> Sent: Monday, November 16, 2020 4:32 AM
> To: users@lists.open-mpi.org
> Cc: Mark Dixon
> Subject: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
>
> Hi all,
>
> I'm confused about how openmpi supports mpi-io on Lustre these days, and am hoping that someone can help.
>
> Back in the openmpi 2.0.0 release notes, it said that OMPIO is the default MPI-IO implementation on everything apart from Lustre, where ROMIO is used. Those release notes are pretty old, but it still appears to be true.
>
> However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print warning messages (UCX_LOG_LEVEL=ERROR).
>
> Can I just check: are we still supposed to be using ROMIO?
>
> Thanks,
>
> Mark
>
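A sketch of the configure invocation Edgar recalls, hedged accordingly since he notes he cannot verify it right now: --with-lustre is passed twice, once at the top level (for ompio) and once inside the ROMIO flags alongside the file-system list Mark used. The prefix path is a placeholder; the command is only assembled and printed here, not executed.

```shell
# Hypothetical configure line for Lustre support in both ompio and ROMIO.
CONFIGURE_CMD='./configure --prefix=/opt/openmpi-4.0.5 \
  --with-lustre \
  --with-io-romio-flags="--with-file-system=lustre+ufs --with-lustre"'
echo "$CONFIGURE_CMD"
```

After configuring, `config.log` (or `ompi_info`) can be checked to confirm that both the ompio fs/lustre component and ROMIO's Lustre driver were actually enabled.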
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
this is in theory still correct, the default MPI I/O library used by Open MPI on Lustre file systems is ROMIO in all release versions. That being said, ompio does have support for Lustre as well starting from the 2.1 series, so you can use that as well. The main reason that we did not switch to ompio for Lustre as the default MPI I/O library is a performance issue that can arise under certain circumstances. Which version of Open MPI are you using? There was a bug fix in the Open MPI to ROMIO integration layer sometime in the 4.0 series that fixed a datatype problem, which caused some problems in the HDF5 tests. You might be hitting that problem. Thanks Edgar -Original Message- From: users On Behalf Of Mark Dixon via users Sent: Monday, November 16, 2020 4:32 AM To: users@lists.open-mpi.org Cc: Mark Dixon Subject: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO? Hi all, I'm confused about how openmpi supports mpi-io on Lustre these days, and am hoping that someone can help. Back in the openmpi 2.0.0 release notes, it said that OMPIO is the default MPI-IO implementation on everything apart from Lustre, where ROMIO is used. Those release notes are pretty old, but it still appears to be true. However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print warning messages (UCX_LOG_LEVEL=ERROR). Can I just check: are we still supposed to be using ROMIO? Thanks, Mark
Re: [OMPI users] ompe support for filesystems
the ompio software infrastructure has multiple frameworks.

fs framework: abstracts out file system level operations (open, close, etc.)
fbtl framework: provides the abstractions and implementations of *individual* file I/O operations (seek, read, write, iread, iwrite)
fcoll framework: provides the abstractions and implementations of *collective* file I/O operations (read_all, write_all, etc.)
sharedfp framework: provides the abstractions and implementations of *shared file pointer* file I/O operations (read_shared, write_shared, read_ordered, write_ordered)

Feel free to ping me also directly if you need more assistance. If you are looking for a reference and more explanations, please have a look at the following paper: Mohamad Chaarawi, Edgar Gabriel, Rainer Keller, Richard Graham, George Bosilca and Jack Dongarra, 'OMPIO: A Modular Software Architecture for MPI I/O', in Y. Cotronis, A. Danalis, D. Nikolopoulos, J. Dongarra (Eds.), 'Recent Advances in Message Passing Interface', LNCS vol. 6960, pp. 81-89, Springer, 2011. http://www2.cs.uh.edu/~gabriel/publications/EuroMPI11_OMPIO.pdf Best regards Edgar -Original Message- From: users On Behalf Of Ognen Duzlevski via users Sent: Monday, November 2, 2020 7:54 AM To: Open MPI Users Cc: Ognen Duzlevski Subject: Re: [OMPI users] ompe support for filesystems Gilles, Thank you for replying. I took a look at the code and am curious to understand where the actual read/write/seek etc. operations are implemented. From what I can see/understand, what you pointed me to implements the file open/close etc. operations that pertain to particular filesystems. I then tried to figure out the read/write/seek etc. operations and can see that an MPI_File structure appears to have a f_io_selected_module member, whose v_2_0_0 member seems to have the list of pointers to all the functions dealing with the actual file write/read/seek functionality. Is this correct?
What I would like to figure out is where the actual writes or reads happen (as in the underlying filesystem's implementations). I imagine for some filesystems a write, for example, is not just a simple write call to disk but involves a bit more logic/magic. Thanks! Ognen Gilles Gouaillardet via users writes: > Hi Ognen, > > MPI-IO is implemented by two components: > - ROMIO (from MPICH) > - ompio ("native" Open MPI MPI-IO, default component unless running > on Lustre) > > Assuming you want to add support for a new filesystem in ompio, the first > step is to implement a new component in the fs framework. The framework > is in /ompi/mca/fs, and each component is in its own directory (for > example ompi/mca/fs/gpfs). > > There are some configury tricks (create a configure.m4, add the Makefile > to autoconf, ...) to make sure your component is even compiled. > If you are struggling with these, feel free to open a Pull Request to > get some help fixing the missing bits. > > Cheers, > > Gilles > > On Sun, Nov 1, 2020 at 12:18 PM Ognen Duzlevski via users > wrote: >> >> Hello! >> >> If I wanted to support a specific filesystem in open mpi, how is this >> done? What code in the source tree does it? >> >> Thanks! >> Ognen
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
Your code looks correct, and based on your output I would actually suspect that the I/O part finished correctly; the error message that you see is not an I/O error, but comes from the btl (which is communication related). What version of Open MPI are you using, and on what file system? Thanks Edgar -Original Message- From: users On Behalf Of Stephen Siegel via users Sent: Friday, June 5, 2020 5:35 PM To: users@lists.open-mpi.org Cc: Stephen Siegel Subject: [OMPI users] MPI I/O question using MPI_File_write_shared I posted this question on StackOverflow and someone suggested I write to the Open MPI community. https://stackoverflow.com/questions/62223698/mpi-i-o-why-does-my-program-hang-or-misbehave-when-one-process-writes-using-mpi Below is a little MPI program. It is a simple use of MPI I/O. Process 0 writes an int to the file using MPI_File_write_shared; no other process writes anything. It works correctly using an MPICH installation, but on two different machines using Open MPI, it either hangs in the middle of the call to MPI_File_write_shared, or it reports an error at the end. Not sure if it is my misunderstanding of the MPI Standard or a bug or configuration problem with my Open MPI. Thanks in advance if anyone can look at it, Steve

#include <mpi.h>
#include <stdio.h>
#include <assert.h>

int nprocs, rank;

int main() {
  MPI_File fh;
  int err, count;
  MPI_Status status;
  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  err = MPI_File_open(MPI_COMM_WORLD, "io_byte_shared.tmp",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  assert(err == 0);
  err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
  assert(err == 0);
  printf("Proc %d: file has been opened.\n", rank);
  fflush(stdout);
  // Proc 0 only writes header using shared file pointer...
  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) {
    int x = 0; /* header value; the original literal was lost in the archive */
    printf("Proc 0: About to write to file.\n");
    fflush(stdout);
    err = MPI_File_write_shared(fh, &x, 1, MPI_INT, &status);
    printf("Proc 0: Finished writing.\n");
    fflush(stdout);
    assert(err == 0);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Proc %d: about to close file.\n", rank);
  fflush(stdout);
  err = MPI_File_close(&fh);
  assert(err == 0);
  MPI_Finalize();
}

Example run:

$ mpicc io_byte_shared.c
$ mpiexec -n 4 ./a.out
Proc 0: file has been opened.
Proc 0: About to write to file.
Proc 0: Finished writing.
Proc 1: file has been opened.
Proc 2: file has been opened.
Proc 3: file has been opened.
Proc 0: about to close file.
Proc 1: about to close file.
Proc 2: about to close file.
Proc 3: about to close file.
[ilyich:12946] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ilyich:12946] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Re: [OMPI users] Slow collective MPI File IO
The one test that would give you a good idea of the upper bound for your scenario would be to write a benchmark where each process writes to a separate file, and look at the overall bandwidth achieved across all processes. The MPI I/O performance will be less than or equal to the bandwidth achieved in this scenario, as long as the number of processes is moderate. Thanks Edgar From: Dong-In Kang Sent: Monday, April 6, 2020 9:34 AM To: Collin Strassburger Cc: Open MPI Users ; Gabriel, Edgar Subject: Re: [OMPI users] Slow collective MPI File IO Hi Collin, It is written in C. So, I think it is OK. Thank you, David On Mon, Apr 6, 2020 at 10:19 AM Collin Strassburger mailto:cstrassbur...@bihrle.com>> wrote: Hello, Just a quick comment on this; is your code written in C/C++ or Fortran? Fortran has issues with writing at a decent speed regardless of MPI setup and as such should be avoided for file IO (yet I still occasionally see it implemented). Collin From: users mailto:users-boun...@lists.open-mpi.org>> On Behalf Of Dong-In Kang via users Sent: Monday, April 6, 2020 10:02 AM To: Gabriel, Edgar mailto:egabr...@central.uh.edu>> Cc: Dong-In Kang mailto:dik...@gmail.com>>; Open MPI Users mailto:users@lists.open-mpi.org>> Subject: Re: [OMPI users] Slow collective MPI File IO Thank you Edgar for the information. I also tried MPI_File_write_at_all(), but it usually makes the performance worse. My program is very simple. Each MPI process writes a consecutive portion of a file. No interleaving among the MPI processes. I think in this case I can use MPI_File_write_at(). I tested the maximum bandwidth of the target devices and they are at least a few times bigger than what a single process can achieve. I tested it using the same program but opening individual files with MPI_COMM_SELF. I tested a 32MB chunk, but it didn't show noticeable changes. I also tried a 512MB chunk, but no noticeable difference. (There are performance differences between using a 32MB chunk and using a 512MB chunk.
But they still don't make multi-process file I/O exceed the performance of single-process file I/O.) As for the local disk, it is at least 2 times faster than what a single MPI process can achieve. As for the ramdisk, at least 5 times faster. For Lustre, I know that it is at least 7-8 times or more faster depending on the configuration. About the caching effect, that would be the case for MPI_File_read(). I can see very high bandwidth from MPI_File_read(), which I believe comes from caches in RAM. But as for MPI_File_write(), I think it isn't affected by caching. And I create a new file for each test and remove the file at the end of the testing. I may be making a very simple mistake, but I don't know what it is. From a few reports on the internet, I saw that MPI file I/O could achieve several times the speedup of single-process file I/O when a faster file system such as Lustre is used. I started this experiment because I couldn't get a speedup on the Lustre file system. And then I moved the experiment to the ramdisk and local disk, because that removes the issue of Lustre configuration. Any comments are welcome. David On Mon, Apr 6, 2020 at 9:03 AM Gabriel, Edgar mailto:egabr...@central.uh.edu>> wrote: Hi, A couple of comments. First, if you use MPI_File_write_at, this is usually not considered collective I/O, even if executed by multiple processes. MPI_File_write_at_all would be collective I/O. Second, MPI I/O can not do ‘magic’, but is bound by the hardware that you are providing. If a single process is already able to saturate the bandwidth of your file system and hardware, you will not be able to see performance improvements from multiple processes (some minor exceptions maybe due to caching effects, but that is only for smaller problem sizes; the larger the amount of data that you try to write, the smaller the caching effects become in file I/O).
So the first question that you have to answer is: what is the sustained bandwidth of your hardware, and are you able to saturate it already with a single process? If you are using a single hard drive (or even 2 or 3 hard drives in a RAID 0 configuration), this is almost certainly the case. Lastly, the configuration parameters of your tests also play a major role. As a general rule, the larger the amount of data you are able to provide per file I/O call, the better the performance will be. 1MB of data per call is probably on the smaller side. The ompio implementation of MPI I/O internally breaks large individual I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance reasons. Large collective I/O operations (e.g. MPI_File_write_at_all) are broken into chunks of 32 MB. This gives you some hints on the quantities of data that you would have to use for performance reasons. Along the same lines, one final comment. You say you did 1000 writes of 1MB each. For a single process that is about 1GB of data.
Re: [OMPI users] Slow collective MPI File IO
Hi, A couple of comments. First, if you use MPI_File_write_at, this is usually not considered collective I/O, even if executed by multiple processes. MPI_File_write_at_all would be collective I/O. Second, MPI I/O can not do ‘magic’, but is bound by the hardware that you are providing. If a single process is already able to saturate the bandwidth of your file system and hardware, you will not be able to see performance improvements from multiple processes (some minor exceptions maybe due to caching effects, but that is only for smaller problem sizes; the larger the amount of data that you try to write, the smaller the caching effects become in file I/O). So the first question that you have to answer is: what is the sustained bandwidth of your hardware, and are you able to saturate it already with a single process? If you are using a single hard drive (or even 2 or 3 hard drives in a RAID 0 configuration), this is almost certainly the case. Lastly, the configuration parameters of your tests also play a major role. As a general rule, the larger the amount of data you are able to provide per file I/O call, the better the performance will be. 1MB of data per call is probably on the smaller side. The ompio implementation of MPI I/O internally breaks large individual I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance reasons. Large collective I/O operations (e.g. MPI_File_write_at_all) are broken into chunks of 32 MB. This gives you some hints on the quantities of data that you would have to use for performance reasons. Along the same lines, one final comment. You say you did 1000 writes of 1MB each. For a single process that is about 1GB of data.
Depending on how much main memory your PC has, this amount of data can still be cached in modern systems, and you might have an unrealistically high bandwidth value for the 1-process case that you are comparing against (it depends a bit on what your benchmark does, and whether you force flushing the data to disk inside of your measurement loop). Hope this gives you some pointers on where to start to look. Thanks Edgar From: users On Behalf Of Dong-In Kang via users Sent: Monday, April 6, 2020 7:14 AM To: users@lists.open-mpi.org Cc: Dong-In Kang Subject: [OMPI users] Slow collective MPI File IO Hi, I am running an MPI program where N processes write to a single file on a single shared memory machine. I’m using OpenMPI v.4.0.2. Each MPI process writes a 1MB chunk of data 1K times sequentially. There is no overlap in the file between any of the two MPI processes. I ran the program for -np = {1, 2, 4, 8}. I am seeing that the speed of the collective write to a file for -np = {2, 4, 8} never exceeds the speed of -np = {1}. I did the experiment with a few different file systems {local disk, ram disk, Lustre FS}. For all of them, I see similar results. The speed of the collective write to a single shared file never exceeds the speed of the single MPI process case. Any tips or suggestions? I used the MPI_File_write_at() routine with the proper offset for each MPI process. (I also tried the MPI_File_write_at_all() routine, which makes the performance worse as np gets bigger.) Before writing, MPI_Barrier() is used. The start time is taken right after MPI_Barrier() using MPI_Wtime(); the end time is taken right after another MPI_Barrier(). The speed of the collective write is calculated as (total data amount written to the file)/(time between the first MPI_Barrier() and the second MPI_Barrier()). Any idea to increase the speed? Thanks, David
Re: [OMPI users] How to prevent linking in GPFS when it is present
ompio only recently added support for GPFS, and it is only available in master (so far). If you are using any of the released versions of Open MPI (2.x, 3.x, 4.x) you will not find this feature in ompio yet. Thus, the issue is only how to disable GPFS in romio. I could not immediately find an option for that, but I will keep looking. Thanks Edgar -Original Message- From: users On Behalf Of Jonathon A Anderson via users Sent: Monday, March 30, 2020 4:36 PM To: users@lists.open-mpi.org Cc: Jonathon A Anderson Subject: Re: [OMPI users] How to prevent linking in GPFS when it is present I'm going to try ac_cv_header_gpfs_h=no; but --without-gpfs doesn't seem to exist. I tried it on both 3.1.5 and 2.1.6 [joan5896@admin2 openmpi-3.1.5]$ ./configure --without-gpfs configure: WARNING: unrecognized options: --without-gpfs From: users on behalf of Gilles Gouaillardet via users Sent: Sunday, March 29, 2020 6:17 PM To: users@lists.open-mpi.org Cc: Gilles Gouaillardet Subject: Re: [OMPI users] How to prevent linking in GPFS when it is present Jonathon, GPFS is used by both the ROMIO component (that comes from MPICH) and the fs/gpfs component that is used by ompio (the native Open MPI MPI-IO, so to speak). You should be able to disable both by running ac_cv_header_gpfs_h=no configure --without-gpfs ... Note that Open MPI is modular by default (e.g. unless you configure --disable-dlopen), and if you run it on a node that does not have libgpfs.so[.version], you might only see a warning and Open MPI will use ompio (note that might not apply on Lustre, since only ROMIO is used on this filesystem). Cheers, Gilles On 3/30/2020 8:25 AM, Jonathon A Anderson via users wrote: > We are trying to build Open MPI on a system that happens to have GPFS > installed. This appears to cause Open MPI to detect gpfs.h and link against > libgpfs.so. We are trying to build a central software stack for use on > multiple clusters, some of which do not have GPFS.
(It is our experience that > this provokes an error, as libgpfs.so is not found on these clusters.) To > accommodate this I want to build openmpi explicitly without linking against > GPFS. > > I tried to accomplish this with > > ./configure --with-io-romio-flags='--with-file-system=ufs+nfs' > > But gpfs was still linked. > > configure:397895: result: -lhwloc -ldl -lz -lpmi2 -lrt -lgpfs -lutil > -lm -lfabric > > How can I tell Open MPI to not link against GPFS? > > ~jonathon > > > p.s., I realize that I could just build on a system that does not have GPFS > installed; but I am trying to genericize this to encapsulate in the Spack > package. I also don't understand why the Spack package is detecting gpfs.h in > the first place, as I thought Spack tries to isolate its build environment > from the host system; but I'll ask them that in a separate message.
Re: [OMPI users] Read from file performance degradation when increasing number of processors in some cases
How is the performance if you leave a few cores for the OS, e.g. running with 60 processes instead of 64? The reasoning being that the file read operation is really executed by the OS, and could potentially be quite resource intensive. Thanks Edgar From: users On Behalf Of Ali Cherry via users Sent: Friday, March 6, 2020 8:06 AM To: Open MPI Users Cc: Ali Cherry Subject: Re: [OMPI users] Read from file performance degradation when increasing number of processors in some cases Hello, Thank you for your replies. Yes, it is only a single node with 64 cores. The input file is copied from nfs to a tmpfs when I start the node. The mpirun command lines were:

$ mpirun -np 64 --mca btl vader,self pms.out /run/user/10002/bigarray.in > pms-vader-64.log 2>&1
$ mpirun -np 32 --mca btl vader,self pms.out /run/user/10002/bigarray.in > pms-vader-32.log 2>&1
$ mpirun -np 32 --mca btl tcp,self pms.out /run/user/10002/bigarray.in > pms-tcp-32.log 2>&1
$ mpirun -np 64 --mca btl tcp,self pms.out /run/user/10002/bigarray.in > pms-tcp-64.log 2>&1
$ mpirun -np 32 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-vader-32.log 2>&1
$ mpirun -np 64 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > mpijr-vader-64.log 2>&1

I added mpi_just_read_barrier.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read_barrier-c Unfortunately, despite running mpi_just_read_barrier with 32 cores and --bind-to core set, I was unable to run it with 64 cores for the following reason: -- A request was made to bind to that would result in binding more processes than cpus on a resource: Bind to: CORE Node: compute-0 #processes: 2 #cpus: 1 You can override this protection by adding the "overload-allowed" option to your binding directive. — I will solve this and get back to you soon. Best regards, Ali Cherry.
On Mar 6, 2020, at 3:24 PM, Gilles Gouaillardet via users mailto:users@lists.open-mpi.org>> wrote: Also, in mpi_just_read.c, what if you add MPI_Barrier(MPI_COMM_WORLD); right before invoking MPI_Finalize(); can you observe a similar performance degradation when moving from 32 to 64 tasks? Cheers, Gilles - Original Message - Hi, The log filenames suggest you are always running on a single node, is that correct? Do you create the input file on the tmpfs once for all, or before each run? Can you please post your mpirun command lines? If you did not bind the tasks, can you try again with mpirun --bind-to core ... Cheers, Gilles - Original Message - Hi, We faced an issue when testing the scalability of parallel merge sort using a reduction tree on an array of size 1024^3. Currently, only the master opens the input file and parses it into an array using fscanf, and then distributes the array to the other processors. When using 32 processors, it took ~109 seconds to read from the file. When using 64 processors, it took ~216 seconds to read from the file. Despite varying the number of processors, only one processor (the master) read the file. The input file is stored in a tmpfs; it is made up of 1024^3 + 1 numbers (where the first number is the array size). Additionally, I ran a C program that only read the file; it took ~104 seconds. However, I also ran an MPI program that only read the file; it took ~116 and ~118 seconds on 32 and 64 processors respectively. Code at https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf parallel_ms.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-parallel_ms-c mpi_just_read.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read-c just_read.c: https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-just_read-c Clearly, increasing the number of processors in mpi_just_read.c did not severely affect the elapsed time.
For parallel_ms.c, is it possible that the 63 processors blocked in a receive from processor 0 are somehow affecting the read-from-file elapsed time? Any assistance or clarification would be appreciated. Ali.
Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI
I am not an expert on the one-sided code in Open MPI, but I wanted to comment briefly on the potential MPI I/O related item. As far as I can see, the error message “Read -1, expected 48, errno = 1” does not stem from MPI I/O, at least not from the ompio library. What file system did you use for these tests? Thanks Edgar From: users On Behalf Of Matt Thompson via users Sent: Monday, February 24, 2020 1:20 PM To: users@lists.open-mpi.org Cc: Matt Thompson Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI All, My guess is this is a "I built Open MPI incorrectly" sort of issue, but I'm not sure how to fix it. Namely, I'm currently trying to get an MPI project's CI working on CircleCI using Open MPI to run some unit tests (on a single node, so I need some oversubscription). I can build everything just fine, but when I try to run, things just...blow up: [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 6 -ngo 1 -ngi 1 -v T,U -s mpi start app rank: 0 start app rank: 1 start app rank: 2 start app rank: 3 start app rank: 4 start app rank: 5 [3796b115c961:03629] Read -1, expected 48, errno = 1 [3796b115c961:03629] *** An error occurred in MPI_Get [3796b115c961:03629] *** reported by process [2144600065,12] [3796b115c961:03629] *** on win rdma window 5 [3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list [3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort, [3796b115c961:03629] *** and potentially your MPI job) I'm currently more concerned about the MPI_Get error, though I'm not sure what that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now this code is fairly fancy MPI code, so I decided to try a simpler one.
Searched the internet and found an example program here: https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication and when I build and run with Intel MPI it works: (1027)(master) $ mpirun -V Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 18555) Copyright 2003-2018 Intel Corporation. (1028)(master) $ mpiicc rma_test.c (1029)(master) $ mpirun -np 2 ./a.out srun.slurm: cluster configuration lacks support for cpu binding Rank 0 running on borgj001 Rank 1 running on borgj001 Rank 0 sets data in the shared memory: 00 01 02 03 Rank 1 sets data in the shared memory: 10 11 12 13 Rank 0 gets data from the shared memory: 10 11 12 13 Rank 1 gets data from the shared memory: 00 01 02 03 Rank 0 has new data in the shared memory:Rank 1 has new data in the shared memory: 10 11 12 13 00 01 02 03 So, I have some confidence it was written correctly. Now on the same system I try with Open MPI (building with gcc, not Intel C): (1032)(master) $ mpirun -V mpirun (Open MPI) 4.0.1 Report bugs to http://www.open-mpi.org/community/help/ (1033)(master) $ mpicc rma_test.c (1034)(master) $ mpirun -np 2 ./a.out Rank 0 running on borgj001 Rank 1 running on borgj001 Rank 0 sets data in the shared memory: 00 01 02 03 Rank 1 sets data in the shared memory: 10 11 12 13 [borgj001:22668] *** An error occurred in MPI_Get [borgj001:22668] *** reported by process [2514223105,1] [borgj001:22668] *** on win rdma window 3 [borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range [borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort, [borgj001:22668] ***and potentially your MPI job) [borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal [borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages This is a similar failure to above. Any ideas what I might be doing wrong here? I don't doubt I'm missing something, but I'm not sure what. 
Open MPI was built pretty boringly: Configure command line: '--with-slurm' '--enable-shared' '--disable-wrapper-rpath' '--disable-wrapper-runpath' '--enable-mca-no-build=btl-usnic' '--prefix=...' And I'm not sure if we need those disable-wrapper bits anymore, but long ago we needed them, and so they've lived on in "how to build" READMEs until something breaks. This btl-usnic is a bit unknown to me (this was built by sysadmins on a cluster), but this is pretty close to how I build on my desktop and it has the same issue. Any ideas from the experts? -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton
Re: [OMPI users] Deadlock in netcdf tests
Orion, It might be a good idea. This bug is triggered from the fcoll/two_phase component (and having spent just two minutes looking at it, I have a suspicion what triggers it, namely an int vs. long conversion issue), so it is probably unrelated to the other one. I need to add running the netcdf test cases to my list of standard testsuites; we didn't use to have any problems with them :-( Thanks for the report, we will be working on them! Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Orion > Poplawski via users > Sent: Friday, October 25, 2019 10:21 PM > To: Open MPI Users > Cc: Orion Poplawski > Subject: Re: [OMPI users] Deadlock in netcdf tests > > Thanks for the response, the workaround helps. > > With that out of the way I see: > > + mpiexec -n 4 ./tst_parallel4 > Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >= > num_aggregators(1)fd_size=461172966257152 off=4156705856 > Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >= > num_aggregators(1)fd_size=4611731477435006976 off=4157193280 > > Should I file issues for both of these? > > On 10/25/19 2:29 AM, Gilles Gouaillardet via users wrote: > > Orion, > > > > > > thanks for the report. > > > > > > I can confirm this is indeed an Open MPI bug. > > > > FWIW, a workaround is to disable the fcoll/vulcan component. > > > > That can be achieved by > > > > mpirun --mca fcoll ^vulcan ... > > > > or > > > > OMPI_MCA_fcoll=^vulcan mpirun ... > > > > > > I also noted the tst_parallel3 program crashes with the ROMIO component. > > > > > > Cheers, > > > > > > Gilles > > > > On 10/25/2019 12:55 PM, Orion Poplawski via users wrote: > >> On 10/24/19 9:28 PM, Orion Poplawski via users wrote: > >>> Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are > >>> seeing a test hang with openmpi 4.0.2.
Backtrace: > >>> > >>> (gdb) bt > >>> #0 0x7f90c197529b in sched_yield () from /lib64/libc.so.6 > >>> #1 0x7f90c1ac8a05 in ompi_request_default_wait () from > >>> /usr/lib64/openmpi/lib/libmpi.so.40 > >>> #2 0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from > >>> /usr/lib64/openmpi/lib/libmpi.so.40 > >>> #3 0x7f90c1b2bb73 in > >>> ompi_coll_base_allreduce_intra_recursivedoubling () from > >>> /usr/lib64/openmpi/lib/libmpi.so.40 > >>> #4 0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from > >>> /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so > >>> #5 0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from > >>> /usr/lib64/openmpi/lib/libmca_common_ompio.so.41 > >>> #6 0x7f90beb0610b in mca_io_ompio_file_write_at_all () from > >>> /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so > >>> #7 0x7f90c1af033f in PMPI_File_write_at_all () from > >>> /usr/lib64/openmpi/lib/libmpi.so.40 > >>> #8 0x7f90c1627d7b in H5FD_mpio_write () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #9 0x7f90c14636ee in H5FD_write () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #10 0x7f90c1442eb3 in H5F__accum_write () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #11 0x7f90c1543729 in H5PB_write () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #12 0x7f90c144d69c in H5F_block_write () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #13 0x7f90c161cd10 in H5C_apply_candidate_list () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #14 0x7f90c161ad02 in H5AC__run_sync_point () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #15 0x7f90c161bd4f in H5AC__flush_entries () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #16 0x7f90c13b154d in H5AC_flush () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #18 0x7f90c1448e64 in H5F__flush () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #19 0x7f90c144dc08 in 
H5F_flush_mounts_recurse () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #20 0x7f90c144f171 in H5F_flush_mounts () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #21 0x7f90c143e3a5 in H5Fflush () from > >>> /usr/lib64/openmpi/lib/libhdf5.so.103 > >>> #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at > >>> ../../libhdf5/hdf5file.c:222 > >>> #23 0x7f90c1c1816e in NC4_enddef (ncid=<optimized out>) at > >>> ../../libhdf5/hdf5file.c:544 > >>> #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at > >>> ../../libdispatch/dfile.c:1004 > >>> #25 0x56527d0def27 in test_pio (flag=0) at > >>> ../../nc_test4/tst_parallel3.c:206 > >>> #26 0x56527d0de62c in main (argc=<optimized out>, argv=<optimized out>) at ../../nc_test4/tst_parallel3.c:91 > >>> > >>> processes are running full out. > >>> > >>> Suggestions for debugging this would be greatly appreciated. > >>> > >> > >> Some more info - I think now it is more dependent on openmpi versions > >> than netcdf itself: > >> > >> - last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx 1.5.2, pmix-3.1.4.
Re: [OMPI users] Deadlock in netcdf tests
Never mind, I see it in the backtrace :-) Will look into it, but am currently traveling. Until then, Gilles suggestion is probably the right approach. Thanks Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gabriel, > Edgar via users > Sent: Friday, October 25, 2019 7:43 AM > To: Open MPI Users > Cc: Gabriel, Edgar > Subject: Re: [OMPI users] Deadlock in netcdf tests > > Orion, > I will look into this problem, is there a specific code or testcase that > triggers > this problem? > Thanks > Edgar > > > -Original Message- > > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of > > Orion Poplawski via users > > Sent: Thursday, October 24, 2019 11:56 PM > > To: Open MPI Users > > Cc: Orion Poplawski > > Subject: Re: [OMPI users] Deadlock in netcdf tests > > > > On 10/24/19 9:28 PM, Orion Poplawski via users wrote: > > > Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are > > > seeing a test hang with openmpi 4.0.2. Backtrace: > > > > > > (gdb) bt > > > #0 0x7f90c197529b in sched_yield () from /lib64/libc.so.6 > > > #1 0x7f90c1ac8a05 in ompi_request_default_wait () from > > > /usr/lib64/openmpi/lib/libmpi.so.40 > > > #2 0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from > > > /usr/lib64/openmpi/lib/libmpi.so.40 > > > #3 0x7f90c1b2bb73 in > > > ompi_coll_base_allreduce_intra_recursivedoubling () from > > > /usr/lib64/openmpi/lib/libmpi.so.40 > > > #4 0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from > > > /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so > > > #5 0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from > > > /usr/lib64/openmpi/lib/libmca_common_ompio.so.41 > > > #6 0x7f90beb0610b in mca_io_ompio_file_write_at_all () from > > > /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so > > > #7 0x7f90c1af033f in PMPI_File_write_at_all () from > > > /usr/lib64/openmpi/lib/libmpi.so.40 > > > #8 0x7f90c1627d7b in H5FD_mpio_write () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 
> > > #9 0x7f90c14636ee in H5FD_write () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #10 0x7f90c1442eb3 in H5F__accum_write () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #11 0x7f90c1543729 in H5PB_write () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #12 0x7f90c144d69c in H5F_block_write () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #13 0x7f90c161cd10 in H5C_apply_candidate_list () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #14 0x7f90c161ad02 in H5AC__run_sync_point () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #15 0x7f90c161bd4f in H5AC__flush_entries () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #16 0x7f90c13b154d in H5AC_flush () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #18 0x7f90c1448e64 in H5F__flush () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #20 0x7f90c144f171 in H5F_flush_mounts () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #21 0x7f90c143e3a5 in H5Fflush () from > > > /usr/lib64/openmpi/lib/libhdf5.so.103 > > > #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at > > > ../../libhdf5/hdf5file.c:222 > > > #23 0x7f90c1c1816e in NC4_enddef (ncid=) at > > > ../../libhdf5/hdf5file.c:544 > > > #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at > > > ../../libdispatch/dfile.c:1004 > > > #25 0x56527d0def27 in test_pio (flag=0) at > > > ../../nc_test4/tst_parallel3.c:206 > > > #26 0x56527d0de62c in main (argc=, > > > argv= > > out>) at ../../nc_test4/tst_parallel3.c:91 > > > > > > processes are running full out. > > > > > > Suggestions for debugging this would be greatly appreciated. 
> > > > > > > Some more info - I think now it is more dependent on openmpi versions > > than netcdf itself: > > > > - last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx > > 1.5.2, pmix-3.1.4. Possible start of the failure was with openmpi > > 4.0.2-rc1 and ucx 1.6.0. > > > > - netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2, > > ucx 1.6.1, pmix 3.1.4 > > > > - netcdf 4.7.0 test hangs on Fedora F31 with openmpi 4.0.2rc2 with > > internal UCX. > > > > -- > > Orion Poplawski > > Manager of NWRA Technical Systems 720-772-5637 > > NWRA, Boulder/CoRA Office FAX: 303-415-9702 > > 3380 Mitchell Lane or...@nwra.com > > Boulder, CO 80301 https://www.nwra.com/
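For readers hitting this netcdf/HDF5 hang before a fix lands: the "Gilles suggestion" referenced at the top of this thread is to exclude the ompio component so MPI I/O falls back to ROMIO. A minimal sketch (the MCA syntax is standard Open MPI; the mpirun line and test binary name are illustrative only):

```shell
# Workaround sketch: exclude the ompio MPI-I/O component so Open MPI
# falls back to ROMIO for the collective write that hangs above.
export OMPI_MCA_io=^ompio      # the leading "^" excludes the listed component

# Equivalent on the command line (illustrative, not executed here):
#   mpirun --mca io ^ompio -np 6 ./tst_parallel3

echo "OMPI_MCA_io=$OMPI_MCA_io"
```

Note that the environment variable must actually reach the MPI processes, which is worth double-checking under batch launchers such as srun.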
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Yes, I was talking about the same thing, although for me it was not t_mpi, but t_shapesame that was hanging. It might be an indication of the same issue however. > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan > Novosielski > Sent: Thursday, February 21, 2019 1:59 PM > To: Open MPI Users > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > 3.1.3 > > > > On Feb 21, 2019, at 2:52 PM, Gabriel, Edgar > wrote: > > > >> -Original Message- > >>> Does it always occur at 20+ minutes elapsed ? > >> > >> Aha! Yes, you are right: every time it fails, it’s at the 20 minute > >> and a couple of seconds mark. For comparison, every time it runs, it > >> runs for 2-3 seconds total. So it seems like what might actually be > >> happening here is a hang, and not a failure of the test per se. > >> > > > > I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8 > (although this was OpenSuSE, not Redhat), and it looked to me like one of > tests were hanging, but I didn't have time to investigate it further. > > Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The > OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t > believe it ever launches any jobs or anything like that. > > -- > > || \\UTGERS, > |---*O*--- > ||_// the State| Ryan Novosielski - novos...@rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\of NJ| Office of Advanced Research Computing - MSB C630, > Newark > `' ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> -Original Message- > > Does it always occur at 20+ minutes elapsed ? > > Aha! Yes, you are right: every time it fails, it’s at the 20 minute and a > couple > of seconds mark. For comparison, every time it runs, it runs for 2-3 seconds > total. So it seems like what might actually be happening here is a hang, and > not a failure of the test per se. > I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8 (although this was OpenSuSE, not Redhat), and it looked to me like one of tests were hanging, but I didn't have time to investigate it further. Thanks Edgar > > Is there some mechanism that automatically kills a job if it does not write > anything to stdout for some time ? > > > > A quick way to rule that out is to > > > > srun -- mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800 > > > > and see if that completes or get killed with the same error message. > > I was not aware of anything like that, but I’ll look into it now (running your > suggestion). I guess we don’t run across this sort of thing very often — most > stuff at least prints output when it starts. > > > You can also run use mpirun instead of srun, and even run mpirun > > outside of slurm > > > > (if your cluster policy allows it, you can for example use mpirun and > > run on the frontend node) > > I’m on the team that manages the cluster, so we can try various things. Every > piece of software we ever run, though, runs via srun — we don’t provide > mpirun as a matter of course, except in some corner cases. > > > On 2/21/2019 3:01 AM, Ryan Novosielski wrote: > >> Does it make any sense that it seems to work fine when OpenMPI and > HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with > RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 > build, I did try an XFS filesystem and it didn’t help. GPFS works fine for > either > of the 7.4 and 8.2 builds. 
> >> > >> Just as a reminder, since it was reasonably far back in the thread, what > I’m doing is running the “make check” tests in HDF5 1.10.4, in part because > users use it, but also because it seems to have a good test suite and I can > therefore verify the compiler and MPI stack installs. I get very little > information, apart from it not working and getting that “Alarm clock” > message. > >> > >> I originally suspected I’d somehow built some component of this with a > host-specific optimization that wasn’t working on some compute nodes. But I > controlled for that and it didn’t seem to make any difference. > >> > >> -- > >> > >> || \\UTGERS, > >> |---*O*--- > >> ||_// the State | Ryan Novosielski - novos...@rutgers.edu > >> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS > Campus > >> || \\of NJ | Office of Advanced Research Computing - MSB C630, > Newark > >> `' > >> > >>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski > wrote: > >>> > >>> It didn’t work any better with XFS, as it happens. Must be something > else. I’m going to test some more and see if I can narrow it down any, as it > seems to me that it did work with a different compiler. > >>> > >>> -- > >>> > >>> || \\UTGERS, > >>> |---*O*--- > >>> ||_// the State| Ryan Novosielski - novos...@rutgers.edu > >>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS > Campus > >>> || \\of NJ| Office of Advanced Research Computing - MSB > C630, Newark > >>> `' > >>> > >>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar > wrote: > >>>> > >>>> While I was working on something else, I let the tests run with Open > MPI master (which is for parallel I/O equivalent to the upcoming v4.0.1 > release), and here is what I found for the HDF5 1.10.4 tests on my local > desktop: > >>>> > >>>> In the testpar directory, there is in fact one test that fails for both > ompio and romio321 in exactly the same manner. 
> >>>> I used 6 processes as you did (although I used mpirun directly instead > of srun...) From the 13 tests in the testpar directory, 12 pass correctly > (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, > t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame). > >>>> > >>>> The one tests that off
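Gilles' `srun ... sleep 1800` experiment quoted in this thread is designed to detect a watchdog that kills jobs producing no output. The same behavior can be illustrated locally with coreutils `timeout` (assumed available), which exits with status 124 after killing a child that ran too long:

```shell
# Illustration of a watchdog killing a quiet job: coreutils `timeout`
# exits with status 124 when it has to kill the child process itself.
rc=0
timeout 1 sleep 5 || rc=$?
echo "exit status: $rc"   # 124 here; a batch watchdog behaves analogously
```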
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Well, the way you describe it, it sounds to me like maybe an atomic issue with this compiler version. What was your configure line of Open MPI, and what network interconnect are you using? An easy way to test this theory would be to force OpenMPI to use the tcp interfaces (everything will be slow however). You can do that by creating in your home directory a directory called .openmpi, and add there a file called mca-params.conf The file should look something like this: btl = tcp,self Thanks Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan > Novosielski > Sent: Wednesday, February 20, 2019 12:02 PM > To: Open MPI Users > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > 3.1.3 > > Does it make any sense that it seems to work fine when OpenMPI and HDF5 > are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL- > supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build, > I did try an XFS filesystem and it didn’t help. GPFS works fine for either of > the > 7.4 and 8.2 builds. > > Just as a reminder, since it was reasonably far back in the thread, what I’m > doing is running the “make check” tests in HDF5 1.10.4, in part because users > use it, but also because it seems to have a good test suite and I can > therefore > verify the compiler and MPI stack installs. I get very little information, > apart > from it not working and getting that “Alarm clock” message. > > I originally suspected I’d somehow built some component of this with a host- > specific optimization that wasn’t working on some compute nodes. But I > controlled for that and it didn’t seem to make any difference. > > -- > > || \\UTGERS, > |---*O*--- > ||_// the State| Ryan Novosielski - novos...@rutgers.edu > || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\of NJ| Office of Advanced Research Computing - MSB C630, > Newark > `' > > > On Feb 18, 2019, at 1:34 PM, Ryan Novosielski > wrote: > > > > It didn’t work any better with XFS, as it happens. Must be something else. > I’m going to test some more and see if I can narrow it down any, as it seems > to me that it did work with a different compiler. > > > > -- > > > > || \\UTGERS, > > |---*O*--- > > ||_// the State | Ryan Novosielski - novos...@rutgers.edu > > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS > Campus > > || \\of NJ | Office of Advanced Research Computing - MSB C630, > Newark > > `' > > > >> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar > wrote: > >> > >> While I was working on something else, I let the tests run with Open MPI > master (which is for parallel I/O equivalent to the upcoming v4.0.1 release), > and here is what I found for the HDF5 1.10.4 tests on my local desktop: > >> > >> In the testpar directory, there is in fact one test that fails for both > >> ompio > and romio321 in exactly the same manner. > >> I used 6 processes as you did (although I used mpirun directly instead of > srun...) From the 13 tests in the testpar directory, 12 pass correctly > (t_bigio, > t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi, > t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame). > >> > >> The one tests that officially fails ( t_pflush1) actually reports that it > >> passed, > but then throws message that indicates that MPI_Abort has been called, for > both ompio and romio. I will try to investigate this test to see what is going > on. > >> > >> That being said, your report shows an issue in t_mpi, which passes > without problems for me. This is however not GPFS, this was an XFS local file > system. Running the tests on GPFS are on my todo list as well. 
> >> > >> Thanks > >> Edgar > >> > >> > >> > >>> -Original Message- > >>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of > >>> Gabriel, Edgar > >>> Sent: Sunday, February 17, 2019 10:34 AM > >>> To: Open MPI Users > >>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems > >>> w/OpenMPI > >>> 3.1.3 > >>> > >>> I will also run our testsuite and the HDF5 testsuite on GPFS, I have > >>> access to a GPFS file system since recently, and wil
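Edgar's per-user MCA file from this thread can be set up as follows. The sketch writes into a scratch directory so it is safe to run anywhere; on a real system the file belongs at $HOME/.openmpi/mca-params.conf:

```shell
# Sketch of creating the per-user MCA parameter file described above.
# A scratch directory stands in for $HOME so the sketch has no side effects.
conf_dir=$(mktemp -d)/.openmpi
mkdir -p "$conf_dir"
cat > "$conf_dir/mca-params.conf" <<'EOF'
# Force the TCP transport; slow, but rules out interconnect/atomics issues
btl = tcp,self
EOF
cat "$conf_dir/mca-params.conf"
```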
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
While I was working on something else, I let the tests run with Open MPI master (which is for parallel I/O equivalent to the upcoming v4.0.1 release), and here is what I found for the HDF5 1.10.4 tests on my local desktop: In the testpar directory, there is in fact one test that fails for both ompio and romio321 in exactly the same manner. I used 6 processes as you did (although I used mpirun directly instead of srun...) From the 13 tests in the testpar directory, 12 pass correctly (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame). The one tests that officially fails ( t_pflush1) actually reports that it passed, but then throws message that indicates that MPI_Abort has been called, for both ompio and romio. I will try to investigate this test to see what is going on. That being said, your report shows an issue in t_mpi, which passes without problems for me. This is however not GPFS, this was an XFS local file system. Running the tests on GPFS are on my todo list as well. Thanks Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of > Gabriel, Edgar > Sent: Sunday, February 17, 2019 10:34 AM > To: Open MPI Users > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > 3.1.3 > > I will also run our testsuite and the HDF5 testsuite on GPFS, I have access > to a > GPFS file system since recently, and will report back on that, but it will > take a > few days. > > Thanks > Edgar > > > -Original Message- > > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of > > Ryan Novosielski > > Sent: Sunday, February 17, 2019 2:37 AM > > To: users@lists.open-mpi.org > > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > > 3.1.3 > > > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA1 > > > > This is on GPFS. I'll try it on XFS to see if it makes any difference. 
> > > > On 2/16/19 11:57 PM, Gilles Gouaillardet wrote: > > > Ryan, > > > > > > What filesystem are you running on ? > > > > > > Open MPI defaults to the ompio component, except on Lustre > > > filesystem where ROMIO is used. (if the issue is related to ROMIO, > > > that can explain why you did not see any difference, in that case, > > > you might want to try an other filesystem (local filesystem or NFS > > > for example)\ > > > > > > > > > Cheers, > > > > > > Gilles > > > > > > On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski > > > wrote: > > >> > > >> I verified that it makes it through to a bash prompt, but I’m a > > >> little less confident that something make test does doesn’t clear it. > > >> Any recommendation for a way to verify? > > >> > > >> In any case, no change, unfortunately. > > >> > > >> Sent from my iPhone > > >> > > >>> On Feb 16, 2019, at 08:13, Gabriel, Edgar > > >>> > > >>> wrote: > > >>> > > >>> What file system are you running on? > > >>> > > >>> I will look into this, but it might be later next week. I just > > >>> wanted to emphasize that we are regularly running the parallel > > >>> hdf5 tests with ompio, and I am not aware of any outstanding items > > >>> that do not work (and are supposed to work). That being said, I > > >>> run the tests manually, and not the 'make test' > > >>> commands. Will have to check which tests are being run by that. > > >>> > > >>> Edgar > > >>> > > >>>> -Original Message- From: users > > >>>> [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles > > >>>> Gouaillardet Sent: Saturday, February 16, 2019 1:49 AM To: Open > > >>>> MPI Users Subject: Re: > > >>>> [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > > >>>> 3.1.3 > > >>>> > > >>>> Ryan, > > >>>> > > >>>> Can you > > >>>> > > >>>> export OMPI_MCA_io=^ompio > > >>>> > > >>>> and try again after you made sure this environment variable is > > >>>> passed by srun to the MPI tasks ? 
> > >>>> > > >>>> We have identified and fixed several issues specific t
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
I will also run our testsuite and the HDF5 testsuite on GPFS, I have access to a GPFS file system since recently, and will report back on that, but it will take a few days. Thanks Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan > Novosielski > Sent: Sunday, February 17, 2019 2:37 AM > To: users@lists.open-mpi.org > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > 3.1.3 > > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > This is on GPFS. I'll try it on XFS to see if it makes any difference. > > On 2/16/19 11:57 PM, Gilles Gouaillardet wrote: > > Ryan, > > > > What filesystem are you running on ? > > > > Open MPI defaults to the ompio component, except on Lustre filesystem > > where ROMIO is used. (if the issue is related to ROMIO, that can > > explain why you did not see any difference, in that case, you might > > want to try an other filesystem (local filesystem or NFS for example)\ > > > > > > Cheers, > > > > Gilles > > > > On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski > > wrote: > >> > >> I verified that it makes it through to a bash prompt, but I’m a > >> little less confident that something make test does doesn’t clear it. > >> Any recommendation for a way to verify? > >> > >> In any case, no change, unfortunately. > >> > >> Sent from my iPhone > >> > >>> On Feb 16, 2019, at 08:13, Gabriel, Edgar > >>> wrote: > >>> > >>> What file system are you running on? > >>> > >>> I will look into this, but it might be later next week. I just > >>> wanted to emphasize that we are regularly running the parallel > >>> hdf5 tests with ompio, and I am not aware of any outstanding items > >>> that do not work (and are supposed to work). That being said, I run > >>> the tests manually, and not the 'make test' > >>> commands. Will have to check which tests are being run by that. 
> >>> > >>> Edgar > >>> > >>>> -Original Message- From: users > >>>> [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles > >>>> Gouaillardet Sent: Saturday, February 16, 2019 1:49 AM To: Open MPI > >>>> Users Subject: Re: > >>>> [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > >>>> 3.1.3 > >>>> > >>>> Ryan, > >>>> > >>>> Can you > >>>> > >>>> export OMPI_MCA_io=^ompio > >>>> > >>>> and try again after you made sure this environment variable is > >>>> passed by srun to the MPI tasks ? > >>>> > >>>> We have identified and fixed several issues specific to the > >>>> (default) ompio component, so that could be a valid workaround > >>>> until the next release. > >>>> > >>>> Cheers, > >>>> > >>>> Gilles > >>>> > >>>> Ryan Novosielski wrote: > >>>>> Hi there, > >>>>> > >>>>> Honestly don’t know which piece of this puzzle to look at or how > >>>>> to get more > >>>> information for troubleshooting. I successfully built HDF5 > >>>> 1.10.4 with RHEL system GCC 4.8.5 and OpenMPI 3.1.3. Running the > >>>> “make check” in HDF5 is failing at the below point; I am using a > >>>> value of RUNPARALLEL='srun -- mpi=pmi2 -p main -t > >>>> 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise properly > >>>> configured. > >>>>> > >>>>> Thanks for any help you can provide. > >>>>> > >>>>> make[4]: Entering directory > >>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build- > >>>> gcc-4.8-openmpi-3.1.3/testpar' > >>>>> Testing t_mpi > >>>>> t_mpi Test Log > >>>>> srun: job 84126610 queued and > waiting > >>>>> for resources srun: job 84126610 has been allocated resources > >>>>> srun: error: slepner023: tasks 0-5: Alarm clock 0.01user > >>>>> 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata > >>>>> 5152maxresident)k 0inputs+0outputs (0major+1529minor)pagefaults > >>>>> 0swap
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
What file system are you running on? I will look into this, but it might be later next week. I just wanted to emphasize that we are regularly running the parallel hdf5 tests with ompio, and I am not aware of any outstanding items that do not work (and are supposed to work). That being said, I run the tests manually, and not the 'make test' commands. Will have to check which tests are being run by that. Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles > Gouaillardet > Sent: Saturday, February 16, 2019 1:49 AM > To: Open MPI Users > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > 3.1.3 > > Ryan, > > Can you > > export OMPI_MCA_io=^ompio > > and try again after you made sure this environment variable is passed by srun > to the MPI tasks ? > > We have identified and fixed several issues specific to the (default) ompio > component, so that could be a valid workaround until the next release. > > Cheers, > > Gilles > > Ryan Novosielski wrote: > >Hi there, > > > >Honestly don’t know which piece of this puzzle to look at or how to get more > information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL > system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is > failing at the below point; I am using a value of RUNPARALLEL='srun -- > mpi=pmi2 -p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise > properly configured. > > > >Thanks for any help you can provide. 
> > > >make[4]: Entering directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > > > >Testing t_mpi > > > >t_mpi Test Log > > > >srun: job 84126610 queued and waiting for resources > >srun: job 84126610 has been allocated resources > >srun: error: slepner023: tasks 0-5: Alarm clock 0.01user 0.00system > >20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k > >0inputs+0outputs (0major+1529minor)pagefaults 0swaps > >make[4]: *** [t_mpi.chkexe_] Error 1 > >make[4]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make[3]: *** [build-check-p] Error 1 > >make[3]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make[2]: *** [test] Error 2 > >make[2]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make[1]: *** [check-am] Error 2 > >make[1]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make: *** [check-recursive] Error 1 > > > >-- > > > >|| \\UTGERS, > >|---*O*--- > >||_// the State | Ryan Novosielski - novos...@rutgers.edu > >|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > >|| \\of NJ | Office of Advanced Research Computing - MSB C630, > >Newark > > `' > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
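"Alarm clock" in the srun error above is the shell's standard message for a process terminated by SIGALRM (signal 14), which fits a fixed 20-minute timer far better than a genuine test failure. A process killed that way exits with status 128 + 14 = 142, as this minimal demonstration shows:

```shell
# "Alarm clock" = death by SIGALRM (signal 14); the parent shell then
# sees exit status 128 + 14 = 142.
rc=0
sh -c 'kill -s ALRM $$' || rc=$?
echo "exit status: $rc"
```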
Re: [OMPI users] Building OpenMPI with Lustre support using PGI fails
Gilles submitted a patch for that, and I approved it a couple of days back, I *think* it has not been merged however. This was a bug in the Open MPI Lustre configure logic, should be fixed after this one however. https://github.com/open-mpi/ompi/pull/6080 Thanks Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Latham, > Robert J. via users > Sent: Tuesday, November 27, 2018 2:03 PM > To: users@lists.open-mpi.org > Cc: Latham, Robert J. ; gi...@rist.or.jp > Subject: Re: [OMPI users] Building OpenMPI with Lustre support using PGI fails > > On Tue, 2018-11-13 at 21:57 -0600, gil...@rist.or.jp wrote: > > Raymond, > > > > can you please compress and post your config.log ? > > I didn't see the config.log in response to this. Maybe Ray and Giles took the > discusison off list? As someone who might have introduced the offending > configure-time checks, I'm particularly interested in fixing lustre detection. > > ==rob > > > > > > > Cheers, > > > > Gilles > > > > - Original Message - > > > I am trying to build OpenMPI with Lustre support using PGI 18.7 on > > > CentOS 7.5 (1804). > > > > > > It builds successfully with Intel compilers, but fails to find the > > > necessary Lustre components with the PGI compiler. > > > > > > I have tried building OpenMPI 4.0.0, 3.1.3 and 2.1.5. I can > > > build > > > OpenMPI, but configure does not find the proper Lustre files. > > > > > > Lustre is installed from current client RPMS, version 2.10.5 > > > > > > Include files are in /usr/include/lustre > > > > > > When specifying --with-lustre, I get: > > > > > > --- MCA component fs:lustre (m4 configuration macro) checking for > > > MCA component fs:lustre compile mode... dso checking --with-lustre > > > value... simple ok (unspecified value) looking for header without > > > includes checking lustre/lustreapi.h usability... yes checking > > > lustre/lustreapi.h presence... yes checking for > > > lustre/lustreapi.h... 
yes checking for library containing > > > llapi_file_create... -llustreapi checking if liblustreapi requires > > > libnl v1 or v3... > > > checking for required lustre data structures... no > > > configure: error: Lustre support requested but not found. Aborting > > > > > > > > > -- > > > > > > Ray Muno > > > IT Manager > > > > > > > > >University of Minnesota > > > Aerospace Engineering and Mechanics Mechanical > > > Engineering > > > 110 Union St. S.E. 111 Church Street SE > > > Minneapolis, MN 55455 Minneapolis, MN 55455 > > > > > > ___ > > > users mailing list > > > users@lists.open-mpi.org > > > https://lists.open-mpi.org/mailman/listinfo/users > > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://lists.open-mpi.org/mailman/listinfo/users > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] ompio on Lustre
Dave, Thank you for your detailed report and testing, that is indeed very helpful. We will definitely have to do something. Here is what I think would be potentially doable. a) if we detect a Lustre file system without flock support, we can printout an error message. Completely disabling MPI I/O is on the ompio architecture not possible at the moment, since the Lustre component can disqualify itself, but the generic Unix FS component would kick in in that case, and still continue execution. To be more precise, the query function of the Lustre component has no way to return anything than "I am interested to run" or "I am not interested to run" b) I can add an MCA parameter that would allow the Lustre component to abort execution of the job entirely. While this parameter would probably be by default set to 'false', a system administrator could configure it to be set to 'true' an particular platform. I will discuss this also with a couple of other people in the next couple of days. Thanks Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave > Love > Sent: Monday, October 15, 2018 4:22 AM > To: Open MPI Users > Subject: Re: [OMPI users] ompio on Lustre > > For what it's worth, I found the following from running ROMIO's tests with > OMPIO on Lustre mounted without flock (or localflock). I used 48 processes > on two nodes with Lustre for tests which don't require a specific number. > > OMPIO fails tests atomicity, misc, and error on ext4; it additionally fails > noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock. > > On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig, > shared_fp, ordered_fp, and error. > > Please can OMPIO be changed to fail in the same way as ROMIO (with a clear > message) for the operations it can't support without flock. > Otherwise it looks as if you can potentially get invalid data, or at least > waste > time debugging other errors. 
> I'd debug the common failure on the "error" test, but ptrace is disabled on
> the system.
>
> In case anyone else is in the same boat and can't get mounts changed, I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit. Of course other
> libraries will potentially corrupt data on noflock mounts.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
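[Editor's note] Since the discussion above hinges on whether a Lustre mount carries the flock or localflock option, a quick sanity check against the option string in /proc/mounts can save a debugging round-trip. A minimal sketch — the helper name `lustre_flock_enabled` is made up for illustration; flock/localflock/noflock are the real Lustre mount options:

```shell
# Hypothetical helper: given a mount-option string (field 4 of
# /proc/mounts), report whether Lustre application file locking is
# enabled via the flock or localflock mount option.
lustre_flock_enabled() {
    case ",$1," in
        *,flock,*|*,localflock,*) echo yes ;;
        *)                        echo no  ;;
    esac
}

# To check live mounts, feed it the options of each Lustre entry, e.g.:
#   awk '$3 == "lustre" { print $2, $4 }' /proc/mounts
lustre_flock_enabled "rw,seclabel,flock,lazystatfs"   # -> yes
lustre_flock_enabled "rw,noflock,lazystatfs"          # -> no
```

A "no" here means OMPIO's shared file pointer and atomicity operations (and ROMIO's equivalents) are the ones at risk, per the test results above.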
Re: [OMPI users] ompio on Lustre
Well, good question. To be fair, the test passes if you run it with a lower number of processes. In addition, a couple of years back I had a discussion about this with one of the HDF5 developers, and it seemed to be OK to run it this way. That being said, after thinking about it a bit, I think the fix to properly support it is at this point relatively easy, and I will try to make it work in the next couple of days (a big chunk of code was brought in for another fix last fall, and I think we now have everything in place to properly support the atomicity operations).

Edgar

> -----Original Message-----
> From: Dave Love [mailto:dave.l...@manchester.ac.uk]
> Sent: Wednesday, October 10, 2018 3:46 AM
> To: Gabriel, Edgar
> Cc: Open MPI Users
> Subject: Re: [OMPI users] ompio on Lustre
>
> "Gabriel, Edgar" writes:
>
> > Ok, thanks. I usually run these tests with 4 or 8, but the major item
> > is that atomicity is one of the areas that are not well supported in
> > ompio (along with data representations), so a failure in those tests
> > is not entirely surprising.
>
> If it's not expected to work, could it be made to return a helpful error,
> rather than just not working properly?

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] ompio on Lustre
Ok, thanks. I usually run these tests with 4 or 8 processes, but the major item is that atomicity is one of the areas that are not well supported in ompio (along with data representations), so a failure in those tests is not entirely surprising. Most of the work to support atomicity properly is actually in place, but we didn't have the manpower (and requests, to be honest) to finish that work.

Thanks
Edgar

> -----Original Message-----
> From: Dave Love [mailto:dave.l...@manchester.ac.uk]
> Sent: Tuesday, October 9, 2018 7:05 AM
> To: Gabriel, Edgar
> Cc: Open MPI Users
> Subject: Re: [OMPI users] ompio on Lustre
>
> "Gabriel, Edgar" writes:
>
> > Hm, thanks for the report, I will look into this. I did not run the
> > romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
> > should not have any problems on a regular unix fs. How many processes
> > did you use, and which tests did you run specifically? The main tests
> > that I execute from their parallel testsuite are testphdf5 and
> > t_shapesame.
>
> Using OMPI 3.1.2, in the hdf5 testpar directory I ran this as a 24-core SMP
> job (so 24 processes), where $TMPDIR is on ext4:
>
>   export HDF5_PARAPREFIX=$TMPDIR
>   make check RUNPARALLEL='mpirun'
>
> It stopped after testphdf5 spewed "Atomicity Test Failed" errors.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] ompio on Lustre
Hm, thanks for the report, I will look into this. I did not run the romio tests, but the hdf5 tests are run regularly, and with 3.1.2 you should not have any problems on a regular Unix fs. How many processes did you use, and which tests did you run specifically? The main tests that I execute from their parallel testsuite are testphdf5 and t_shapesame. I will also look into the testmpio that you mentioned in the next couple of days.

Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave Love
> Sent: Monday, October 8, 2018 10:20 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] ompio on Lustre
>
> I said I'd report back about trying ompio on Lustre mounted without flock.
>
> I couldn't immediately figure out how to run MTT. I tried the parallel
> hdf5 tests from hdf5 1.10.3, but I got errors with that even with the
> relevant environment variable to put the files on (local) /tmp.
> Then it occurred to me rather late that romio would have tests. Using the
> "runtests" script modified to use "--mca io ompio" in the romio/test
> directory from ompi 3.1.2 on no-flock-mounted Lustre, after building the
> tests with an installed ompi-3.1.2, it did this and apparently hung at the
> end:
>
> Testing simple.c
>    No Errors
> Testing async.c
>    No Errors
> Testing async-multiple.c
>    No Errors
> Testing atomicity.c
> Process 3: readbuf[118] is 0, should be 10
> Process 2: readbuf[65] is 0, should be 10
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> Process 1: readbuf[145] is 0, should be 10
> Testing coll_test.c
>    No Errors
> Testing excl.c
> error opening file test
> error opening file test
> error opening file test
>
> Then I ran on local /tmp as a sanity check and still got errors:
>
> Testing I/O functions
> Testing simple.c
>    No Errors
> Testing async.c
>    No Errors
> Testing async-multiple.c
>    No Errors
> Testing atomicity.c
> Process 2: readbuf[155] is 0, should be 10
> Process 1: readbuf[128] is 0, should be 10
> Process 3: readbuf[128] is 0, should be 10
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> Testing coll_test.c
>    No Errors
> Testing excl.c
>    No Errors
> Testing file_info.c
>    No Errors
> Testing i_noncontig.c
>    No Errors
> Testing noncontig.c
>    No Errors
> Testing noncontig_coll.c
>    No Errors
> Testing noncontig_coll2.c
>    No Errors
> Testing aggregation1
>    No Errors
> Testing aggregation2
>    No Errors
> Testing hindexed
>    No Errors
> Testing misc.c
> file pointer posn = 265, should be 10
> byte offset = 3020, should be 1080
> file pointer posn = 265, should be 10
> byte offset = 3020, should be 1080
> file pointer posn = 265, should be 10
> byte offset = 3020, should be 1080
> file pointer posn in bytes = 3280, should be 1000
> file pointer posn = 265, should be 10
> byte offset = 3020, should be 1080
> file pointer posn in bytes = 3280, should be 1000
> file pointer posn in bytes = 3280, should be 1000
> file pointer posn in bytes = 3280, should be 1000
> Found 12 errors
> Testing shared_fp.c
>    No Errors
> Testing ordered_fp.c
>    No Errors
> Testing split_coll.c
>    No Errors
> Testing psimple.c
>    No Errors
> Testing error.c
> File set view did not return an error
>    Found 1 errors
> Testing status.c
>    No Errors
> Testing types_with_zeros
>    No Errors
> Testing darray_read
>    No Errors
>
> I even got an error with romio on /tmp (modifying the script to use
> mpirun --mca io romio314):
>
> Testing error.c
> Unexpected error message MPI_ERR_ARG: invalid argument of some other kind
>    Found 1 errors

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
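[Editor's note] Transcripts like the one above get long; pulling out just the tests that reported errors makes them easier to compare across runs. A throwaway sketch — `summarize_romio_log` is a made-up name, and it assumes the `Testing <name>` / `Found <n> errors` / `readbuf[...]` line formats shown in the output above:

```shell
# Hypothetical helper: scan a ROMIO runtests transcript and print each
# test name that reported errors, keyed off the "Testing <name>",
# "Found <n> errors", and "readbuf[...]" lines seen in the transcripts.
summarize_romio_log() {
    awk '
        /^ *Testing /          { test = $2 }
        /Found [0-9]+ errors/ || /readbuf\[[0-9]+\] is/ {
            if (test != "" && !(test in bad)) { bad[test] = 1; print test }
        }
    ' "$1"
}

# Example against a captured transcript fragment:
cat > /tmp/romio.log <<'EOF'
Testing misc.c
file pointer posn = 265, should be 10
Found 12 errors
Testing shared_fp.c
   No Errors
Testing error.c
File set view did not return an error
   Found 1 errors
EOF
summarize_romio_log /tmp/romio.log   # -> misc.c, then error.c
```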
Re: [OMPI users] ompio on Lustre
It was originally for performance reasons, but this should be fixed at this point. I am not aware of correctness problems. However, let me try to clarify your question: what do you precisely mean by "MPI I/O on Lustre mounts without flock"? Was the Lustre file system mounted without flock? If yes, that could lead to some problems; we had that on our Lustre installation for a while, and problems occurred even without MPI I/O in that case (although I do not recall all the details, just that we had to change the mount options). Maybe just take a testsuite (either ours or HDF5's), make sure to run it in a multi-node configuration, and see whether it works correctly.

Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave Love
> Sent: Friday, October 5, 2018 5:15 AM
> To: users@lists.open-mpi.org
> Subject: [OMPI users] ompio on Lustre
>
> Is romio preferred over ompio on Lustre for performance or correctness?
> If it's relevant, the context is MPI-IO on Lustre mounts without flock,
> which ompio doesn't seem to require.
> Thanks.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users