[OMPI users] SC'18 PMIx BoF meeting
Hello all

[I'm sharing this on the OMPI mailing lists (as well as the PMIx one), as PMIx has become tightly integrated into the OMPI code since v2.0 was released.]

The PMIx Community will once again be hosting a Birds-of-a-Feather meeting at SuperComputing. This year, however, will be a little different! PMIx has come a long, long way over the last four years, and we are starting to see application-level adoption of the various APIs. Accordingly, we will be devoting most of this year's meeting to a tutorial-like review of several use cases, including:

* fault-tolerant OpenSHMEM implementation
* inter-library resource coordination using OpenMP and MPI
* population modeling and swarm-intelligence models running natively in an HPC environment
* use of the PMIx_Query interface

The meeting has been shifted to Wednesday night, 5:15-6:45pm, in room C144. Please share this with others who you feel might be interested, and do plan to attend!

Ralph

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] ompio on Lustre
Dave,

Thank you for your detailed report and testing; that is indeed very helpful. We will definitely have to do something. Here is what I think would be potentially doable:

a) If we detect a Lustre file system without flock support, we can print out an error message. Completely disabling MPI I/O is not possible in the current ompio architecture: the Lustre component can disqualify itself, but the generic Unix FS component would then kick in and execution would still continue. To be more precise, the query function of the Lustre component has no way to return anything other than "I am interested to run" or "I am not interested to run".

b) I can add an MCA parameter that would allow the Lustre component to abort execution of the job entirely. While this parameter would probably default to 'false', a system administrator could set it to 'true' on a particular platform.

I will also discuss this with a couple of other people in the next couple of days.

Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave Love
> Sent: Monday, October 15, 2018 4:22 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] ompio on Lustre
>
> For what it's worth, I found the following from running ROMIO's tests with
> OMPIO on Lustre mounted without flock (or localflock). I used 48 processes
> on two nodes with Lustre for tests which don't require a specific number.
>
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally fails
> noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.
>
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
>
> Please can OMPIO be changed to fail in the same way as ROMIO (with a clear
> message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at least
> waste time debugging other errors.
>
> I'd debug the common failure on the "error" test, but ptrace is disabled on
> the system.
>
> In case anyone else is in the same boat and can't get mounts changed, I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit. Of course other
> libraries will potentially corrupt data on nolock mounts.
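Edgar's option (b) would presumably behave like any other MCA parameter, settable site-wide or per job. A minimal sketch of what that could look like; note the parameter name "fs_lustre_abort_on_noflock" is invented here for illustration, as no such parameter exists yet:

```shell
# Sketch of Edgar's proposed opt-in abort. The parameter name
# "fs_lustre_abort_on_noflock" is hypothetical -- Open MPI does not
# define it at the time of writing.
conf=$(mktemp)                                     # stand-in for etc/openmpi-mca-params.conf
echo "fs_lustre_abort_on_noflock = true" >> "$conf"
grep "abort_on_noflock" "$conf"                    # confirm the site-wide setting took

# Per job, the same knob would be passed on the command line, e.g.:
#   mpirun --mca fs_lustre_abort_on_noflock true -n 48 ./mpiio_app
```

Defaulting it to 'false' keeps existing jobs running unchanged, while letting a site that knows its Lustre mounts lack flock fail fast instead of risking corrupt I/O.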
Re: [OMPI users] ompio on Lustre
On Mon, 2018-10-15 at 12:21 +0100, Dave Love wrote:
> For what it's worth, I found the following from running ROMIO's tests
> with OMPIO on Lustre mounted without flock (or localflock). I used 48
> processes on two nodes with Lustre for tests which don't require a
> specific number.
>
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally
> fails noncontig_coll2, fp, shared_fp, and ordered_fp on
> Lustre/noflock.
>
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
>
> Please can OMPIO be changed to fail in the same way as ROMIO (with a
> clear message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at
> least waste time debugging other errors.
>
> I'd debug the common failure on the "error" test, but ptrace is
> disabled on the system.
>
> In case anyone else is in the same boat and can't get mounts changed,
> I suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit. Of course
> other libraries will potentially corrupt data on nolock mounts.

ROMIO uses fcntl locks for atomic mode, for shared file pointer updates, and to prevent false sharing in the data sieving optimization for noncontiguous writes. It's hard to implement fcntl-lock-free versions of atomic mode and shared file pointers, so file systems like PVFS don't support those modes (and return an error indicating as much at open time).

You can run lock-free for noncontiguous writes, though at a significant performance cost. In ROMIO we can disable data sieving writes by setting the hint "romio_ds_write" to "disable", which falls back to piece-wise operations. That could be OK if you know your noncontiguous accesses are only a little bit noncontiguous. Perhaps OMPIO has a similar option, but I am not familiar with its tuning knobs.

==rob
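ROMIO hints like "romio_ds_write" can be supplied without recompiling, via a hints file named by the ROMIO_HINTS environment variable (they can equally be set in code with MPI_Info_set before MPI_File_open). A minimal sketch of disabling data-sieving writes this way:

```shell
# Disable ROMIO's data-sieving write optimization for a job without
# touching source code: ROMIO reads hint name/value pairs from the
# file named by $ROMIO_HINTS.
hints=$(mktemp)
printf 'romio_ds_write disable\n' > "$hints"
export ROMIO_HINTS="$hints"
cat "$ROMIO_HINTS"
#   romio_ds_write disable
```

The same file can carry other hints, one "name value" pair per line, so a batch script can switch I/O strategies per job.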
Re: [OMPI users] ompio on Lustre
For what it's worth, I found the following from running ROMIO's tests with OMPIO on Lustre mounted without flock (or localflock). I used 48 processes on two nodes with Lustre for tests which don't require a specific number.

OMPIO fails tests atomicity, misc, and error on ext4; it additionally fails noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.

On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig, shared_fp, ordered_fp, and error.

Please can OMPIO be changed to fail in the same way as ROMIO (with a clear message) for the operations it can't support without flock. Otherwise it looks as if you can potentially get invalid data, or at least waste time debugging other errors.

I'd debug the common failure on the "error" test, but ptrace is disabled on the system.

In case anyone else is in the same boat and can't get mounts changed, I suggested staging data to and from a PVFS2^WOrangeFS ephemeral filesystem on jobs' TMPDIR local mounts if they will fit. Of course other libraries will potentially corrupt data on nolock mounts.
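Since the failures above hinge on whether the Lustre mount carries the flock (or localflock) option, a quick sanity check is to inspect the mount options. The helper below is a sketch that parses lines in /proc/mounts format; the sample lines are illustrative, not taken from a real system:

```shell
# Report whether a mount-table line (in /proc/mounts format:
# device mountpoint fstype options dump pass) includes flock/localflock.
has_flock() {
  opts=$(echo "$1" | awk '{print $4}')   # 4th field holds the option list
  case ",$opts," in
    *,flock,*|*,localflock,*) echo "flock enabled" ;;
    *)                        echo "no flock" ;;
  esac
}

# Illustrative mount lines -- on a live system you would instead do:
#   grep ' lustre ' /proc/mounts | while read line; do has_flock "$line"; done
has_flock "10.0.0.1@tcp:/fs /lustre lustre rw,flock 0 0"     # -> flock enabled
has_flock "10.0.0.1@tcp:/fs /lustre lustre rw,noatime 0 0"   # -> no flock
```

If the check reports "no flock", staging to a locally mounted filesystem (as suggested above) or asking the administrators to remount with flock are the realistic options.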