[OMPI users] SC'18 PMIx BoF meeting

2018-10-15 Thread Ralph H Castain
Hello all

[I’m sharing this on the OMPI mailing lists (as well as the PMIx one) as PMIx 
has become tightly integrated into the OMPI code since v2.0 was released.]

The PMIx Community will once again be hosting a Birds-of-a-Feather meeting at 
SuperComputing. This year, however, will be a little different! PMIx has come a 
long, long way over the last four years, and we are starting to see 
application-level adoption of the various APIs. Accordingly, we will be 
devoting most of this year’s meeting to a tutorial-like review of several 
use-cases, including:

* fault-tolerant OpenSHMEM implementation
* interlibrary resource coordination using OpenMP and MPI
* population modeling and swarm intelligence models running natively in an HPC 
environment
* use of the PMIx_Query interface (a minimal sketch follows below)
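
To give a flavor of that last item, here is a minimal sketch of what calling
the query interface can look like. This assumes a PMIx v2-style installation;
the spin-wait and the choice of PMIX_QUERY_NAMESPACES as the key are purely
illustrative:

    #include <stdio.h>
    #include <pmix.h>

    /* Completion callback: print the keys the runtime returned. */
    static void query_cb(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                         void *cbdata, pmix_release_cbfunc_t release_fn,
                         void *release_cbdata)
    {
        if (PMIX_SUCCESS == status) {
            for (size_t n = 0; n < ninfo; n++) {
                printf("returned key: %s\n", info[n].key);
            }
        }
        if (NULL != release_fn) {
            release_fn(release_cbdata);
        }
        *(volatile int *)cbdata = 1;  /* flag completion to main() */
    }

    int main(void)
    {
        pmix_proc_t myproc;
        volatile int done = 0;

        if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
            return 1;
        }

        pmix_query_t query;
        PMIX_QUERY_CONSTRUCT(&query);
        /* Ask which namespaces (jobs) the runtime currently knows about.
         * keys points at stack storage here, so we skip PMIX_QUERY_DESTRUCT. */
        char *keys[] = { PMIX_QUERY_NAMESPACES, NULL };
        query.keys = keys;

        if (PMIX_SUCCESS == PMIx_Query_info_nb(&query, 1, query_cb,
                                               (void *)&done)) {
            while (!done) { }  /* crude spin-wait; real code would use a condvar */
        }

        PMIx_Finalize(NULL, 0);
        return 0;
    }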

The meeting has been shifted to Wed night, 5:15-6:45pm, in room C144. Please 
share this with others who you feel might be interested, and do plan to attend!
Ralph


Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Gabriel, Edgar
Dave,
Thank you for your detailed report and testing; that is indeed very helpful. 
We will definitely have to do something. Here is what I think would be 
potentially doable:

a) If we detect a Lustre file system without flock support, we can print an 
error message. Completely disabling MPI I/O is not possible in the ompio 
architecture at the moment: the Lustre component can disqualify itself, but 
the generic Unix FS component would then kick in and execution would still 
continue. To be more precise, the query function of the Lustre component has 
no way to return anything other than "I am interested in running" or "I am 
not interested in running". (A quick standalone check for flock support is 
sketched after point b.)

b) I can add an MCA parameter that would allow the Lustre component to abort 
execution of the job entirely. While this parameter would probably be set to 
'false' by default, a system administrator could configure it to 'true' on a 
particular platform (see the configuration sketch below).
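
As an aside on point a), a quick standalone way to check whether a given
mount actually supports fcntl locks might look like the sketch below. This is
illustrative only (not ompio code); pass it a scratch path on the Lustre
mount:

    /* check_flock.c - does this file system support fcntl() locks? */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s /path/on/lustre/scratchfile\n", argv[0]);
            return 2;
        }
        int fd = open(argv[1], O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            perror("open");
            return 2;
        }
        /* Try to take a one-byte write lock, as MPI-IO shared file
         * pointers and atomic mode do internally. */
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 1 };
        if (fcntl(fd, F_SETLK, &lk) < 0) {
            /* On Lustre mounted without flock/localflock this is the
             * failure that breaks the tests Dave listed. */
            fprintf(stderr, "fcntl lock failed: %s\n", strerror(errno));
            close(fd);
            return 1;
        }
        printf("fcntl locking works on this mount\n");
        lk.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lk);
        close(fd);
        unlink(argv[1]);
        return 0;
    }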
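
And for point b), a system administrator could set such a parameter
system-wide in $prefix/etc/openmpi-mca-params.conf. The parameter name below
is purely hypothetical, since nothing is implemented yet:

    # $prefix/etc/openmpi-mca-params.conf (system-wide MCA defaults)
    # Hypothetical name for the proposed parameter:
    fs_lustre_abort_without_flock = true

Users could still override it per run with the usual
"mpirun --mca <param> <value>" syntax.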

I will also discuss this with a couple of other people in the next few days.
Thanks
Edgar 

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Monday, October 15, 2018 4:22 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> For what it's worth, I found the following from running ROMIO's tests with
> OMPIO on Lustre mounted without flock (or localflock).  I used 48 processes
> on two nodes with Lustre for tests which don't require a specific number.
> 
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally fails
> noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.
> 
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
> 
> Please can OMPIO be changed to fail in the same way as ROMIO (with a clear
> message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at least 
> waste
> time debugging other errors.
> 
> I'd debug the common failure on the "error" test, but ptrace is disabled on 
> the
> system.
> 
> In case anyone else is in the same boat and can't get mounts changed, I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit.  Of course other
> libraries will potentially corrupt data on nolock mounts.


Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Latham, Robert J.
On Mon, 2018-10-15 at 12:21 +0100, Dave Love wrote:
> For what it's worth, I found the following from running ROMIO's tests
> with OMPIO on Lustre mounted without flock (or localflock).  I used 48
> processes on two nodes with Lustre for tests which don't require a
> specific number.
> 
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally
> fails noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.
> 
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
> 
> Please can OMPIO be changed to fail in the same way as ROMIO (with a
> clear message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at
> least waste time debugging other errors.
> 
> I'd debug the common failure on the "error" test, but ptrace is
> disabled on the system.
> 
> In case anyone else is in the same boat and can't get mounts changed, I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit.  Of course
> other libraries will potentially corrupt data on nolock mounts.

ROMIO uses fcntl locks for atomic mode, for shared file pointer updates, and
to prevent false sharing in the data sieving optimization for noncontiguous
writes.

It's hard to implement fcntl-lock-free versions of atomic mode and shared
file pointers, so file systems like PVFS don't support those modes (and
return an error indicating as much at open time).

You can run lock-free for noncontiguous writes, though at a significant
performance cost.  In ROMIO we can disable data sieving for writes by setting
the hint "romio_ds_write" to "disable", which falls back to piece-wise
operations.  That can be OK if you know your noncontiguous accesses are only
a little bit noncontiguous; a minimal sketch follows.
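
For concreteness, here is a minimal sketch of passing that hint through an
MPI info object. The hint only takes effect when ROMIO is the underlying
MPI-IO implementation; the file name and the write are just illustrations:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ask ROMIO to skip data sieving for writes and fall back to
         * piece-wise operations, avoiding the fcntl locks. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_ds_write", "disable");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Each rank writes one int at its own offset; contiguous here,
         * but the hint matters once the file view is noncontiguous. */
        MPI_File_write_at(fh, (MPI_Offset)(rank * sizeof(int)), &rank, 1,
                          MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }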

Perhaps OMPIO has a similar option, but I am not familiar with its
tuning knobs.

==rob



Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Dave Love
For what it's worth, I found the following from running ROMIO's tests
with OMPIO on Lustre mounted without flock (or localflock).  I used 48
processes on two nodes with Lustre for tests which don't require a
specific number.

OMPIO fails tests atomicity, misc, and error on ext4; it additionally
fails noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.

On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
shared_fp, ordered_fp, and error.

Please can OMPIO be changed to fail in the same way as ROMIO (with a
clear message) for the operations it can't support without flock.
Otherwise it looks as if you can potentially get invalid data, or at
least waste time debugging other errors.

I'd debug the common failure on the "error" test, but ptrace is disabled
on the system.

In case anyone else is in the same boat and can't get mounts changed, I
suggested staging data to and from a PVFS2^WOrangeFS ephemeral
filesystem on jobs' TMPDIR local mounts if they will fit.  Of course
other libraries will potentially corrupt data on nolock mounts.