Re: [OMPI users] ompio on Lustre

2018-10-16 Thread Dave Love
"Gabriel, Edgar"  writes:

> a) if we detect a Lustre file system without flock support, we can
> print out an error message. Completely disabling MPI I/O is not
> possible in the ompio architecture at the moment: the Lustre
> component can disqualify itself, but the generic Unix FS component
> would then kick in and continue execution. To be more precise, the
> query function of the Lustre component has no way to return anything
> other than "I am interested to run" or "I am not interested to run".
>
> b) I can add an MCA parameter that would allow the Lustre component to
> abort execution of the job entirely. While this parameter would
> probably be set to 'false' by default, a system administrator could
> configure it to 'true' on a particular platform.

Assuming the operations which didn't fail for me are actually OK with
noflock (and maybe they're not in other circumstances), can't you just
do the same as ROMIO and fail with an explanation on just the ones that
will fail without flock?  That seems the best from a user's point of
view if there's an advantage to using OMPIO rather than ROMIO.

I guess it might be clear which operations are problematic if I
understood what in fs/lustre requires flock mounts, and what the full
semantics of the option are, which seem to go beyond what is documented.

Thanks for looking into it, anyhow.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] ompio on Lustre

2018-10-16 Thread Dave Love
"Latham, Robert J."  writes:

> it's hard to implement fcntl-lock-free versions of Atomic mode and
> Shared file pointer so file systems like PVFS don't support those modes
> (and return an error indicating such at open time).

Ah.  For some reason I thought PVFS had the support to pass the tests
somehow, but it's been quite a while since I used it.

> You can run lock-free for noncontiguous writes, though at a significant
> performance cost.  In ROMIO we can disable data sieving write by
> setting the hint "romio_ds_write" to "disable", which will fall back to
> piece-wise operations.  Could be OK if you know your noncontiguous
> accesses are only a little bit noncontiguous.

Does that mean it could actually support more operations (without
failing due to missing flock)?

Of course, I realize one should just use flock mounts with Lustre, as I
used to.  I don't remember this stuff being written down explicitly
anywhere, though -- is it somewhere?
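For what it's worth, whether a given client mount has the option can at least be checked directly; a sketch (the mount point and the example line shown are hypothetical):

```shell
# Look for flock/localflock among the client's Lustre mount options
grep ' lustre ' /proc/mounts
# e.g.: 10.0.0.1@tcp:/fs /lustre lustre rw,flock,lazystatfs 0 0
```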

Thanks for the info.


Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Gabriel, Edgar
Dave,
Thank you for your detailed report and testing, that is indeed very helpful. We 
will definitely have to do something.
Here is what I think would be potentially doable.

a) if we detect a Lustre file system without flock support, we can print out an 
error message. Completely disabling MPI I/O is not possible in the ompio 
architecture at the moment: the Lustre component can disqualify itself, but 
the generic Unix FS component would then kick in and continue execution. To be 
more precise, the query function of the Lustre component has no way to return 
anything other than "I am interested to run" or "I am not interested to run".
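As an illustration of the flock-support detection in (a), a minimal probe in plain Python (this is not ompio code; the errno values are what one would expect a noflock mount to return for fcntl byte-range locks):

```python
import errno
import fcntl
import os
import tempfile

def supports_fcntl_locks(directory):
    """Try to take and release an fcntl write lock on a scratch file in
    `directory`; return False if the filesystem refuses byte-range locks."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX)   # the lock type ROMIO relies on
            fcntl.lockf(fd, fcntl.LOCK_UN)
        except OSError as e:
            if e.errno in (errno.ENOLCK, errno.EOPNOTSUPP):
                return False                 # e.g. Lustre mounted without flock
            raise
        return True
    finally:
        os.close(fd)
        os.unlink(path)

print(supports_fcntl_locks(tempfile.gettempdir()))
```

On a local filesystem such as /tmp this prints True; on a Lustre mount without flock or localflock the lockf call fails.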

b) I can add an MCA parameter that would allow the Lustre component to abort 
execution of the job entirely. While this parameter would probably be set to 
'false' by default, a system administrator could configure it to 'true' on a 
particular platform.
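If such a parameter were added, setting it would presumably look like the following; the parameter name here is purely illustrative and does not exist in any release:

```shell
# Hypothetical parameter name -- shown only to illustrate how an MCA
# setting of this kind would be used, per run:
mpirun --mca io ompio --mca fs_lustre_abort_on_noflock 1 -np 48 ./io_app
# or system-wide, in $PREFIX/etc/openmpi-mca-params.conf:
#   fs_lustre_abort_on_noflock = 1
```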

I will discuss this also with a couple of other people in the next couple of 
days.
Thanks
Edgar 

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Monday, October 15, 2018 4:22 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> For what it's worth, I found the following from running ROMIO's tests with
> OMPIO on Lustre mounted without flock (or localflock).  I used 48 processes
> on two nodes with Lustre for tests which don't require a specific number.
> 
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally fails
> noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.
> 
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
> 
> Please can OMPIO be changed to fail in the same way as ROMIO (with a clear
> message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at least 
> waste
> time debugging other errors.
> 
> I'd debug the common failure on the "error" test, but ptrace is disabled on 
> the
> system.
> 
> In case anyone else is in the same boat and can't get mounts changed, I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit.  Of course other
> libraries will potentially corrupt data on nolock mounts.


Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Latham, Robert J.
On Mon, 2018-10-15 at 12:21 +0100, Dave Love wrote:
> For what it's worth, I found the following from running ROMIO's tests
> with OMPIO on Lustre mounted without flock (or localflock).  I used
> 48
> processes on two nodes with Lustre for tests which don't require a
> specific number.
> 
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally
> fails noncontig_coll2, fp, shared_fp, and ordered_fp on
> Lustre/noflock.
> 
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
> 
> Please can OMPIO be changed to fail in the same way as ROMIO (with a
> clear message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at
> least waste time debugging other errors.
> 
> I'd debug the common failure on the "error" test, but ptrace is
> disabled
> on the system.
> 
> In case anyone else is in the same boat and can't get mounts changed,
> I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit.  Of course
> other libraries will potentially corrupt data on nolock mounts.

ROMIO uses fcntl locks for Atomic mode, Shared file pointer updates,
and to prevent false sharing in the data sieving optimization for
noncontiguous writes.

it's hard to implement fcntl-lock-free versions of Atomic mode and
Shared file pointer so file systems like PVFS don't support those modes
(and return an error indicating such at open time).
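The shared file pointer is essentially a fetch-and-add on a small piece of shared state, which is why it needs a lock; a plain-Python sketch of the pattern (illustrative only, not ROMIO's implementation):

```python
import fcntl
import os
import struct
import tempfile

def fetch_and_add_shared_offset(path, nbytes):
    """Atomically advance the shared file pointer stored in `path` by
    `nbytes` and return the old value -- the read-modify-write that
    ROMIO serializes with an fcntl byte-range lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)        # serialize concurrent updaters
        raw = os.pread(fd, 8, 0)
        offset = struct.unpack("<q", raw)[0] if len(raw) == 8 else 0
        os.pwrite(fd, struct.pack("<q", offset + nbytes), 0)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return offset
    finally:
        os.close(fd)

state = os.path.join(tempfile.mkdtemp(), "shared_fp")
print([fetch_and_add_shared_offset(state, 100) for _ in range(3)])  # [0, 100, 200]
```

Without the lock, two ranks can read the same old offset and write to the same region; that is the failure mode a noflock mount exposes.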

You can run lock-free for noncontiguous writes, though at a significant
performance cost.  In ROMIO we can disable data sieving write by
setting the hint "romio_ds_write" to "disable", which will fall back to
piece-wise operations.  Could be OK if you know your noncontiguous
accesses are only a little bit noncontiguous.
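A sketch of the trade-off (plain Python, not ROMIO code): a data-sieving write services a noncontiguous request with one read-modify-write of the covering extent, which is the step that needs locking against concurrent writers, while the piece-wise fallback issues one small write per contiguous piece:

```python
import os
import tempfile

def write_noncontig_sieved(fd, pieces):
    """One read-modify-write over the covering extent (this is what
    needs a lock against concurrent writers to the same extent)."""
    lo = min(off for off, _ in pieces)
    hi = max(off + len(data) for off, data in pieces)
    buf = bytearray(os.pread(fd, hi - lo, lo))
    buf.extend(b"\0" * ((hi - lo) - len(buf)))   # extent may pass EOF
    for off, data in pieces:
        buf[off - lo:off - lo + len(data)] = data
    os.pwrite(fd, bytes(buf), lo)                # single large write

def write_noncontig_piecewise(fd, pieces):
    """One syscall per contiguous piece: lock-free, but many small I/Os."""
    for off, data in pieces:
        os.pwrite(fd, data, off)

fd = os.open(tempfile.mktemp(), os.O_RDWR | os.O_CREAT, 0o600)
write_noncontig_sieved(fd, [(0, b"aa"), (6, b"bb")])
```

Both produce the same bytes on disk; the sieved version does one large read and one large write instead of many small ones, at the price of needing the lock.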

Perhaps OMPIO has a similar option, but I am not familiar with its
tuning knobs.

==rob



Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Dave Love
For what it's worth, I found the following from running ROMIO's tests
with OMPIO on Lustre mounted without flock (or localflock).  I used 48
processes on two nodes with Lustre for tests which don't require a
specific number.

OMPIO fails tests atomicity, misc, and error on ext4; it additionally
fails noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.

On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
shared_fp, ordered_fp, and error.

Please can OMPIO be changed to fail in the same way as ROMIO (with a
clear message) for the operations it can't support without flock.
Otherwise it looks as if you can potentially get invalid data, or at
least waste time debugging other errors.

I'd debug the common failure on the "error" test, but ptrace is disabled
on the system.

In case anyone else is in the same boat and can't get mounts changed, I
suggested staging data to and from a PVFS2^WOrangeFS ephemeral
filesystem on jobs' TMPDIR local mounts if they will fit.  Of course
other libraries will potentially corrupt data on nolock mounts.
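The staging workaround amounts to copying in and out around the MPI-IO phase; a sketch (all paths and the application name are hypothetical):

```shell
# Stage input from the noflock Lustre mount to node-local scratch,
# run the MPI-IO phase there, and stage results back.
scratch="${TMPDIR:-/tmp}/stage.$$"
mkdir -p "$scratch"
cp /lustre/project/input.dat "$scratch/"          # hypothetical paths
( cd "$scratch" && mpirun ./mpiio_app input.dat output.dat )
cp "$scratch/output.dat" /lustre/project/
rm -rf "$scratch"
```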


Re: [OMPI users] ompio on Lustre

2018-10-10 Thread Gabriel, Edgar
Well, good question. To be fair, the test passes if you run it with a lower 
number of processes. In addition, a couple of years back I had a discussion 
about this with one of the HDF5 developers, and it seemed to be OK to run it 
this way.

That being said, after thinking about it a bit, I think the fix to properly 
support it is relatively easy at this point; I will try to make it work in the 
next couple of days. (There was a big chunk of code brought in for another fix 
last fall, and I think we actually have everything in place to properly 
support the atomicity operations.)

Edgar

> -Original Message-
> From: Dave Love [mailto:dave.l...@manchester.ac.uk]
> Sent: Wednesday, October 10, 2018 3:46 AM
> To: Gabriel, Edgar 
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> "Gabriel, Edgar"  writes:
> 
> Ok, thanks. I usually run these tests with 4 or 8, but the major item
> is that atomicity is one of the areas that are not well supported in
> ompio (along with data representations), so a failure in those tests
> is not entirely surprising.
> 
> If it's not expected to work, could it be made to return a helpful error, 
> rather
> than just not working properly?


Re: [OMPI users] ompio on Lustre

2018-10-10 Thread Dave Love
"Gabriel, Edgar"  writes:

> Ok, thanks. I usually run these tests with 4 or 8, but the major item
> is that atomicity is one of the areas that are not well supported in
> ompio (along with data representations), so a failure in those tests
> is not entirely surprising.

If it's not expected to work, could it be made to return a helpful
error, rather than just not working properly?


Re: [OMPI users] ompio on Lustre

2018-10-09 Thread Gabriel, Edgar
Ok, thanks. I usually run these tests with 4 or 8, but the major item is that 
atomicity is one of the areas that are not well supported in ompio (along with 
data representations), so a failure in those tests is not entirely surprising. 
Most of the work to support atomicity properly is actually in place, but we 
didn't have the manpower (or, to be honest, the requests) to finish that work.

Thanks
Edgar 


> -Original Message-
> From: Dave Love [mailto:dave.l...@manchester.ac.uk]
> Sent: Tuesday, October 9, 2018 7:05 AM
> To: Gabriel, Edgar 
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> "Gabriel, Edgar"  writes:
> 
> > Hm, thanks for the report, I will look into this. I did not run the
> > romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
> > should not have any problems on a regular unix fs. How many processes
> > did you use, and which tests did you run specifically? The main tests
> > that I execute from their parallel testsuite are testphdf5 and
> > t_shapesame.
> 
> Using OMPI 3.1.2, in the hdf5 testpar directory I ran this as a 24-core SMP 
> job
> (so 24 processes), where $TMPDIR is on ext4:
> 
>   export HDF5_PARAPREFIX=$TMPDIR
>   make check RUNPARALLEL='mpirun'
> 
> It stopped after testphdf5 spewed "Atomicity Test Failed" errors.


Re: [OMPI users] ompio on Lustre

2018-10-09 Thread Dave Love
"Gabriel, Edgar"  writes:

> Hm, thanks for the report, I will look into this. I did not run the
> romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
> should not have any problems on a regular unix fs. How many processes
> did you use, and which tests did you run specifically? The main tests
> that I execute from their parallel testsuite are testphdf5 and
> t_shapesame.

Using OMPI 3.1.2, in the hdf5 testpar directory I ran this as a 24-core
SMP job (so 24 processes), where $TMPDIR is on ext4:

  export HDF5_PARAPREFIX=$TMPDIR
  make check RUNPARALLEL='mpirun'

It stopped after testphdf5 spewed "Atomicity Test Failed" errors.


Re: [OMPI users] ompio on Lustre

2018-10-08 Thread Gabriel, Edgar
Hm, thanks for the report, I will look into this. I did not run the romio 
tests, but the hdf5 tests are run regularly and with 3.1.2 you should not have 
any problems on a regular unix fs. How many processes did you use, and which 
tests did you run specifically? The main tests that I execute from their 
parallel testsuite are testphdf5 and t_shapesame.

I will also look into the testmpio that you mentioned in the next couple of 
days.
Thanks
Edgar


> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Monday, October 8, 2018 10:20 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> I said I'd report back about trying ompio on lustre mounted without flock.
> 
> I couldn't immediately figure out how to run MTT.  I tried the parallel
> hdf5 tests from hdf5 1.10.3, but I got errors with that even with the
> relevant environment variable to put the files on (local) /tmp.
> Then it occurred to me rather late that romio would have tests.  Using the
> "runtests" script modified to use "--mca io ompio" in the romio/test directory
> from ompi 3.1.2 on no-flock-mounted Lustre, after building the tests with an
> installed ompi-3.1.2, it did this and apparently hung at the end:
> 
>    Testing simple.c 
>No Errors
>    Testing async.c 
>No Errors
>    Testing async-multiple.c 
>No Errors
>    Testing atomicity.c 
>   Process 3: readbuf[118] is 0, should be 10
>   Process 2: readbuf[65] is 0, should be 10
>   --
>   MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
>   with errorcode 1.
> 
>   NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>   You may or may not see output from other processes, depending on
>   exactly when Open MPI kills them.
>   --
>   Process 1: readbuf[145] is 0, should be 10
>    Testing coll_test.c 
>No Errors
>    Testing excl.c 
>   error opening file test
>   error opening file test
>   error opening file test
> 
> Then I ran on local /tmp as a sanity check and still got errors:
> 
>    Testing I/O functions 
>    Testing simple.c 
>No Errors
>    Testing async.c 
>No Errors
>    Testing async-multiple.c 
>No Errors
>    Testing atomicity.c 
>   Process 2: readbuf[155] is 0, should be 10
>   Process 1: readbuf[128] is 0, should be 10
>   Process 3: readbuf[128] is 0, should be 10
>   --
>   MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
>   with errorcode 1.
> 
>   NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>   You may or may not see output from other processes, depending on
>   exactly when Open MPI kills them.
>   --
>    Testing coll_test.c 
>No Errors
>    Testing excl.c 
>No Errors
>    Testing file_info.c 
>No Errors
>    Testing i_noncontig.c 
>No Errors
>    Testing noncontig.c 
>No Errors
>    Testing noncontig_coll.c 
>No Errors
>    Testing noncontig_coll2.c 
>No Errors
>    Testing aggregation1 
>No Errors
>    Testing aggregation2 
>No Errors
>    Testing hindexed 
>No Errors
>    Testing misc.c 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   Found 12 errors
>    Testing shared_fp.c 
>No Errors
>    Testing ordered_fp.c 
>No Errors
>    Testing split_coll.c 
>No Errors
>    Testing psimple.c 
>No Errors
>    Testing error.c 
>   File set view did not return an error
>Found 1 errors
>    Testing status.c 
>No Errors
>    Testing types_with_zeros 
>No Errors
>    Testing darray_read 
>

Re: [OMPI users] ompio on Lustre

2018-10-08 Thread Dave Love
I said I'd report back about trying ompio on lustre mounted without flock.

I couldn't immediately figure out how to run MTT.  I tried the parallel
hdf5 tests from hdf5 1.10.3, but I got errors with that even with
the relevant environment variable to put the files on (local) /tmp.
Then it occurred to me rather late that romio would have tests.  Using
the "runtests" script modified to use "--mca io ompio" in the romio/test
directory from ompi 3.1.2 on no-flock-mounted Lustre, after building the
tests with an installed ompi-3.1.2, it did this and apparently hung at
the end:

   Testing simple.c 
   No Errors
   Testing async.c 
   No Errors
   Testing async-multiple.c 
   No Errors
   Testing atomicity.c 
  Process 3: readbuf[118] is 0, should be 10
  Process 2: readbuf[65] is 0, should be 10
  --
  MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
  with errorcode 1.
  
  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --
  Process 1: readbuf[145] is 0, should be 10
   Testing coll_test.c 
   No Errors
   Testing excl.c 
  error opening file test
  error opening file test
  error opening file test

Then I ran on local /tmp as a sanity check and still got errors:

   Testing I/O functions 
   Testing simple.c 
   No Errors
   Testing async.c 
   No Errors
   Testing async-multiple.c 
   No Errors
   Testing atomicity.c 
  Process 2: readbuf[155] is 0, should be 10
  Process 1: readbuf[128] is 0, should be 10
  Process 3: readbuf[128] is 0, should be 10
  --
  MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
  with errorcode 1.
  
  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --
   Testing coll_test.c 
   No Errors
   Testing excl.c 
   No Errors
   Testing file_info.c 
   No Errors
   Testing i_noncontig.c 
   No Errors
   Testing noncontig.c 
   No Errors
   Testing noncontig_coll.c 
   No Errors
   Testing noncontig_coll2.c 
   No Errors
   Testing aggregation1 
   No Errors
   Testing aggregation2 
   No Errors
   Testing hindexed 
   No Errors
   Testing misc.c 
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn in bytes = 3280, should be 1000
  
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn in bytes = 3280, should be 1000
  
  file pointer posn in bytes = 3280, should be 1000
  
  file pointer posn in bytes = 3280, should be 1000
  
  Found 12 errors
   Testing shared_fp.c 
   No Errors
   Testing ordered_fp.c 
   No Errors
   Testing split_coll.c 
   No Errors
   Testing psimple.c 
   No Errors
   Testing error.c 
  File set view did not return an error
   Found 1 errors
   Testing status.c 
   No Errors
   Testing types_with_zeros 
   No Errors
   Testing darray_read 
   No Errors

I even got an error with romio on /tmp (modifying the script to use
mpirun --mca io romio314):

   Testing error.c 
  Unexpected error message MPI_ERR_ARG: invalid argument of some other kind
   Found 1 errors


Re: [OMPI users] ompio on Lustre

2018-10-05 Thread Dave Love
"Gabriel, Edgar"  writes:

> It was originally for performance reasons, but this should be fixed at
> this point. I am not aware of correctness problems.
>
> However, let me try to clarify your question: what precisely do you
> mean by "MPI I/O on Lustre mounts without flock"? Was the
> Lustre filesystem mounted without flock?

No, it wasn't (and romio complains).

> If yes, that could lead to
> some problems; we had that on our Lustre installation for a while, and
> problems were occurring even without MPI I/O in that case (although I
> do not recall all the details, just that we had to change the mount
> options).

Yes, without at least localflock you might expect problems with things
like bdb and sqlite, but I couldn't see any file-locking calls in the
Lustre component.  If it is a problem, shouldn't the component fail
without it, like romio does?

I have suggested ephemeral PVFS^WOrangeFS but I doubt that will be
thought useful.

> Maybe just take a testsuite (either ours or HDF5), make sure
> to run it in a multi-node configuration and see whether it works
> correctly.

For some reason I didn't think MTT, if that's what you mean, was
available, but I see it is; I'll see if I can drive it when I have a
chance.  Tests from HDF5 might be easiest, thanks for the suggestion.
I'd tried with ANL's "testmpio", which was the only thing I found
immediately, but it threw up errors even on a local filesystem, at which
stage I thought it was best to ask...  I'll report back if I get useful
results.


Re: [OMPI users] ompio on Lustre

2018-10-05 Thread Gabriel, Edgar
It was originally for performance reasons, but this should be fixed at this 
point. I am not aware of correctness problems.

However, let me try to clarify your question: what precisely do you mean by 
"MPI I/O on Lustre mounts without flock"? Was the Lustre filesystem mounted 
without flock? If yes, that could lead to some problems; we had that on our 
Lustre installation for a while, and problems were occurring even without MPI 
I/O in that case (although I do not recall all the details, just that we had 
to change the mount options). Maybe just take a testsuite (either ours or HDF5), 
make sure to run it in a multi-node configuration and see whether it works 
correctly.

Thanks
Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Friday, October 5, 2018 5:15 AM
> To: users@lists.open-mpi.org
> Subject: [OMPI users] ompio on Lustre
> 
> Is romio preferred over ompio on Lustre for performance or correctness?
> If it's relevant, the context is MPI-IO on Lustre mounts without flock, which
> ompio doesn't seem to require.
> Thanks.


[OMPI users] ompio on Lustre

2018-10-05 Thread Dave Love
Is romio preferred over ompio on Lustre for performance or correctness?
If it's relevant, the context is MPI-IO on Lustre mounts without flock,
which ompio doesn't seem to require.
Thanks.
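For anyone reproducing the comparison, the I/O component can be forced per run; in the Open MPI 3.1.x series the ROMIO component is named romio314:

```shell
ompi_info | grep ' io '                    # list available io components
mpirun --mca io ompio    -np 4 ./io_test   # force OMPIO
mpirun --mca io romio314 -np 4 ./io_test   # force ROMIO (3.1.x component name)
```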