Re: [OMPI users] Deadlock in netcdf tests

2019-10-26 Thread Orion Poplawski via users

Okay, I've filed:

https://github.com/open-mpi/ompi/issues/7109  - deadlock

and

https://github.com/open-mpi/ompi/issues/7110  - ompio error

I've found the hdf5 and netcdf testsuites quite adept at finding issues 
with openmpi over the years.


Thanks again for the help.

On 10/26/19 6:01 AM, Gabriel, Edgar wrote:

Orion,
It might be a good idea. This bug is triggered from the fcoll/two_phase 
component (and having spent just two minutes looking at it, I have a 
suspicion about what triggers it, namely an int vs. long conversion issue), 
so it is probably unrelated to the other one.

I need to add the netcdf test cases to my list of standard testsuites, 
but we never used to have any problems with them :-(
Thanks for the report, we will be working on them!

Edgar



-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Orion
Poplawski via users
Sent: Friday, October 25, 2019 10:21 PM
To: Open MPI Users 
Cc: Orion Poplawski 
Subject: Re: [OMPI users] Deadlock in netcdf tests

Thanks for the response; the workaround helps.

With that out of the way I see:

+ mpiexec -n 4 ./tst_parallel4
Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >=
num_aggregators(1)fd_size=461172966257152 off=4156705856
Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >=
num_aggregators(1)fd_size=4611731477435006976 off=4157193280

Should I file issues for both of these?
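
For reference, the run above can be repeated with the I/O frameworks'
verbosity turned up to see which ompio/fcoll components get selected
(just a sketch, assuming the usual per-framework "_base_verbose" MCA
parameters):

mpiexec -n 4 --mca io_base_verbose 100 --mca fcoll_base_verbose 100 ./tst_parallel4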

On 10/25/19 2:29 AM, Gilles Gouaillardet via users wrote:

Orion,


thanks for the report.


I can confirm this is indeed an Open MPI bug.

FWIW, a workaround is to disable the fcoll/vulcan component.

That can be achieved by

mpirun --mca fcoll ^vulcan ...

or

OMPI_MCA_fcoll=^vulcan mpirun ...
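
To keep the workaround in place for a whole test-suite run, the same
setting can also go into the per-user MCA parameter file, or be exported
once for the shell driving the tests (a sketch; the file path assumes a
default Open MPI setup):

echo "fcoll = ^vulcan" >> ~/.openmpi/mca-params.conf

or

export OMPI_MCA_fcoll=^vulcan
mpiexec -n 4 ./tst_parallel3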


I also noted the tst_parallel3 program crashes with the ROMIO component.


Cheers,


Gilles

On 10/25/2019 12:55 PM, Orion Poplawski via users wrote:

On 10/24/19 9:28 PM, Orion Poplawski via users wrote:

Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are
seeing a test hang with openmpi 4.0.2. Backtrace:

(gdb) bt
#0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
#1  0x7f90c1ac8a05 in ompi_request_default_wait () from
/usr/lib64/openmpi/lib/libmpi.so.40
#2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
/usr/lib64/openmpi/lib/libmpi.so.40
#3  0x7f90c1b2bb73 in
ompi_coll_base_allreduce_intra_recursivedoubling () from
/usr/lib64/openmpi/lib/libmpi.so.40
#4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
/usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
#5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from
/usr/lib64/openmpi/lib/libmca_common_ompio.so.41
#6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from
/usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
#7  0x7f90c1af033f in PMPI_File_write_at_all () from
/usr/lib64/openmpi/lib/libmpi.so.40
#8  0x7f90c1627d7b in H5FD_mpio_write () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#9  0x7f90c14636ee in H5FD_write () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#10 0x7f90c1442eb3 in H5F__accum_write () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#11 0x7f90c1543729 in H5PB_write () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#12 0x7f90c144d69c in H5F_block_write () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#13 0x7f90c161cd10 in H5C_apply_candidate_list () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#14 0x7f90c161ad02 in H5AC__run_sync_point () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#15 0x7f90c161bd4f in H5AC__flush_entries () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#16 0x7f90c13b154d in H5AC_flush () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#18 0x7f90c1448e64 in H5F__flush () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#20 0x7f90c144f171 in H5F_flush_mounts () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#21 0x7f90c143e3a5 in H5Fflush () from
/usr/lib64/openmpi/lib/libhdf5.so.103
#22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
../../libhdf5/hdf5file.c:222
#23 0x7f90c1c1816e in NC4_enddef (ncid=<optimized out>) at
../../libhdf5/hdf5file.c:544
#24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at
../../libdispatch/dfile.c:1004
#25 0x56527d0def27 in test_pio (flag=0) at
../../nc_test4/tst_parallel3.c:206
#26 0x56527d0de62c in main (argc=<optimized out>,
argv=<optimized out>) at ../../nc_test4/tst_parallel3.c:91

The processes are running full out.

Suggestions for debugging this would be greatly appreciated.
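
The backtrace above is from gdb; something along these lines grabs
backtraces from every rank of the hung run at once (a sketch, assuming
gdb and pgrep are available and ptrace attach is permitted):

for pid in $(pgrep -f tst_parallel3); do
  gdb -batch -p "$pid" -ex 'thread apply all bt' > "bt.$pid.txt" 2>&1
done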



Some more info - I now think this depends more on the openmpi version
than on netcdf itself:

- The last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx
1.5.2, and pmix 3.1.4. The failure possibly started with openmpi
4.0.2-rc1 and ucx 1.6.0.

- netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2,
ucx 1.6.1, pmix 3.1.4

- net
