Hello Adam,

This helps some.  Could you post the first 20 lines of your config.log?  That
will help in trying to reproduce the problem.  The contents of your hostfile
(you can use generic names for the nodes if publicizing them is an issue)
would also help, as the number of nodes and the number of MPI processes per
node affect the way the reduce-scatter operation works.
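
In case it is useful, here is a rough sketch of how both pieces of
information could be gathered. The function names are made up for this
example, and the "slots=2" values in the usage note are placeholders, not
taken from your setup.

```shell
# Print the first 20 lines of config.log (default: ./config.log in the
# Open MPI build directory).
show_config_header() {
    head -n 20 "${1:-config.log}"
}

# Read a hostfile on stdin and replace each real host name with a generic
# one (node1, node2, ...). Everything after the name (e.g. "slots=2") is
# kept, and a repeated host always maps to the same generic name.
anonymize_hostfile() {
    awk '
        NF == 0 || $1 ~ /^#/ { print; next }   # keep blank lines and comments
        {
            if (!($1 in seen)) seen[$1] = "node" (++n)
            $1 = seen[$1]
            print
        }'
}
```

For example, "anonymize_hostfile < /home/aleblanc/ib-mpi-hosts" would turn
lines such as "pandora slots=2" and "phoebe slots=2" into "node1 slots=2"
and "node2 slots=2", which still shows the node count and processes per node.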

One thing to note about the openib BTL - it is on life support.   That's
why you needed to set btl_openib_allow_ib 1 on the mpirun command line.
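
As a side note, MCA parameters given with --mca can also be set through
environment variables with the OMPI_MCA_ prefix. A minimal sketch, assuming
you want the same component selection as in your original command:

```shell
# Same effect as "--mca btl_openib_allow_ib 1" on the mpirun command line.
export OMPI_MCA_btl_openib_allow_ib=1

# Same component selection as in the original command.
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=openib,vader,self

# The run itself then needs fewer options (not executed here):
#   mpirun --map-by node -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
```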

You may get much better success by installing UCX
<https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use
UCX.  You may actually already have UCX installed on your system if
a recent version of MOFED is installed.

You can check this by running /usr/bin/ofed_rpm_info, which shows which
UCX version has been installed.  If UCX is installed, you can add --with-ucx
to the Open MPI configure line and it should build in UCX support.  If Open
MPI is built with UCX support, it will by default use UCX for message
transport rather than the openib BTL.
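
The check and rebuild could look roughly like the sketch below. It assumes
the ucx_info tool that ships with UCX is on your PATH when UCX is installed;
the function names and the UCX_INFO override variable are made up for this
example, and the default prefix is simply the install path that appears in
your backtraces.

```shell
# Name of the UCX info tool. UCX_INFO is a hypothetical override for this
# sketch so the check can be tried on machines without UCX.
UCX_INFO="${UCX_INFO:-ucx_info}"

# Succeeds if the UCX command-line tool is found on PATH.
have_ucx() {
    command -v "$UCX_INFO" >/dev/null 2>&1
}

# Print the configure arguments to use. The first argument is the install
# prefix; --with-ucx is added only when UCX appears to be installed.
ompi_configure_args() {
    prefix="${1:-/opt/openmpi/4.0.0}"
    if have_ucx; then
        echo "--prefix=$prefix --with-ucx"
    else
        echo "--prefix=$prefix"
    fi
}

# Typical use from the Open MPI source tree (not executed here):
#   ./configure $(ompi_configure_args) && make && make install
#   ompi_info | grep -i ucx     # check that UCX components were built
#   mpirun --mca pml ucx -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
```

Passing "--mca pml ucx" is not required when UCX is picked by default, but
it makes the run fail loudly if the UCX component is not usable.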

thanks,

Howard


On Wed, Feb 20, 2019 at 12:49 PM Adam LeBlanc <
alebl...@iol.unh.edu> wrote:

> On the TCP side it no longer segfaults, but it times out on some tests.
> On the openib side it still segfaults; here is the output:
>
> [pandora:19256] *** Process received signal ***
> [pandora:19256] Signal: Segmentation fault (11)
> [pandora:19256] Signal code: Address not mapped (1)
> [pandora:19256] Failing at address: 0x7f911c69fff0
> [pandora:19255] *** Process received signal ***
> [pandora:19255] Signal: Segmentation fault (11)
> [pandora:19255] Signal code: Address not mapped (1)
> [pandora:19255] Failing at address: 0x7ff09cd3fff0
> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
> [pandora:19256] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
> [pandora:19256] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
> [pandora:19256] [ 4] [pandora:19255] [ 0]
> /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
> [pandora:19255] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
> [pandora:19256] [ 6] IMB-MPI1[0x407155]
> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
> [pandora:19255] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
> [pandora:19256] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
> [pandora:19255] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
> [pandora:19255] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
> [pandora:19255] [ 6] IMB-MPI1[0x407155]
> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
> [pandora:19255] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
> [pandora:19255] *** End of error message ***
> [phoebe:12418] *** Process received signal ***
> [phoebe:12418] Signal: Segmentation fault (11)
> [phoebe:12418] Signal code: Address not mapped (1)
> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
> [phoebe:12418] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
> [phoebe:12418] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
> [phoebe:12418] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:12418] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
> [phoebe:12418] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> - Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> Can you try the latest 4.0.x nightly snapshot and see if the problem
>> still occurs?
>>
>>     https://www.open-mpi.org/nightly/v4.0.x/
>>
>>
>> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>> >
>> > I do; here is the output:
>> >
>> > 2 total processes killed (some possibly by mpirun during cleanup)
>> > [pandora:12238] *** Process received signal ***
>> > [pandora:12238] Signal: Segmentation fault (11)
>> > [pandora:12238] Signal code: Invalid permissions (2)
>> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
>> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
>> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
>> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
>> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
>> > [pandora:12237] Signal code: Invalid permissions (2)
>> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
>> > [pandora:12238] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
>> > [pandora:12238] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
>> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
>> > [pandora:12238] [ 6] IMB-MPI1[0x407155]
>> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
>> > [pandora:12238] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
>> > [pandora:12238] [ 9] IMB-MPI1[0x401d49]
>> > [pandora:12238] *** End of error message ***
>> > [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
>> > [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
>> > [pandora:12237] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
>> > [pandora:12237] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
>> > [pandora:12237] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
>> > [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
>> > [pandora:12237] [ 6] IMB-MPI1[0x407155]
>> > [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
>> > [pandora:12237] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
>> > [pandora:12237] [ 9] IMB-MPI1[0x401d49]
>> > [pandora:12237] *** End of error message ***
>> > [phoebe:07408] *** Process received signal ***
>> > [phoebe:07408] Signal: Segmentation fault (11)
>> > [phoebe:07408] Signal code: Invalid permissions (2)
>> > [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
>> > [titan:07169] *** Process received signal ***
>> > [titan:07169] Signal: Segmentation fault (11)
>> > [titan:07169] Signal code: Invalid permissions (2)
>> > [titan:07169] Failing at address: 0x7fc01295fff0
>> > [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
>> > [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
>> > [phoebe:07408] [ 2] [titan:07169] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
>> > [titan:07169] [ 1]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
>> > [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
>> > [titan:07169] [ 2]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
>> > [phoebe:07408] [ 4]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
>> > [titan:07169] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
>> > [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
>> > [phoebe:07408] [ 6] IMB-MPI1[0x407155]
>> >
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
>> > [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
>> > [phoebe:07408] [ 8]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
>> > [titan:07169] [ 5] IMB-MPI1[0x40b83b]
>> > [titan:07169] [ 6] IMB-MPI1[0x407155]
>> > [titan:07169] [ 7]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
>> > [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
>> > [phoebe:07408] *** End of error message ***
>> > IMB-MPI1[0x4022ea]
>> > [titan:07169] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
>> > [titan:07169] [ 9] IMB-MPI1[0x401d49]
>> > [titan:07169] *** End of error message ***
>> >
>> --------------------------------------------------------------------------
>> > Primary job  terminated normally, but 1 process returned
>> > a non-zero exit code. Per user-direction, the job has been aborted.
>> >
>> --------------------------------------------------------------------------
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that process rank 0 with PID 0 on node pandora exited on
>> signal 11 (Segmentation fault).
>> >
>> --------------------------------------------------------------------------
>> >
>> >
>> > - Adam LeBlanc
>> >
>> > On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <hpprit...@gmail.com>
>> wrote:
>> > Hi Adam,
>> >
>> > As a sanity check, if you try to use --mca btl self,vader,tcp
>> >
>> > do you still see the segmentation fault?
>> >
>> > Howard
>> >
>> >
>> > On Wed, Feb 20, 2019 at 8:50 AM Adam LeBlanc <
>> alebl...@iol.unh.edu> wrote:
>> > Hello,
>> >
>> > When I do a run with Open MPI v4.0.0 on InfiniBand with this command:
>> >
>> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>> >
>> > I get this error:
>> >
>> > #----------------------------------------------------------------
>> > # Benchmarking Reduce_scatter
>> > # #processes = 4
>> > # ( 2 additional processes waiting in MPI_Barrier)
>> > #----------------------------------------------------------------
>> >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>> >             0         1000         0.14         0.15         0.14
>> >             4         1000         5.00         7.58         6.28
>> >             8         1000         5.13         7.68         6.41
>> >            16         1000         5.05         7.74         6.39
>> >            32         1000         5.43         7.96         6.75
>> >            64         1000         6.78         8.56         7.69
>> >           128         1000         7.77         9.55         8.59
>> >           256         1000         8.28        10.96         9.66
>> >           512         1000         9.19        12.49        10.85
>> >          1024         1000        11.78        15.01        13.38
>> >          2048         1000        17.41        19.51        18.52
>> >          4096         1000        25.73        28.22        26.89
>> >          8192         1000        47.75        49.44        48.79
>> >         16384         1000        81.10        90.15        84.75
>> >         32768         1000       163.01       178.58       173.19
>> >         65536          640       315.63       340.51       333.18
>> >        131072          320       475.48       528.82       510.85
>> >        262144          160       979.70      1063.81      1035.61
>> >        524288           80      2070.51      2242.58      2150.15
>> >       1048576           40      4177.36      4527.25      4431.65
>> >       2097152           20      8738.08      9340.50      9147.89
>> > [pandora:04500] *** Process received signal ***
>> > [pandora:04500] Signal: Segmentation fault (11)
>> > [pandora:04500] Signal code: Address not mapped (1)
>> > [pandora:04500] Failing at address: 0x7f310ebffff0
>> > [pandora:04499] *** Process received signal ***
>> > [pandora:04499] Signal: Segmentation fault (11)
>> > [pandora:04499] Signal code: Address not mapped (1)
>> > [pandora:04499] Failing at address: 0x7f28b11ffff0
>> > [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
>> > [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
>> > [pandora:04500] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
>> > [pandora:04500] [ 3] [pandora:04499] [ 0]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
>> > [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
>> > [pandora:04499] [ 1]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
>> > [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
>> > [pandora:04500] [ 6] IMB-MPI1[0x407155]
>> > [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
>> > [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
>> > [pandora:04499] [ 2]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
>> > [pandora:04500] [ 9] IMB-MPI1[0x401d49]
>> > [pandora:04500] *** End of error message ***
>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
>> > [pandora:04499] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
>> > [pandora:04499] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
>> > [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
>> > [pandora:04499] [ 6] IMB-MPI1[0x407155]
>> > [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
>> > [pandora:04499] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
>> > [pandora:04499] [ 9] IMB-MPI1[0x401d49]
>> > [pandora:04499] *** End of error message ***
>> > [phoebe:03779] *** Process received signal ***
>> > [phoebe:03779] Signal: Segmentation fault (11)
>> > [phoebe:03779] Signal code: Address not mapped (1)
>> > [phoebe:03779] Failing at address: 0x7f483d6ffff0
>> > [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
>> > [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
>> > [phoebe:03779] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
>> > [phoebe:03779] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
>> > [phoebe:03779] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
>> > [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
>> > [phoebe:03779] [ 6] IMB-MPI1[0x407155]
>> > [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
>> > [phoebe:03779] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
>> > [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
>> > [phoebe:03779] *** End of error message ***
>> >
>> --------------------------------------------------------------------------
>> > Primary job  terminated normally, but 1 process returned
>> > a non-zero exit code. Per user-direction, the job has been aborted.
>> >
>> --------------------------------------------------------------------------
>> >
>> --------------------------------------------------------------------------
>> > mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib
>> exited on signal 11 (Segmentation fault).
>> >
>> --------------------------------------------------------------------------
>> >
>> > Also if I reinstall 3.1.2 I do not have this issue at all.
>> >
>> > Any thoughts on what could be the issue?
>> >
>> > Thanks,
>> > Adam LeBlanc
>> > _______________________________________________
>> > users mailing list
>> > users@lists.open-mpi.org
>> > https://lists.open-mpi.org/mailman/listinfo/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>
