On the tcp side it no longer seg faults, but some tests time out. On the
openib side it still seg faults. Here is the output:

[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
[pandora:19256] Failing at address: 0x7f911c69fff0
[pandora:19255] *** Process received signal ***
[pandora:19255] Signal: Segmentation fault (11)
[pandora:19255] Signal code: Address not mapped (1)
[pandora:19255] Failing at address: 0x7ff09cd3fff0
[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
[pandora:19256] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
[pandora:19256] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
[pandora:19256] [ 4] [pandora:19255] [ 0]
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
[pandora:19255] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
[pandora:19256] [ 5] IMB-MPI1[0x40b83b]
[pandora:19256] [ 6] IMB-MPI1[0x407155]
[pandora:19256] [ 7] IMB-MPI1[0x4022ea]
[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
[pandora:19255] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
[pandora:19256] [ 9] IMB-MPI1[0x401d49]
[pandora:19256] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
[pandora:19255] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
[pandora:19255] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
[pandora:19255] [ 5] IMB-MPI1[0x40b83b]
[pandora:19255] [ 6] IMB-MPI1[0x407155]
[pandora:19255] [ 7] IMB-MPI1[0x4022ea]
[pandora:19255] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
[pandora:19255] [ 9] IMB-MPI1[0x401d49]
[pandora:19255] *** End of error message ***
[phoebe:12418] *** Process received signal ***
[phoebe:12418] Signal: Segmentation fault (11)
[phoebe:12418] Signal code: Address not mapped (1)
[phoebe:12418] Failing at address: 0x7f5ce27dfff0
[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
[phoebe:12418] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
[phoebe:12418] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
[phoebe:12418] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
[phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
[phoebe:12418] [ 6] IMB-MPI1[0x407155]
[phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
[phoebe:12418] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
[phoebe:12418] [ 9] IMB-MPI1[0x401d49]
[phoebe:12418] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node pandora exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
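
Since the traces from the different ranks are interleaved, they are hard to
read by eye. Below is a minimal sketch for summarizing them, assuming the
output above has been saved to a text file. The function name `summarize` and
both regular expressions are illustrative choices of mine, not part of Open MPI
or IMB. It tallies the named symbols in the backtrace frames and lists which
host:pid pairs reported a signal, without trying to de-interleave the lines.

```python
import re
from collections import Counter

# Named frames look like "(PMPI_Reduce_scatter+0x1c7)". Anonymous frames
# such as "(+0xf680)" carry no symbol name and are deliberately skipped.
SYMBOL_RE = re.compile(r"\(([A-Za-z_]\w*)\+0x[0-9a-fA-F]+\)")

# The "[host:pid]" tag that directly precedes the signal banner.
CRASH_RE = re.compile(
    r"\[([^\[\]:\s]+):(\d+)\] \*\*\* Process received signal \*\*\*"
)


def summarize(log_text):
    """Return (Counter of symbol names, set of 'host:pid' that crashed)."""
    symbols = Counter(SYMBOL_RE.findall(log_text))
    crashed = {f"{host}:{pid}" for host, pid in CRASH_RE.findall(log_text)}
    return symbols, crashed
```

Calling `summarize(open(path).read())` on the saved log (where `path` is
whatever file name was used) should show that the only named frames are
ompi_coll_base_reduce_scatter_intra_ring, PMPI_Reduce_scatter and
__libc_start_main, which is consistent with every rank failing at the same
place in the ring reduce_scatter code path.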

- Adam LeBlanc

On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Can you try the latest 4.0.x nightly snapshot and see if the problem still
> occurs?
>
>     https://www.open-mpi.org/nightly/v4.0.x/
>
>
> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
> >
> > I do. Here is the output:
> >
> > 2 total processes killed (some possibly by mpirun during cleanup)
> > [pandora:12238] *** Process received signal ***
> > [pandora:12238] Signal: Segmentation fault (11)
> > [pandora:12238] Signal code: Invalid permissions (2)
> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> > [pandora:12237] Signal code: Invalid permissions (2)
> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> > [pandora:12238] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> > [pandora:12238] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> > [pandora:12238] [ 6] IMB-MPI1[0x407155]
> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> > [pandora:12238] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
> > [pandora:12238] [ 9] IMB-MPI1[0x401d49]
> > [pandora:12238] *** End of error message ***
> > [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
> > [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
> > [pandora:12237] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
> > [pandora:12237] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
> > [pandora:12237] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
> > [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
> > [pandora:12237] [ 6] IMB-MPI1[0x407155]
> > [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
> > [pandora:12237] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
> > [pandora:12237] [ 9] IMB-MPI1[0x401d49]
> > [pandora:12237] *** End of error message ***
> > [phoebe:07408] *** Process received signal ***
> > [phoebe:07408] Signal: Segmentation fault (11)
> > [phoebe:07408] Signal code: Invalid permissions (2)
> > [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
> > [titan:07169] *** Process received signal ***
> > [titan:07169] Signal: Segmentation fault (11)
> > [titan:07169] Signal code: Invalid permissions (2)
> > [titan:07169] Failing at address: 0x7fc01295fff0
> > [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
> > [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
> > [phoebe:07408] [ 2] [titan:07169] [ 0]
> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
> > [titan:07169] [ 1]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
> > [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
> > [titan:07169] [ 2]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
> > [phoebe:07408] [ 4]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
> > [titan:07169] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
> > [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
> > [phoebe:07408] [ 6] IMB-MPI1[0x407155]
> >
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
> > [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
> > [phoebe:07408] [ 8]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
> > [titan:07169] [ 5] IMB-MPI1[0x40b83b]
> > [titan:07169] [ 6] IMB-MPI1[0x407155]
> > [titan:07169] [ 7]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
> > [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
> > [phoebe:07408] *** End of error message ***
> > IMB-MPI1[0x4022ea]
> > [titan:07169] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
> > [titan:07169] [ 9] IMB-MPI1[0x401d49]
> > [titan:07169] *** End of error message ***
> >
> --------------------------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> >
> --------------------------------------------------------------------------
> >
> --------------------------------------------------------------------------
> > mpirun noticed that process rank 0 with PID 0 on node pandora exited on
> signal 11 (Segmentation fault).
> >
> --------------------------------------------------------------------------
> >
> >
> > - Adam LeBlanc
> >
> > On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <hpprit...@gmail.com>
> wrote:
> > Hi Adam,
> >
> > As a sanity check, if you try to use --mca btl self,vader,tcp
> >
> > do you still see the segmentation fault?
> >
> > Howard
> >
> >
> > On Wed, Feb 20, 2019 at 08:50, Adam LeBlanc <
> alebl...@iol.unh.edu> wrote:
> > Hello,
> >
> > When I do a run with Open MPI v4.0.0 on InfiniBand with this command:
> >
> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> >
> > I get this error:
> >
> > #----------------------------------------------------------------
> > # Benchmarking Reduce_scatter
> > # #processes = 4
> > # ( 2 additional processes waiting in MPI_Barrier)
> > #----------------------------------------------------------------
> >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> >             0         1000         0.14         0.15         0.14
> >             4         1000         5.00         7.58         6.28
> >             8         1000         5.13         7.68         6.41
> >            16         1000         5.05         7.74         6.39
> >            32         1000         5.43         7.96         6.75
> >            64         1000         6.78         8.56         7.69
> >           128         1000         7.77         9.55         8.59
> >           256         1000         8.28        10.96         9.66
> >           512         1000         9.19        12.49        10.85
> >          1024         1000        11.78        15.01        13.38
> >          2048         1000        17.41        19.51        18.52
> >          4096         1000        25.73        28.22        26.89
> >          8192         1000        47.75        49.44        48.79
> >         16384         1000        81.10        90.15        84.75
> >         32768         1000       163.01       178.58       173.19
> >         65536          640       315.63       340.51       333.18
> >        131072          320       475.48       528.82       510.85
> >        262144          160       979.70      1063.81      1035.61
> >        524288           80      2070.51      2242.58      2150.15
> >       1048576           40      4177.36      4527.25      4431.65
> >       2097152           20      8738.08      9340.50      9147.89
> > [pandora:04500] *** Process received signal ***
> > [pandora:04500] Signal: Segmentation fault (11)
> > [pandora:04500] Signal code: Address not mapped (1)
> > [pandora:04500] Failing at address: 0x7f310ebffff0
> > [pandora:04499] *** Process received signal ***
> > [pandora:04499] Signal: Segmentation fault (11)
> > [pandora:04499] Signal code: Address not mapped (1)
> > [pandora:04499] Failing at address: 0x7f28b11ffff0
> > [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> > [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> > [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> > [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> > [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> > [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> > [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> > [pandora:04500] [ 6] IMB-MPI1[0x407155]
> > [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> > [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> > [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> > [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> > [pandora:04500] *** End of error message ***
> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> > [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> > [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> > [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> > [pandora:04499] [ 6] IMB-MPI1[0x407155]
> > [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> > [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> > [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> > [pandora:04499] *** End of error message ***
> > [phoebe:03779] *** Process received signal ***
> > [phoebe:03779] Signal: Segmentation fault (11)
> > [phoebe:03779] Signal code: Address not mapped (1)
> > [phoebe:03779] Failing at address: 0x7f483d6ffff0
> > [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> > [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> > [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> > [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> > [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> > [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> > [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> > [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
> > [phoebe:03779] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
> > [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
> > [phoebe:03779] *** End of error message ***
> >
> --------------------------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> >
> --------------------------------------------------------------------------
> >
> --------------------------------------------------------------------------
> > mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib
> exited on signal 11 (Segmentation fault).
> >
> --------------------------------------------------------------------------
> >
> > Also, if I reinstall 3.1.2, I do not have this issue at all.
> >
> > Any thoughts on what could be the issue?
> >
> > Thanks,
> > Adam LeBlanc
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
