Hello,

When I run the IMB-MPI1 benchmark with Open MPI v4.0.0 over InfiniBand using
this command:

mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node \
  --mca orte_base_help_aggregate 0 --mca btl openib,vader,self \
  --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6 \
  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1

I get a segmentation fault in the Reduce_scatter benchmark:

#----------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 4
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.14         0.15         0.14
            4         1000         5.00         7.58         6.28
            8         1000         5.13         7.68         6.41
           16         1000         5.05         7.74         6.39
           32         1000         5.43         7.96         6.75
           64         1000         6.78         8.56         7.69
          128         1000         7.77         9.55         8.59
          256         1000         8.28        10.96         9.66
          512         1000         9.19        12.49        10.85
         1024         1000        11.78        15.01        13.38
         2048         1000        17.41        19.51        18.52
         4096         1000        25.73        28.22        26.89
         8192         1000        47.75        49.44        48.79
        16384         1000        81.10        90.15        84.75
        32768         1000       163.01       178.58       173.19
        65536          640       315.63       340.51       333.18
       131072          320       475.48       528.82       510.85
       262144          160       979.70      1063.81      1035.61
       524288           80      2070.51      2242.58      2150.15
      1048576           40      4177.36      4527.25      4431.65
      2097152           20      8738.08      9340.50      9147.89
[pandora:04500] *** Process received signal ***
[pandora:04500] Signal: Segmentation fault (11)
[pandora:04500] Signal code: Address not mapped (1)
[pandora:04500] Failing at address: 0x7f310ebffff0
[pandora:04499] *** Process received signal ***
[pandora:04499] Signal: Segmentation fault (11)
[pandora:04499] Signal code: Address not mapped (1)
[pandora:04499] Failing at address: 0x7f28b11ffff0
[pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
[pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
[pandora:04500] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
[pandora:04500] [ 3] [pandora:04499] [ 0]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
[pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
[pandora:04499] [ 1]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
[pandora:04500] [ 5] IMB-MPI1[0x40b83b]
[pandora:04500] [ 6] IMB-MPI1[0x407155]
[pandora:04500] [ 7] IMB-MPI1[0x4022ea]
[pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
[pandora:04499] [ 2]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
[pandora:04500] [ 9] IMB-MPI1[0x401d49]
[pandora:04500] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
[pandora:04499] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
[pandora:04499] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
[pandora:04499] [ 5] IMB-MPI1[0x40b83b]
[pandora:04499] [ 6] IMB-MPI1[0x407155]
[pandora:04499] [ 7] IMB-MPI1[0x4022ea]
[pandora:04499] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
[pandora:04499] [ 9] IMB-MPI1[0x401d49]
[pandora:04499] *** End of error message ***
[phoebe:03779] *** Process received signal ***
[phoebe:03779] Signal: Segmentation fault (11)
[phoebe:03779] Signal code: Address not mapped (1)
[phoebe:03779] Failing at address: 0x7f483d6ffff0
[phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
[phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
[phoebe:03779] [ 2]
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
[phoebe:03779] [ 3]
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
[phoebe:03779] [ 4]
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
[phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
[phoebe:03779] [ 6] IMB-MPI1[0x407155]
[phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
[phoebe:03779] [ 8]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
[phoebe:03779] [ 9] IMB-MPI1[0x401d49]
[phoebe:03779] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Also, if I reinstall Open MPI 3.1.2, the problem does not occur at all.

Any thoughts on what could be the issue?

Thanks,
Adam LeBlanc