Hello,

When I do a run with Open MPI v4.0.0 on InfiniBand with this command:

mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1

I get this error:

#----------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 4
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.14         0.15         0.14
            4         1000         5.00         7.58         6.28
            8         1000         5.13         7.68         6.41
           16         1000         5.05         7.74         6.39
           32         1000         5.43         7.96         6.75
           64         1000         6.78         8.56         7.69
          128         1000         7.77         9.55         8.59
          256         1000         8.28        10.96         9.66
          512         1000         9.19        12.49        10.85
         1024         1000        11.78        15.01        13.38
         2048         1000        17.41        19.51        18.52
         4096         1000        25.73        28.22        26.89
         8192         1000        47.75        49.44        48.79
        16384         1000        81.10        90.15        84.75
        32768         1000       163.01       178.58       173.19
        65536          640       315.63       340.51       333.18
       131072          320       475.48       528.82       510.85
       262144          160       979.70      1063.81      1035.61
       524288           80      2070.51      2242.58      2150.15
      1048576           40      4177.36      4527.25      4431.65
      2097152           20      8738.08      9340.50      9147.89
[pandora:04500] *** Process received signal ***
[pandora:04500] Signal: Segmentation fault (11)
[pandora:04500] Signal code: Address not mapped (1)
[pandora:04500] Failing at address: 0x7f310ebffff0
[pandora:04499] *** Process received signal ***
[pandora:04499] Signal: Segmentation fault (11)
[pandora:04499] Signal code: Address not mapped (1)
[pandora:04499] Failing at address: 0x7f28b11ffff0
[pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
[pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
[pandora:04500] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
[pandora:04500] [ 3] [pandora:04499] [ 0] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
[pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
[pandora:04499] [ 1] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
[pandora:04500] [ 5] IMB-MPI1[0x40b83b]
[pandora:04500] [ 6] IMB-MPI1[0x407155]
[pandora:04500] [ 7] IMB-MPI1[0x4022ea]
[pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
[pandora:04499] [ 2] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
[pandora:04500] [ 9] IMB-MPI1[0x401d49]
[pandora:04500] *** End of error message ***
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
[pandora:04499] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
[pandora:04499] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
[pandora:04499] [ 5] IMB-MPI1[0x40b83b]
[pandora:04499] [ 6] IMB-MPI1[0x407155]
[pandora:04499] [ 7] IMB-MPI1[0x4022ea]
[pandora:04499] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
[pandora:04499] [ 9] IMB-MPI1[0x401d49]
[pandora:04499] *** End of error message ***
[phoebe:03779] *** Process received signal ***
[phoebe:03779] Signal: Segmentation fault (11)
[phoebe:03779] Signal code: Address not mapped (1)
[phoebe:03779] Failing at address: 0x7f483d6ffff0
[phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
[phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
[phoebe:03779] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
[phoebe:03779] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
[phoebe:03779] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
[phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
[phoebe:03779] [ 6] IMB-MPI1[0x407155]
[phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
[phoebe:03779] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
[phoebe:03779] [ 9] IMB-MPI1[0x401d49]
[phoebe:03779] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Also, if I reinstall 3.1.2, I do not have this issue at all.

Any thoughts on what could be the issue?

Thanks,
Adam LeBlanc