Hi Open MPI Users,

I'd like to report a bug we have seen with Open MPI 3.1.3 and 4.0.0 when 
using the Intel 2019 Update 1 compilers on our Skylake/Omni-Path 1 cluster. 
The bug occurs when running the src_c variant of the Intel MPI Benchmarks 
(IMB) from GitHub master.

Configuration: 

./configure \
    --prefix=/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144 \
    --with-slurm --with-psm2 \
    CC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icc \
    CXX=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/icpc \
    FC=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.1.144/linux/bin/intel64/ifort \
    --with-zlib=/home/projects/x86-64/zlib/1.2.11 \
    --with-valgrind=/home/projects/x86-64/valgrind/3.13.0

The operating system is Red Hat Enterprise Linux 7.4, and we use a local 
build of GCC 7.2.0 to provide the C++ headers for the Intel compilers. 
Everything builds correctly and passes make check without any issues.
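
For reference, we build IMB against this installation roughly as follows 
(the repository URL is the upstream Intel one; the exact Makefile target 
inside src_c is from memory and may differ slightly):

    git clone https://github.com/intel/mpi-benchmarks.git
    cd mpi-benchmarks/src_c
    # build with the Open MPI compiler wrapper from the installation above
    make CC=mpicc IMB-MPI1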

We then compile IMB and run IMB-MPI1 across 24 nodes, 1,152 MPI ranks in 
total (judging by the process counts reported below).
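The launch is along these lines (the exact mpirun flags are from memory 
and may differ; we run under Slurm):

    mpirun -np 1152 ./IMB-MPI1

The Reduce_scatter benchmark gets partway through the message sizes and 
then crashes: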

#----------------------------------------------------------------
# Benchmarking Reduce_scatter
# #processes = 64
# ( 1088 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.18         0.19         0.18
            4         1000         7.39        10.37         8.68
            8         1000         7.84        11.14         9.23
           16         1000         8.50        12.37        10.14
           32         1000        10.37        14.66        12.15
           64         1000        13.76        18.82        16.17
          128         1000        21.63        27.61        24.87
          256         1000        39.98        47.27        43.96
          512         1000        72.93        78.59        75.15
         1024         1000       147.21       152.98       149.94
         2048         1000       413.41       426.90       420.15
         4096         1000       421.28       442.58       434.52
         8192         1000       418.31       450.20       438.51
        16384         1000      1082.85      1221.44      1140.92
        32768         1000      2434.11      2529.90      2476.72
        65536          640      5469.57      6048.60      5687.08
       131072          320     11702.94     12435.06     12075.07
       262144          160     19214.42     20433.83     19883.80
       524288           80     49462.22     53896.43     52101.56
      1048576           40    119422.53    131922.20    126920.99
      2097152           20    256345.97    288185.72    275767.05
[node06:351648] *** Process received signal ***
[node06:351648] Signal: Segmentation fault (11)
[node06:351648] Signal code: Invalid permissions (2)
[node06:351648] Failing at address: 0x7fdb6efc4000
[node06:351648] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7fdb8646c5e0]
[node06:351648] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
[node06:351648] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7fdb858d847a]
[node06:351648] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7fdb86c43b29]
[node06:351648] [ 4] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7fdb86c1de67]
[node06:351648] [ 5] ./IMB-MPI1[0x40d624]
[node06:351648] [ 6] ./IMB-MPI1[0x407d16]
[node06:351648] [ 7] ./IMB-MPI1[0x403356]
[node06:351648] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdb860bbc05]
[node06:351648] [ 9] ./IMB-MPI1[0x402da9]
[node06:351648] *** End of error message ***
[node06:351649] *** Process received signal ***
[node06:351649] Signal: Segmentation fault (11)
[node06:351649] Signal code: Invalid permissions (2)
[node06:351649] Failing at address: 0x7f9b19c6f000
[node06:351649] [ 0] /lib64/libpthread.so.0(+0xf5e0)[0x7f9b311295e0]
[node06:351649] [ 1] ./IMB-MPI1(__intel_avx_rep_memcpy+0x140)[0x415380]
[node06:351649] [ 2] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libopen-pal.so.40(opal_datatype_copy_content_same_ddt+0xca)[0x7f9b3059547a]
[node06:351649] [ 3] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x3f9)[0x7f9b31900b29]
[node06:351649] [ 4] 
/home/projects/x86-64-skylake/openmpi/3.1.3/intel/19.1.144/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1d7)[0x7f9b318dae67]
[node06:351649] [ 5] ./IMB-MPI1[0x40d624]
[node06:351649] [ 6] ./IMB-MPI1[0x407d16]
[node06:351649] [node06:351657] *** Process received signal ***
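
The crash is inside Open MPI's ring reduce_scatter implementation 
(ompi_coll_base_reduce_scatter_intra_ring), reached from the Intel memcpy 
intrinsic during a datatype copy. A possible triage step, which we have 
not yet verified, is to steer the tuned collective component away from 
the ring algorithm and see whether another reduce_scatter implementation 
survives (the MCA parameter names assume the coll/tuned component; the 
algorithm numbering is from memory):

    # force a non-default reduce_scatter algorithm in coll/tuned
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_reduce_scatter_algorithm 2 \
           -np 1152 ./IMB-MPI1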


-- 
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
 
