Hi, I am running some tests on a PPC platform that is using LSF and I see the following problem every time I launch a job that runs on 2 nodes or more:

[crest1:49998] *** Process received signal ***
[crest1:49998] Signal: Segmentation fault (11)
[crest1:49998] Signal code: Address not mapped (1)
[crest1:49998] Failing at address: 0x10061636d2d
[crest1:49998] [ 0] [0x100000050478]
[crest1:49998] [ 1] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(+0x0)[0x1000009c0000]
[crest1:49998] [ 2] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so(straddr_isIPv4+0x44)[0x100000e31b64]
[crest1:49998] [ 3] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_array2LIST+0x114)[0x100000be79b4]
[crest1:49998] [ 4] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_constructList+0xfc)[0x100000becdbc]
[crest1:49998] [ 5] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_launch+0x184)[0x100000bed9c4]
[crest1:49998] [ 6] /ccs/home/gvh/install/crest/ompi3_llvm/lib/openmpi/mca_plm_lsf.so(+0x2660)[0x100000992660]
[crest1:49998] [ 7] /ccs/home/gvh/install/crest/ompi3_llvm/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x940)[0x1000001f7730]
[crest1:49998] [ 8] /ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x100013e4]
[crest1:49998] [ 9] /ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x10000f10]
[crest1:49998] [10] /lib64/power8/libc.so.6(+0x24580)[0x1000004f4580]
[crest1:49998] [11] /lib64/power8/libc.so.6(__libc_start_main+0xc4)[0x1000004f4774]
[crest1:49998] *** End of error message ***

I do not see this problem with master, and the only difference in LSF support between master and the v3 branch is this commit:

https://github.com/open-mpi/ompi/commit/92c996487c589ef8558a087ce2a9923dacdf0b99

If I can confirm that this change fixes the problem on the v3 branch, would you accept bringing it into the v3 branch?

Thanks,