Re: [OMPI users] [External] Re: segfault in libibverbs.so
I've been doing a lot of research on this issue (see my next e-mail on this topic, which I'll be posting in a few minutes), and OpenMPI will use either ibverbs or UCX. In OpenMPI 4.0 and later, ibverbs is deprecated in favor of UCX.

Prentice

On 7/27/20 7:49 PM, gil...@rist.or.jp wrote:

Prentice,

ibverbs might be used by UCX (either pml/ucx or btl/uct), so to be 100% sure, you should

mpirun --mca pml ob1 --mca btl ^openib,uct ...

In order to force btl/tcp, you need to ensure pml/ob1 is used, and then you always need the btl/self component:

mpirun --mca pml ob1 --mca btl tcp,self ...

Cheers,

Gilles

- Original Message -

Can anyone explain why my job still calls libibverbs when I run it with '-mca btl ^openib'? If I instead use '-mca btl tcp', my jobs don't segfault. I would assume '-mca btl ^openib' and '-mca btl tcp' to be essentially equivalent, but there's obviously a difference between the two.

Prentice

On 7/23/20 3:34 PM, Prentice Bisbal wrote:

I manage a cluster that is very heterogeneous. Some nodes have InfiniBand, while others have 10 Gb/s Ethernet. We recently upgraded to CentOS 7 and built a new software stack for it. We are using OpenMPI 4.0.3, and we are using Slurm 19.05.5 as our job scheduler.

We just noticed that when jobs are sent to the nodes with IB, they segfault immediately, with the segfault appearing to come from libibverbs.so. This is what I see in the stderr output for one of these failed jobs:

srun: error: greene021: tasks 0-3: Segmentation fault

And here is what I see in the log messages of the compute node where that segfault happened:

Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at 7f0635f38910 ip 7f0635f49405 sp 7ffe354485a0 error 4
Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at 7f23d51ea910 ip 7f23d51fb405 sp 7ffef250a9a0 error 4
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7f23d51ec000+18000]
Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at 7ff504ba5910 ip 7ff504bb6405 sp 7917ccb0 error 4
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7ff504ba7000+18000]
Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at 7fa58abc5910 ip 7fa58abd6405 sp 7ffdde50c0d0 error 4
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7fa58abc7000+18000]
Jul 23 15:19:41 greene021 kernel:
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7f0635f3a000+18000]
Jul 23 15:19:41 greene021 kernel

Any idea what is going on here, or how to debug further? I've been using OpenMPI for years, and it usually just works.

I normally start my job with srun like this:

srun ./mpihello

But even if I try to take IB out of the equation by starting the job like this:

mpirun -mca btl ^openib ./mpihello

I still get a segfault, although the message to stderr is now a little different:

--
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpirun noticed that process rank 1 with PID 8502 on node greene021 exited on signal 11 (Segmentation fault).
--

The segfaults happen immediately. They seem to happen as soon as MPI_Init() is called. The program I'm running is a very simple MPI "Hello world!" program.

The output of ompi_info is below my signature, in case that helps.
Prentice

$ ompi_info
                 Package: Open MPI u...@host.example.com Distribution
                Open MPI: 4.0.3
  Open MPI repo revision: v4.0.3
   Open MPI release date: Mar 03, 2020
                Open RTE: 4.0.3
  Open RTE repo revision: v4.0.3
   Open RTE release date: Mar 03, 2020
                    OPAL: 4.0.3
      OPAL repo revision: v4.0.3
       OPAL release date: Mar 03, 2020
                 MPI API: 3.1.0
            Ident string: 4.0.3
                  Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: dawson027.pppl.gov
           Configured by: lglant
           Configured on: Mon Jun 1 12:37:07 EDT 2020
          Configure host: dawson027.pppl.gov
  Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
                          '--with-ucx' '--with-verbs' '--with-libfabric'
                          '--with-libevent=/usr'
                          '--with-libevent-libdir=/usr/lib64'
                          '--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
                Built by: lglant
                Built on: Mon Jun 1 13:05:40 EDT 2020
              Built host: dawson027.pppl.gov
              C bindings: yes
            C++ bindings: no
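For reference, Gilles's suggestion amounts to the following invocations. The two verbosity parameters in the last command are an addition of mine, not part of the original advice; they should make OpenMPI report which PML and BTL components it actually selects, but treat them as a sketch rather than a verified recipe for this system.

# Rule out ibverbs entirely: force pml/ob1 and exclude the openib and uct BTLs
$ mpirun --mca pml ob1 --mca btl ^openib,uct ./mpihello

# Force plain TCP: pml/ob1 with only the tcp and self BTLs
$ mpirun --mca pml ob1 --mca btl tcp,self ./mpihello

# Optionally print component selection to confirm what was actually chosen
$ mpirun --mca pml ob1 --mca btl tcp,self \
        --mca pml_base_verbose 10 --mca btl_base_verbose 10 ./mpihello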
[OMPI users] WARNING: There was an error initializing an OpenFabrics device
Last week I posted on here that I was getting immediate segfaults when I ran MPI programs, that the system logs showed the segfaults were occurring in libibverbs.so, and that the problem was still occurring even if I specified '-mca btl ^openib'. Since then, I've made a lot of progress on the problem, and now my jobs run, but I'm now getting this error sent to standard error:

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, compiled with GCC 9.3.0.

While researching the immediate segfault issue, I came across this Red Hat bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=1754099

According to that bug report, there was a regression in the version of UCX provided with CentOS 7.8 (UCX 1.5.2-1.el7), and downgrading to the UCX package that came with CentOS 7.7 (UCX 1.4.0-1.el7) was the suggested workaround. Suspecting this might be the cause of my problem, I did the same. After the downgrade, my jobs still segfaulted, but at least I now got a backtrace showing that the segfault was happening in UCX.

Now I suspected a bug in UCX, so I went to the UCX website and installed the latest stable version (1.8.1) by building the SRPM provided there:

https://github.com/openucx/ucx/releases/tag/v1.8.1

After that, my application runs, but I get the error message above (repeated here):

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

Googling for that error message, I came across this OpenMPI bug discussion:

https://github.com/open-mpi/ompi/issues/6517

According to this, if I rebuild OpenMPI with the option '--without-verbs', that message will go away. I tried that, but I am still getting the error message. Here's the configure command line, taken from ompi_info:

Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
                        '--with-ucx' '--without-verbs' '--with-libfabric'
                        '--with-libevent=/usr'
                        '--with-libevent-libdir=/usr/lib64'
                        '--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'

I have two questions:

1. How can I be sure that this message is really just a result of the old openib code (as stated in the OpenMPI bug discussion above), and that my job is actually using InfiniBand with UCX?

2. If the message above is harmless, how can I make it go away so my users don't see it?

If you've made it this far, thanks for reading my whole message. Any help will be greatly appreciated!

--
Prentice
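Regarding question 1, one way to check whether UCX (and the IB device) is really being used is sketched below. These commands are not from the original thread and have not been verified on this exact setup, but ompi_info, ucx_info, and the pml_base_verbose parameter are standard tools: the first confirms the UCX components were built into OpenMPI, the second lists the transports and devices UCX itself detects (qib0 should appear if it is usable), and the verbose run shows which PML is selected at MPI_Init().

# Confirm OpenMPI was built with the UCX components
$ ompi_info | grep -i ucx

# Ask UCX which transports and devices it can see
$ ucx_info -d

# Run with PML selection verbosity to see whether pml/ucx is actually chosen
$ mpirun --mca pml_base_verbose 10 ./mpihello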
Re: [OMPI users] WARNING: There was an error initializing an OpenFabrics device
One more bit of information: these are QLogic IB cards, not Mellanox:

$ lspci | grep QL
05:00.0 InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02)

--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
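As for question 2 from the earlier message: assuming the warning really does come from the leftover openib BTL rather than from UCX, one possible way to hide it from users is to disable that BTL by default, either per user through the environment or site-wide in the MCA parameter file under the install prefix. This is a sketch based on standard OpenMPI configuration mechanisms, not something verified on this cluster.

# Per user / per job: exclude the openib BTL through the environment
$ export OMPI_MCA_btl=^openib
$ srun ./mpihello

# Site-wide default: add the same setting to the MCA parameter file
# (path assumes the install prefix shown in ompi_info)
$ echo "btl = ^openib" >> /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3/etc/openmpi-mca-params.conf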