Re: [OMPI users] [External] Re: segfault in libibverbs.so

2020-07-28 Thread Prentice Bisbal via users
I've been doing a lot of research on this issue (see my next e-mail on
this topic, which I'll be posting in a few minutes), and OpenMPI will
use either ibverbs or UCX. In OpenMPI 4.0 and later, ibverbs is
deprecated in favor of UCX.
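
If you want to see which of those components a given build actually
contains, a quick check (a sketch, assuming the ompi_info from the same
install is in your PATH) is:

ompi_info | grep -i -E 'ucx|openib'

A 4.0.x build with UCX support should list MCA components such as pml
ucx and/or btl uct; openib will also show up if the build was configured
with verbs support.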


Prentice

On 7/27/20 7:49 PM, gil...@rist.or.jp wrote:

Prentice,

ibverbs might be used by UCX (either pml/ucx or btl/uct),
so to be 100% sure, you should run

mpirun --mca pml ob1 --mca btl ^openib,uct ...

In order to force btl/tcp, you need to ensure pml/ob1 is used,
and you always need the btl/self component:

mpirun --mca pml ob1 --mca btl tcp,self ...
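
If you want to double-check which components actually get selected, you
can also raise the selection verbosity (a sketch; the exact diagnostic
output differs between releases):

mpirun --mca pml ob1 --mca btl tcp,self --mca pml_base_verbose 10 --mca btl_base_verbose 10 ...

The verbose output should show pml/ob1 and the tcp/self btls being
chosen, with no mention of openib, uct or ucx.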

Cheers,

Gilles

- Original Message -

Can anyone explain why my job still calls libibverbs when I run it
with '-mca btl ^openib'?

If I instead use '-mca btl tcp', my jobs don't segfault. I would assume
'-mca btl ^openib' and '-mca btl tcp' to be essentially equivalent, but
there's obviously a difference between the two.

Prentice

On 7/23/20 3:34 PM, Prentice Bisbal wrote:

I manage a cluster that is very heterogeneous. Some nodes have
InfiniBand, while others have 10 Gb/s Ethernet. We recently upgraded
to CentOS 7 and built a new software stack for CentOS 7. We are using
OpenMPI 4.0.3, and we are using Slurm 19.05.5 as our job scheduler.

We just noticed that when jobs are sent to the nodes with IB, they
segfault immediately, with the segfault appearing to come from
libibverbs.so. This is what I see in the stderr output for one of
these failed jobs:

srun: error: greene021: tasks 0-3: Segmentation fault

And here is what I see in the log messages of the compute node where
that segfault happened:

Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at 7f0635f38910 ip 7f0635f49405 sp 7ffe354485a0 error 4
Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at 7f23d51ea910 ip 7f23d51fb405 sp 7ffef250a9a0 error 4
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7f23d51ec000+18000]
Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at 7ff504ba5910 ip 7ff504bb6405 sp 7917ccb0 error 4
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7ff504ba7000+18000]
Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at 7fa58abc5910 ip 7fa58abd6405 sp 7ffdde50c0d0 error 4
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7fa58abc7000+18000]
Jul 23 15:19:41 greene021 kernel: in libibverbs.so.1.5.22.4[7f0635f3a000+18000]

Any idea what is going on here, or how to debug further? I've been
using OpenMPI for years, and it usually just works.

I normally start my job with srun like this:

srun ./mpihello

But even if I try to take IB out of the equation by starting the job
like this:

mpirun -mca btl ^openib ./mpihello

I still get a segfault issue, although the message to stderr is now a
little different:



--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 8502 on node greene021
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


The segfaults happen immediately. They seem to occur as soon as
MPI_Init() is called. The program I'm running is a very simple MPI
"Hello world!" program.

The output of ompi_info is below my signature, in case that helps.

Prentice

$ ompi_info
                  Package: Open MPI u...@host.example.com Distribution
                 Open MPI: 4.0.3
   Open MPI repo revision: v4.0.3
    Open MPI release date: Mar 03, 2020
                 Open RTE: 4.0.3
   Open RTE repo revision: v4.0.3
    Open RTE release date: Mar 03, 2020
                     OPAL: 4.0.3
       OPAL repo revision: v4.0.3
        OPAL release date: Mar 03, 2020
                  MPI API: 3.1.0
             Ident string: 4.0.3
                   Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
  Configured architecture: x86_64-unknown-linux-gnu
           Configure host: dawson027.pppl.gov
            Configured by: lglant
            Configured on: Mon Jun  1 12:37:07 EDT 2020
           Configure host: dawson027.pppl.gov
   Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
                           '--with-ucx' '--with-verbs' '--with-libfabric'
                           '--with-libevent=/usr'
                           '--with-libevent-libdir=/usr/lib64'
                           '--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
                 Built by: lglant
                 Built on: Mon Jun  1 13:05:40 EDT 2020
               Built host: dawson027.pppl.gov
               C bindings: yes
             C++ bindings: no

[OMPI users] WARNING: There was an error initializing an OpenFabrics device

2020-07-28 Thread Prentice Bisbal via users
Last week I posted here that I was getting immediate segfaults when I
ran MPI programs, that the system logs showed the segfaults were
occurring in libibverbs.so, and that the problem was still occurring
even if I specified '-mca btl ^openib'.


Since then, I've made a lot of progress on the problem, and my jobs now
run, but I'm getting this error sent to standard error:


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, compiled 
with GCC 9.3.0.


While researching the immediate segfault issue, I came across this Red 
Hat Bug Report:


https://bugzilla.redhat.com/show_bug.cgi?id=1754099

According to that bug report, there was a regression in the version of
UCX provided with CentOS 7.8 (UCX 1.5.2-1.el7), and the workaround was
downgrading to the UCX package that came with CentOS 7.7 (UCX
1.4.0-1.el7). Suspecting this might be the cause of my problem, I did
the same.
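
(The downgrade itself was just a stock yum operation; a sketch, with
the exact package set and version depending on your mirrors, run as
root on the affected nodes:

yum downgrade ucx-1.4.0-1.el7
)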


After the downgrade, my jobs still segfaulted, but at least I now got a 
backtrace showing that the segfault was happening in UCX.


Now I suspected a bug in UCX itself, so I went to the UCX website and
installed the latest stable version (1.8.1) by building the SRPM
provided there:


https://github.com/openucx/ucx/releases/tag/v1.8.1
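
For reference, building and installing from that SRPM was the usual
rpmbuild flow (a sketch; the SRPM file name is approximate and the
output path assumes the default rpmbuild layout):

rpmbuild --rebuild ucx-1.8.1-1.src.rpm
yum install ~/rpmbuild/RPMS/x86_64/ucx-1.8.1*.rpm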

After that, my application runs, but I get the error message above 
(repeated here):


WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

Googling for that error message, I came across this OpenMPI bug discussion:

https://github.com/open-mpi/ompi/issues/6517

According to this, if I rebuild OpenMPI with the option
'--without-verbs', that message will go away. I tried that, but I am
still getting the error message. Here's the configure command line,
taken from ompi_info:


Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3' 
'--with-ucx' '--without-verbs' '--with-libfabric' '--with-libevent=/usr' 
'--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5' 
'--with-pmi'


I have two questions:

1. How can I be sure that this message is really just a result of the
old openib code (as stated in the OpenMPI bug discussion above), and
that my job is actually using InfiniBand with UCX? (One possible check
is sketched after these questions.)


2. If the message above is harmless, how can I make it go away so my 
users don't see it?
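
For #1, one check (a sketch; these MCA parameters are standard, but the
exact diagnostic output varies by Open MPI release) is to restrict the
PML to UCX and turn up the selection verbosity, so the run fails loudly
if UCX is not actually usable:

mpirun --mca pml ucx --mca pml_base_verbose 10 ./mpihello

If pml/ucx cannot be selected on the IB nodes, this aborts instead of
silently falling back to ob1, which at least tells me whether UCX is in
play.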


If you've made it this far, thanks for reading my whole message. Any 
help will be greatly appreciated!


--
Prentice



Re: [OMPI users] WARNING: There was an error initializing an OpenFabrics device

2020-07-28 Thread Prentice Bisbal via users

One more bit of information: These are QLogic IB cards, not Mellanox:

$ lspci | grep QL
05:00.0 InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02)
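
If it's useful, the verbs-level view of the adapter can be dumped with
ibv_devinfo (from libibverbs-utils); on these nodes it should report the
qib0 device named in the warning above:

ibv_devinfo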



--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov