Hello all,

We recently applied the latest Red Hat update (/etc/redhat-release says
"Red Hat Enterprise Linux WS release 4 (Nahant Update 7)") to our cluster,
and now codes that use InfiniBand (IB) seg fault.

We have tried multiple versions of OpenMPI with both the PGI and GNU
compilers, and we have built with and without --memory-manager=none.  None
of that seems to matter.

When we copy mca_btl_openib.la and mca_btl_openib.so from a version of
OpenMPI compiled before the update into $OMPI_HOME/lib/openmpi/,
everything works fine - no seg faults.  To me this suggests something in
the relationship between those two files and libibverbs, although I'm at a
loss as to what that might be.  Note that the old version of libibverbs is
gone from the system, but the new version appears to provide both the
IBVerbs 1.0 and 1.1 interfaces.  That's just an assumption on my part, based
on running "strings /usr/lib64/libibverbs.so.1.0.0 | grep IBVER" and seeing
IBVERBS_1.0 and IBVERBS_1.1 in the output.
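
For anyone wanting to repeat that check, here is a minimal sketch of it as a
small shell helper.  The name ibverbs_tags is made up for illustration, and
the objdump comparison is only a suggestion: "objdump -p" normally lists the
symbol versions a shared object defines or requires, so running it on the old
and new mca_btl_openib.so might show whether they expect different IBVERBS
versions.  I have not confirmed that this is the cause.

```shell
# Print the distinct IBVERBS_x.y version tags found in text on stdin.
# ibverbs_tags is a hypothetical helper name, not part of any package.
ibverbs_tags() {
    grep -o 'IBVERBS_[0-9][0-9.]*' | sort -u
}

# Intended use (paths as given in this message; not executed here):
#   strings /usr/lib64/libibverbs.so.1.0.0 | ibverbs_tags
#   objdump -p "$OMPI_HOME/lib/openmpi/mca_btl_openib.so" | ibverbs_tags
```

If the plugin built before the update and the one built after it report
different tags, that would at least narrow down where to look.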

The RPM Red Hat provides for ibverbs is libibverbs-1.1.1-9.el4, and the
openib RPM is openib-1.3-5.el4.

The fairly uninformative seg fault looks like:
[me@node421 ~]$ mpirun -np 5 ./cpi127
[node422:28808] *** Process received signal ***
[node421:29922] *** Process received signal ***
[node421:29922] Signal: Segmentation fault (11)
[node421:29922] Signal code: Address not mapped (1)
[node421:29922] Failing at address: (nil)
[node422:28808] Signal: Segmentation fault (11)
[node422:28808] Signal code: Address not mapped (1)
[node422:28808] Failing at address: (nil)
[node422:28808] *** End of error message ***
[node421:29922] *** End of error message ***
[node421.engin.umich.edu:29917] [0,0,0]-[0,1,2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
mpirun noticed that job rank 0 with PID 29919 on node node421 exited on signal 15 (Terminated).
4 additional processes aborted (not shown)

Running that same code over Ethernet ("-mca btl ^openib") works fine.


The configure line for OpenMPI looks roughly like:
  ./configure --prefix=/home/software/rhel4/openmpi-1.2.7rc5/pgi-7.2 \
      --with-tm=/usr/local/torque --with-openib=/usr \
      CC=pgcc CXX=pgCC FC=pgf90 F77=pgf90

Sometimes I also added: --memory-manager=none

We're running the embedded subnet manager in our Topspin TS120 switch (but
I don't think that's the problem, since codes with the old libraries do
work fine).

Has anyone else seen any oddness with RH Update 7, libibverbs 1.1.1 and
OpenMPI, or are we looking at the wrong things?

config.log and ompi_info output are in the attached zip file.

Unfortunately, it's very possible that it's something local to our
installation, but if we had confirmation that this works for someone else,
it would greatly narrow our search space.

Thanks for any insights.

--andy

<<attachment: ib-segfault-configlog-ompiinfo.zip>>
