On Apr 25, 2009, at 11:59 AM, Anton Starikov wrote:
I can confirm that I have exactly the same problem, also on a Dell
system, even with the latest openmpi.
Our system is:
Dell M905
OpenSUSE 11.1
kernel: 2.6.27.21-0.1-default
ofed-1.4-21.12 from SUSE repositories.
OpenMPI-1.3.2
What I can also add is that it does not only affect openmpi. If these
messages are triggered after mpirun:
[node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP CQ with -2 errno says Success
Then the IB stack hangs. You cannot even reload it; you have to reboot the node.
Something that severe should not be able to be caused by Open MPI.
Specifically: Open MPI should not be able to hang the OFED stack.
Have you run layer 0 diagnostics to know that your fabric is clean?
You might want to contact your IB vendor to find out how to do that.
--
Jeff Squyres
Cisco Systems