On Nov 27, 2007, at 10:49 AM, Andrew Friedley wrote:



Brock Palen wrote:
On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote:

If this is what I think it is, try using this MCA parameter:

-mca btl_openib_ib_timeout 20

The user used this option and it allowed the run to complete.
You say it's an issue with the fabric, but ibshowerrors does not show any
problems.

It's Topspin (Cisco) gear: NICs, switch, cables.
Should I follow up with Cisco more?

Sure, why not, if you think it'd be useful.  FWIW, I see this on
Voltaire/Mellanox hardware with Open MPI; others here at LLNL tell me
they've seen it with MVAPICH as well.

What would be a good place to look? Should this just be the default for
OMPI, then? ompi_info shows the default as 10. Is the unit really
'seconds'?
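On the units question: as far as I know the value is not a number of seconds. The sketch below assumes that btl_openib_ib_timeout is handed to the HCA as the InfiniBand "local ACK timeout" exponent, so the wait per attempt is 4.096 microseconds * 2^value. The function name is made up for illustration and is not part of Open MPI.

```python
# Sketch: convert an InfiniBand local ACK timeout exponent to wall-clock time.
# Assumption: btl_openib_ib_timeout is used as that exponent, so the wait
# per attempt is 4.096 microseconds * 2**value (not "value" seconds).

BASE_US = 4.096  # base unit of the local ACK timeout, in microseconds


def ack_timeout_seconds(exponent: int) -> float:
    """Return the per-attempt ACK timeout in seconds.

    The field is 5 bits wide; 0 is treated as "wait forever" by the
    hardware, so only 1..31 are converted here.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in the range 1..31")
    return BASE_US * (2 ** exponent) / 1e6


if __name__ == "__main__":
    for value in (10, 20):
        seconds = ack_timeout_seconds(value)
        print(f"btl_openib_ib_timeout={value}: {seconds:.6f} s per attempt")
```

Under that assumption, 10 works out to roughly 4.2 milliseconds per attempt and 20 to roughly 4.3 seconds, which would explain why raising the value lets a run survive short stalls in the fabric.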



Andrew

Brock

If this fixes it -- I don't fully understand what's going on, but it's
an issue in the IB fabric itself.  Someone else might be able to
explain in more detail.

Andrew


Brian Dobbins wrote:
Hi Brock
We have a user whose code keeps failing at a similar point in the
code. The errors (below) would make me think it's a fabric problem,
but ibcheckerrors is not returning any issues.  He is using
openmpi-1.2.0 with OFED on RHEL4.

  Strangely enough, I hit this exact problem about half an hour ago...
What compilers is he using for the code / OpenMPI?  I haven't narrowed
down the cause yet because the system I'm on is a tad, uh, disheveled,
but it'd be good to find any commonality.  I'm using PGI-7.1-2
(pgf77/pgf90) with OpenMPI-1.2.4.  The system also happens to be
RHEL 4 (Update 3).

Also, the code I'm running is CCSM, and it gave an error message
about being unable to read a file correctly right before my
synchronization.  This code has worked on other systems in the past
(non-IB, non-IBRIX), but something as basic as a file write shouldn't
be adversely affected by such things, hence I'm going to try backing
the compiler down to a 'known-good' one first, since perhaps that's my
problem. I don't suppose you saw any messages of that sort? I did
already try setting the retry count parameter up to 20 (from 7), but
that didn't fix it.

  Cheers,
  - Brian


Brian Dobbins
Yale University HPC

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users