Serguei,

This looks like a very different issue: orted cannot be started remotely.

That typically occurs when orted cannot find some of its dependencies (the Open MPI libs and/or the compiler runtime).

For example, running ssh <other node> orted from a node should not fail because of unresolved dependencies.
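
One way to see which dependencies are unresolved is to run ldd on the remote orted and keep only the "not found" lines. The sketch below is an illustration, not something from the original thread: the helper name missing_libs and the node name node2 in the usage comment are placeholders, and it assumes ldd is available on the remote node.

```shell
# Print the names of shared libraries that ldd reports as unresolved.
# Reads ldd output on stdin and prints one library name per line.
# (missing_libs is a hypothetical helper name.)
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Hypothetical usage, with node2 standing in for the other node:
#   ssh node2 'ldd "$(which orted)"' | missing_libs
```

If the ssh command prints nothing from which orted at all, the non-interactive PATH on that node is the likely culprit rather than a missing library.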

A simple trick is to replace

mpirun ...

with

`which mpirun` ...
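
The trick works because invoking mpirun by its absolute path is treated like passing --prefix with the matching install directory, so the remote orted is looked up under that directory. As a rough sketch of the same idea made explicit (the function name prefix_from_mpirun and the /opt/openmpi path are assumptions for illustration, and it presumes the usual <prefix>/bin/mpirun layout):

```shell
# Derive the install prefix from the full path of mpirun,
# assuming the usual <prefix>/bin/mpirun layout.
# (prefix_from_mpirun is a hypothetical helper name.)
prefix_from_mpirun() {
    dirname "$(dirname "$1")"
}

# Example with an assumed install location:
#   prefix_from_mpirun /opt/openmpi/bin/mpirun    # prints /opt/openmpi
#
# Passing the prefix explicitly instead of relying on the absolute path:
#   mpirun --prefix "$(prefix_from_mpirun "$(which mpirun)")" ...
```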

A better option (as long as you do not plan to relocate the Open MPI install dir) is to configure with

--enable-mpirun-prefix-by-default

Cheers,

Gilles

----- Original Message -----

    Hi All !

    As there have been no positive changes with the "UDSM + IPoIB" problem since my previous post,
    we installed IPoIB on the cluster, and the "No OpenFabrics connection..." error no longer appears.
    But now OpenMPI reports another problem:

    In app ERROR OUTPUT stream:

    [node2:14142] [[37935,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space in file base/plm_base_launch_support.c at line 1035

    In app OUTPUT stream:

    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to use.

    *  compilation of the orted with dynamic libraries when static are required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
    --------------------------------------------------------------------------

    When I run the task on a single node, everything works properly.
    But when I specify "run on 2 nodes", the problem appears.

    I ran ping using the IPoIB addresses: all hosts are resolved properly,
    and ping requests and replies go over IB without any problems.
    So all nodes (including the head node) see each other via IPoIB.
    But the MPI app fails.

    The same test task works perfectly on all nodes when run over the Ethernet transport instead of InfiniBand.

    P.S. We use the Torque resource manager to enqueue MPI tasks.

    Best regards,
    Sergei.



