Hi,

I am trying to run a simple MPI_Bcast program on the Nancy site. I have
tried several of the Nancy clusters, and the result is the same everywhere;
the last run was on the graphene cluster.
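The test program is essentially the following minimal MPI_Bcast example
(shown here as a simplified sketch, not my exact source):

    #include <mpi.h>
    #include <stdio.h>

    /* simplified sketch of the bcast test */
    int main(int argc, char **argv)
    {
        int rank;
        int value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            value = 42;   /* root initializes the value to broadcast */

        /* broadcast value from rank 0 to every rank in MPI_COMM_WORLD */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d got value %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }

Each time I get the same error message: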

==========================================
Connection closed by 172.16.64.43
Connection closed by 172.16.64.44
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[graphene-42.nancy.grid5000.fr:03619] [[56950,0],0]
grpcomm:direct:send_relay proc [[56950,0],1] not running - cannot relay
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   graphene-42
  target node:  graphene-44

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
=======================================================

I checked all the environment variables; they contain the values I expect.
I have also experimented with most of the MCA parameters that could affect
the MPI run:
--mca routed direct
--mca plm_rsh_agent "ssh"
--mca btl_tcp_if_include eth0
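
As the quoted help text suggests, an MCA parameter such as routed=direct
can also be set through the environment instead of on the command line,
since Open MPI picks up variables with the OMPI_MCA_ prefix:

    export OMPI_MCA_routed=direct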

Again, the result was the same. I also ran the test on each node
individually, and on a single node it works fine; the problem appears only
when I try to use all the reserved nodes.
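For example, a single-node run like this succeeds (using the same bcast
binary as in the command below):

    mpirun -n 4 bcast

The full multi-node run, however, fails: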

    mpirun --mca btl_tcp_if_include eth0 --mca plm_rsh_agent "ssh" \
        -hostfile $OAR_NODEFILE -n 4 bcast

After these results, I left the Nancy site and connected to the Grenoble
site. I reserved 4 nodes on the edel cluster and executed the same code,
and there it works.

I therefore suspect there is a communication problem on the Nancy site. If
I am wrong, what is the problem with my code or command line?