The problem is most likely a firewall between the target node and the node
where mpirun is executing - see the error message and its suggested causes
below (the last one in particular); a couple of quick checks are sketched
after the quoted list.

> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
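
A quick way to test that theory is to exercise the TCP path by hand. This
is only a sketch, not a definitive recipe - it assumes password-less ssh
between the nodes, a BSD-style nc (netcat) on both ends, and an arbitrary
example port (12345):

  # confirm that ssh itself reaches the target node
  ssh graphene-44 hostname

  # listen on the example port on the target node...
  ssh graphene-44 'nc -l 12345' &
  sleep 2   # give the listener a moment to start

  # ...and probe it from the node where mpirun runs
  nc -z graphene-44 12345 && echo open || echo blocked

If the probe fails, either open the firewall or confine OMPI to a port
range the firewall already allows. Note that the daemon wireup uses the
OOB channel, which selects its interface independently of the BTL, so you
may want oob_tcp_if_include alongside btl_tcp_if_include. For example
(verify the exact parameter names for your installed version with
ompi_info):

  mpirun --mca oob_tcp_if_include eth0 \
         --mca btl_tcp_if_include eth0 \
         --mca oob_tcp_dynamic_ipv4_ports 46000-46100 \
         -hostfile $OAR_NODEFILE -n 4 bcast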


Ralph


> On Mar 24, 2017, at 8:13 AM, Emin Nuriyev <emin.nuri...@ucdconnect.ie> wrote:
> 
> Hi,
> 
> I am trying to execute a simple MPI_Bcast program on the Nancy site. I
> tried several of the clusters in Nancy, and the result is the same. The
> last execution was on graphene. Each time I got the same error message.
> 
> ==========================================
> Connection closed by 172.16.64.43
> Connection closed by 172.16.64.44
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> [graphene-42.nancy.grid5000.fr:03619] [[56950,0],0]
> grpcomm:direct:send_relay proc [[56950,0],1] not running - cannot relay
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
> 
>   my node:   graphene-42
>   target node:  graphene-44
> 
> This is usually an internal programming error that should be
> reported to the developers. In the meantime, a workaround may
> be to set the MCA param routed=direct on the command line or
> in your environment. We apologize for the problem.
> =======================================================
> 
> I checked all the environment variables; all of them contain the values
> I need. I have experimented with most of the MCA parameters that can
> affect the MPI code:
> --mca routed direct
> --mca plm_rsh_agent "ssh"
> --mca btl_tcp_if_include eth0
> 
> Again, the same result. I ran the code on each node individually, and on
> a single node it is fine. The problem appears only when I run on all the
> reserved nodes:
> 
> mpirun --mca btl_tcp_if_include eth0 --mca plm_rsh_agent "ssh" -hostfile 
> $OAR_NODEFILE -n 4 bcast 
> 
> After these results, I left the Nancy site and connected to the Grenoble
> site. I reserved 4 nodes in the edel cluster and executed the same code.
> It works.
> 
> I think there is a problem with communication on the Nancy site. If I am
> wrong, what is the problem with my code or command line?

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
