The problem is likely a firewall between the target node and the node where mpirun is executing; see the error message and the suggested causes:
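One quick way to test the firewall/connectivity hypothesis is to verify that password-less SSH works from the mpirun node to each target node. A minimal sketch (the node names are placeholders taken from the error output below, and `check_ssh` is a hypothetical helper, not part of Open MPI):

```shell
# check_ssh NODE: probe password-less SSH to NODE with a short timeout.
# A failure here points at keys, firewalls, or routing rather than Open MPI.
check_ssh() {
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null; then
        echo "$1: ssh ok"
    else
        echo "$1: ssh FAILED (check keys, firewall, routing)"
    fi
}

# Example: probe the nodes named in the error output (placeholder names).
for node in graphene-42 graphene-44; do
    check_ssh "$node"
done
```

If any node reports a failure, fixing SSH reachability first usually resolves the "unable to reliably start one or more daemons" error as well.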
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).

Ralph

> On Mar 24, 2017, at 8:13 AM, Emin Nuriyev <emin.nuri...@ucdconnect.ie> wrote:
>
> Hi,
>
> I am trying to execute a simple MPI_Bcast code on the Nancy site. I tried
> several of the clusters in Nancy, and the result is the same. The last
> execution was on graphene. Each time I got the same error message:
>
> ==========================================
> Connection closed by 172.16.64.43
> Connection closed by 172.16.64.44
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> [graphene-42.nancy.grid5000.fr:03619] [[56950,0],0]
> grpcomm:direct:send_relay proc [[56950,0],1] not running - cannot relay
> --------------------------------------------------------------------------
> ORTE does not know how to route a message to the specified daemon
> located on the indicated node:
>
>   my node:     graphene-42
>   target node: graphene-44
>
> This is usually an internal programming error that should be
> reported to the developers. In the meantime, a workaround may
> be to set the MCA param routed=direct on the command line or
> in your environment. We apologize for the problem.
> =======================================================
>
> I checked all the environment variables; all of them contain the values I
> need. I have experimented with most of the MCA parameters that can affect
> an MPI run:
>
> --mca routed direct
> --mca plm_rsh_agent "ssh"
> --mca btl_tcp_if_include eth0
>
> Again, the same result. Running the code on each node individually works;
> the problem appears only when I launch across all the reserved nodes:
>
> mpirun --mca btl_tcp_if_include eth0 --mca plm_rsh_agent "ssh" -hostfile
> $OAR_NODEFILE -n 4 bcast
>
> After getting this result, I left the Nancy site and connected to the
> Grenoble site. I reserved 4 nodes in the edel cluster and executed the
> same code. It works.
>
> I think there is a problem with communication on the Nancy site. If I am
> not right, what is the problem with my code or command line?
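One way to see why the remote orted daemons never report back is to re-run the same launch with Open MPI's launcher (PLM) and wire-up (OOB) verbosity raised. A sketch of such a command, composed but only echoed here, since `$OAR_NODEFILE` and the `bcast` binary from the mail above exist only on the cluster, and the verbosity levels are illustrative:

```shell
# Compose a diagnostic re-run of the failing launch with PLM and OOB debug
# output enabled; echoed rather than executed, because the hostfile and the
# bcast binary are only available on the Grid'5000 nodes.
cmd='mpirun --mca plm_base_verbose 5 --mca oob_base_verbose 5
      --mca btl_tcp_if_include eth0 --mca plm_rsh_agent ssh
      -hostfile $OAR_NODEFILE -n 4 bcast'
echo "$cmd"
```

The verbose output shows each ssh launch attempt and the callback connection each orted tries to open to mpirun, which is usually enough to tell an SSH/launch failure apart from a blocked return route.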
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel