Hi, I am trying to run a simple MPI_Bcast program on the Nancy site. I have tried several of the Nancy clusters and the result is always the same; the last run was on graphene. Each time I get the same error message:
==========================================
Connection closed by 172.16.64.43
Connection closed by 172.16.64.44
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more
  nodes. Please check your PATH and LD_LIBRARY_PATH settings, or
  configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
  (--tmpdir/orte_tmpdir_base). Please check with your sys admin to
  determine the correct location to use.

* compilation of the orted with dynamic libraries when static are
  required (e.g., on Cray). Please check your configure cmd line and
  consider using one of the contrib/platform definitions for your
  system type.

* an inability to create a connection back to mpirun due to a lack of
  common network interfaces and/or no route found between them.
  Please check network connectivity (including firewalls and network
  routing requirements).
--------------------------------------------------------------------------
[graphene-42.nancy.grid5000.fr:03619] [[56950,0],0] grpcomm:direct:send_relay proc [[56950,0],1] not running - cannot relay
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:     graphene-42
  target node: graphene-44

This is usually an internal programming error that should be reported
to the developers. In the meantime, a workaround may be to set the MCA
param routed=direct on the command line or in your environment. We
apologize for the problem.
=======================================================

I checked all of the environment variables, and all of them contain the values I need. I have also experimented with most of the MCA parameters that can affect an MPI run:

  --mca routed direct --mca plm_rsh_agent "ssh" --mca btl_tcp_if_include eth0

The result is always the same. Running the code on each node individually (a single node) works fine; the problem only appears when I try to use all of the reserved nodes at once:

  mpirun --mca btl_tcp_if_include eth0 --mca plm_rsh_agent "ssh" -hostfile $OAR_NODEFILE -n 4 bcast

After getting this result, I left the Nancy site and connected to the Grenoble site. I reserved 4 nodes in the edel cluster and ran the same code there, and it works. So I think there is a communication problem on the Nancy site. If I am wrong, what is the problem with my code or command line?
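For reference, the program is essentially the following minimal bcast sketch (my actual source may differ slightly in details; the value 42 is just a placeholder). It is compiled with mpicc -o bcast bcast.c and launched with the mpirun line above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;  /* root fills the buffer before the broadcast */

    /* broadcast the value from rank 0 to every rank in MPI_COMM_WORLD */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}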