Take a look at contrib/platform/lanl/cray_xe6/optimized-lustre.conf. There are a couple of MCA variables that need to be set in order to enable mpirun on Cray systems.
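As a minimal sketch, assuming the current directory is the top of an Open MPI source tree, the settings in that file can be listed as shown here. The helper name mca_settings is hypothetical, and the filter assumes the usual MCA parameter file layout of "name = value" lines with "#" comments; it does not say which of the listed variables are the ones mpirun needs.

```shell
# Print the active settings of an MCA parameter file,
# skipping comment lines and blank lines.
# (mca_settings is a hypothetical helper, not part of Open MPI.)
mca_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Path taken from the message above; relative to the source tree root.
conf=contrib/platform/lanl/cray_xe6/optimized-lustre.conf
if [ -f "$conf" ]; then
    mca_settings "$conf"
else
    echo "not found: $conf (run from the top of the Open MPI source tree)" >&2
fi
```

Any parameter listed this way can be tried on the command line in the same "-mca name value" form used in the quoted mpirun invocation, or placed in $HOME/.openmpi/mca-params.conf.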
-Nathan

On Thu, Apr 16, 2015 at 04:29:21PM -0400, Aurélien Bouteiller wrote:
>
> > - Improved support for Cray
>
> Cray's compilers, networks or the programming environment in general?
>
> I can compile on our Cray XC30, but not run with the options I used
> previously with trunk. Is there some secret sauce I am missing here ?
> I get an error with OOB on the node daemons. ESS PMI, RAS and PLM ALPS.
> /lustre/medusa/bouteill/openmpi-1.8.5rc1/bin/mpirun -np 1 -mca btl
> ugni,sm,self -mca coll tuned,basic,self -mca orte_tmpdir_base /var/tmp
> -mca plm_base_strip_prefix_from_node_names 1 -nolocal -novm
> --debug-daemons -mca oob_base_verbose 1000 -mca ras_alps_apstat_cmd
> $(which apstat) -mca ras alps -mca oob_tcp_if_include ipogif0 -map-by
> node hostname
> [aprun6-darter:16915] mca: base: components_register: registering oob
> components
> [aprun6-darter:16915] mca: base: components_register: found loaded
> component tcp
> [aprun6-darter:16915] mca: base: components_register: component tcp
> register function successful
> [aprun6-darter:16915] mca: base: components_open: opening oob components
> [aprun6-darter:16915] mca: base: components_open: found loaded component
> tcp
> [aprun6-darter:16915] mca: base: components_open: component tcp open
> function successful
> [aprun6-darter:16915] mca:oob:select: checking available component tcp
> [aprun6-darter:16915] mca:oob:select: Querying component [tcp]
> [aprun6-darter:16915] oob:tcp: component_available called
> [aprun6-darter:16915] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo
> (not in include list)
> [aprun6-darter:16915] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo
> (not in include list)
> [aprun6-darter:16915] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
> [aprun6-darter:16915] [[54804,0],0] oob:tcp:init adding 10.128.2.134 to
> our list of V4 connections
> [aprun6-darter:16915] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface eth1
> (not in include list)
> [aprun6-darter:16915] [[54804,0],0] TCP STARTUP
> [aprun6-darter:16915] [[54804,0],0] attempting to bind to IPv4 port 0
> [aprun6-darter:16915] [[54804,0],0] assigned IPv4 port 57286
> [aprun6-darter:16915] mca:oob:select: Adding component to end
> [aprun6-darter:16915] mca:oob:select: Found 1 active transports
> [nid00414:32573] mca: base: components_register: registering oob
> components
> [nid00414:32573] mca: base: components_register: found loaded component
> tcp
> [nid00414:32573] mca: base: components_register: component tcp register
> function successful
> [nid00414:32573] mca: base: components_open: opening oob components
> [nid00414:32573] mca: base: components_open: found loaded component tcp
> [nid00414:32573] mca: base: components_open: component tcp open function
> successful
> [nid00414:32573] mca:oob:select: checking available component tcp
> [nid00414:32573] mca:oob:select: Querying component [tcp]
> [nid00414:32573] oob:tcp: component_available called
> [nid00414:32573] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [nid00414:32573] [[54804,0],1] oob:tcp:init rejecting interface lo (not in
> include list)
> [nid00414:32573] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
> [nid00414:32573] [[54804,0],1] oob:tcp:init adding 10.128.1.161 to our
> list of V4 connections
> [nid00414:32573] [[54804,0],1] TCP STARTUP
> [nid00414:32573] [[54804,0],1] attempting to bind to IPv4 port 0
> [nid00414:32573] [[54804,0],1] assigned IPv4 port 57372
> [nid00414:32573] mca:oob:select: Adding component to end
> [nid00414:32573] mca:oob:select: Found 1 active transports
> Daemon [[54804,0],1] checking in as pid 32573 on host nid00414
> [nid00414:32573] [[54804,0],1] orted: up and running - waiting for
> commands!
> [nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
> [nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
> [nid00414:32573] [[54804,0],1]: set_addr to uri
> 3591634944.0;tcp://10.128.2.134:57286
> [nid00414:32573] [[54804,0],1]:set_addr checking if peer [[54804,0],0] is
> reachable via component tcp
> [nid00414:32573] [[54804,0],1] oob:tcp: working peer [[54804,0],0] address
> tcp://10.128.2.134:57286
> [nid00414:32573] [[54804,0],1] PASSING ADDR 10.128.2.134 TO MODULE
> [nid00414:32573] [[54804,0],1]:tcp set addr for peer [[54804,0],0]
> [nid00414:32573] [[54804,0],1]: peer [[54804,0],0] is reachable via
> component tcp
> [nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
> [nid00414:32573] [[54804,0],1] oob:base:send to target [[INVALID],INVALID]
> [nid00414:32573] [[54804,0],1] oob:base:send unknown peer
> [[INVALID],INVALID]
> [nid00414:32573] [[54804,0],1] is NOT reachable by TCP
> Application 1329706 exit codes: 1
> Application 1329706 resources: utime ~0s, stime ~0s, Rss ~5304, inblocks
> ~6404, outblocks ~28
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [aprun6-darter:16915] [[54804,0],0] TCP SHUTDOWN
> [aprun6-darter:16915] mca: base: close: component tcp closed
> [aprun6-darter:16915] mca: base: close: unloading component tcp
> --
> Aurelien Bouteiller ~ https://icl.cs.utk.edu/~bouteill/
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17234.php