> > - Improved support for Cray
> > Cray's compilers, networks or the programming environment in general?
I can compile on our Cray XC30, but cannot run with the options I previously used with trunk. Is there some secret sauce I am missing here? I get an error with OOB on the node daemons. This is with ESS PMI, RAS and PLM ALPS.

/lustre/medusa/bouteill/openmpi-1.8.5rc1/bin/mpirun -np 1 -mca btl ugni,sm,self -mca coll tuned,basic,self -mca orte_tmpdir_base /var/tmp -mca plm_base_strip_prefix_from_node_names 1 -nolocal -novm --debug-daemons -mca oob_base_verbose 1000 -mca ras_alps_apstat_cmd $(which apstat) -mca ras alps -mca oob_tcp_if_include ipogif0 -map-by node hostname

[aprun6-darter:16915] mca: base: components_register: registering oob components
[aprun6-darter:16915] mca: base: components_register: found loaded component tcp
[aprun6-darter:16915] mca: base: components_register: component tcp register function successful
[aprun6-darter:16915] mca: base: components_open: opening oob components
[aprun6-darter:16915] mca: base: components_open: found loaded component tcp
[aprun6-darter:16915] mca: base: components_open: component tcp open function successful
[aprun6-darter:16915] mca:oob:select: checking available component tcp
[aprun6-darter:16915] mca:oob:select: Querying component [tcp]
[aprun6-darter:16915] oob:tcp: component_available called
[aprun6-darter:16915] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo (not in include list)
[aprun6-darter:16915] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo (not in include list)
[aprun6-darter:16915] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init adding 10.128.2.134 to our list of V4 connections
[aprun6-darter:16915] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface eth1 (not in include list)
[aprun6-darter:16915] [[54804,0],0] TCP STARTUP
[aprun6-darter:16915] [[54804,0],0] attempting to bind to IPv4 port 0
[aprun6-darter:16915] [[54804,0],0] assigned IPv4 port 57286
[aprun6-darter:16915] mca:oob:select: Adding component to end
[aprun6-darter:16915] mca:oob:select: Found 1 active transports
[nid00414:32573] mca: base: components_register: registering oob components
[nid00414:32573] mca: base: components_register: found loaded component tcp
[nid00414:32573] mca: base: components_register: component tcp register function successful
[nid00414:32573] mca: base: components_open: opening oob components
[nid00414:32573] mca: base: components_open: found loaded component tcp
[nid00414:32573] mca: base: components_open: component tcp open function successful
[nid00414:32573] mca:oob:select: checking available component tcp
[nid00414:32573] mca:oob:select: Querying component [tcp]
[nid00414:32573] oob:tcp: component_available called
[nid00414:32573] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[nid00414:32573] [[54804,0],1] oob:tcp:init rejecting interface lo (not in include list)
[nid00414:32573] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[nid00414:32573] [[54804,0],1] oob:tcp:init adding 10.128.1.161 to our list of V4 connections
[nid00414:32573] [[54804,0],1] TCP STARTUP
[nid00414:32573] [[54804,0],1] attempting to bind to IPv4 port 0
[nid00414:32573] [[54804,0],1] assigned IPv4 port 57372
[nid00414:32573] mca:oob:select: Adding component to end
[nid00414:32573] mca:oob:select: Found 1 active transports
Daemon [[54804,0],1] checking in as pid 32573 on host nid00414
[nid00414:32573] [[54804,0],1] orted: up and running - waiting for commands!
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1]: set_addr to uri 3591634944.0;tcp://10.128.2.134:57286
[nid00414:32573] [[54804,0],1]:set_addr checking if peer [[54804,0],0] is reachable via component tcp
[nid00414:32573] [[54804,0],1] oob:tcp: working peer [[54804,0],0] address tcp://10.128.2.134:57286
[nid00414:32573] [[54804,0],1] PASSING ADDR 10.128.2.134 TO MODULE
[nid00414:32573] [[54804,0],1]:tcp set addr for peer [[54804,0],0]
[nid00414:32573] [[54804,0],1]: peer [[54804,0],0] is reachable via component tcp
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1] oob:base:send to target [[INVALID],INVALID]
[nid00414:32573] [[54804,0],1] oob:base:send unknown peer [[INVALID],INVALID]
[nid00414:32573] [[54804,0],1] is NOT reachable by TCP
Application 1329706 exit codes: 1
Application 1329706 resources: utime ~0s, stime ~0s, Rss ~5304, inblocks ~6404, outblocks ~28
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[aprun6-darter:16915] [[54804,0],0] TCP SHUTDOWN
[aprun6-darter:16915] mca: base: close: component tcp closed
[aprun6-darter:16915] mca: base: close: unloading component tcp

--
Aurélien Bouteiller ~ https://icl.cs.utk.edu/~bouteill/