[OMPI devel] 1.8.5rc1 and OOB on Cray XC30
> > - Improved support for Cray

Cray's compilers, networks, or the programming environment in general?

I can compile on our Cray XC30, but not run with the options I used previously with trunk. Is there some secret sauce I am missing here? I get an error with OOB on the node daemons. ESS PMI, RAS and PLM ALPS.

    /lustre/medusa/bouteill/openmpi-1.8.5rc1/bin/mpirun -np 1 -mca btl ugni,sm,self \
        -mca coll tuned,basic,self -mca orte_tmpdir_base /var/tmp \
        -mca plm_base_strip_prefix_from_node_names 1 -nolocal -novm --debug-daemons \
        -mca oob_base_verbose 1000 -mca ras_alps_apstat_cmd $(which apstat) -mca ras alps \
        -mca oob_tcp_if_include ipogif0 -map-by node hostname

[aprun6-darter:16915] mca: base: components_register: registering oob components
[aprun6-darter:16915] mca: base: components_register: found loaded component tcp
[aprun6-darter:16915] mca: base: components_register: component tcp register function successful
[aprun6-darter:16915] mca: base: components_open: opening oob components
[aprun6-darter:16915] mca: base: components_open: found loaded component tcp
[aprun6-darter:16915] mca: base: components_open: component tcp open function successful
[aprun6-darter:16915] mca:oob:select: checking available component tcp
[aprun6-darter:16915] mca:oob:select: Querying component [tcp]
[aprun6-darter:16915] oob:tcp: component_available called
[aprun6-darter:16915] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo (not in include list)
[aprun6-darter:16915] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo (not in include list)
[aprun6-darter:16915] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init adding 10.128.2.134 to our list of V4 connections
[aprun6-darter:16915] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface eth1 (not in include list)
[aprun6-darter:16915] [[54804,0],0] TCP STARTUP
[aprun6-darter:16915] [[54804,0],0] attempting to bind to IPv4 port 0
[aprun6-darter:16915] [[54804,0],0] assigned IPv4 port 57286
[aprun6-darter:16915] mca:oob:select: Adding component to end
[aprun6-darter:16915] mca:oob:select: Found 1 active transports
[nid00414:32573] mca: base: components_register: registering oob components
[nid00414:32573] mca: base: components_register: found loaded component tcp
[nid00414:32573] mca: base: components_register: component tcp register function successful
[nid00414:32573] mca: base: components_open: opening oob components
[nid00414:32573] mca: base: components_open: found loaded component tcp
[nid00414:32573] mca: base: components_open: component tcp open function successful
[nid00414:32573] mca:oob:select: checking available component tcp
[nid00414:32573] mca:oob:select: Querying component [tcp]
[nid00414:32573] oob:tcp: component_available called
[nid00414:32573] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[nid00414:32573] [[54804,0],1] oob:tcp:init rejecting interface lo (not in include list)
[nid00414:32573] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[nid00414:32573] [[54804,0],1] oob:tcp:init adding 10.128.1.161 to our list of V4 connections
[nid00414:32573] [[54804,0],1] TCP STARTUP
[nid00414:32573] [[54804,0],1] attempting to bind to IPv4 port 0
[nid00414:32573] [[54804,0],1] assigned IPv4 port 57372
[nid00414:32573] mca:oob:select: Adding component to end
[nid00414:32573] mca:oob:select: Found 1 active transports
Daemon [[54804,0],1] checking in as pid 32573 on host nid00414
[nid00414:32573] [[54804,0],1] orted: up and running - waiting for commands!
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1]: set_addr to uri 3591634944.0;tcp://10.128.2.134:57286
[nid00414:32573] [[54804,0],1]:set_addr checking if peer [[54804,0],0] is reachable via component tcp
[nid00414:32573] [[54804,0],1] oob:tcp: working peer [[54804,0],0] address tcp://10.128.2.134:57286
[nid00414:32573] [[54804,0],1] PASSING ADDR 10.128.2.134 TO MODULE
[nid00414:32573] [[54804,0],1]:tcp set addr for peer [[54804,0],0]
[nid00414:32573] [[54804,0],1]: peer [[54804,0],0] is reachable via component tcp
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1] oob:base:send to target [[INVALID],INVALID]
[nid00414:32573] [[54804,0],1] oob:base:send unknown peer [[INVALID],INVALID]
[nid00414:32573] [[54804,0],1] is NOT reachable by TCP
Application 1329706 exit codes: 1
Application 1329706 resources: utime ~0s, stime ~0s, Rss ~5304, inblocks ~6404, outblocks ~28
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection
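A quick sanity check when chasing this kind of interface-filtering problem (a sketch, not part of the original post): confirm which network interfaces actually exist on the compute nodes, since -mca oob_tcp_if_include ipogif0 silently drops everything else. The commands assume a standard CLE/ALPS setup where aprun is available and /sys and /proc are mounted on the compute nodes:

    # List interface names and traffic counters on one compute node
    aprun -n 1 sh -c 'hostname; ls /sys/class/net; cat /proc/net/dev'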
Re: [OMPI devel] 1.8.5rc1 and OOB on Cray XC30
Take a look at contrib/platform/lanl/cray_xe6/optimized-lustre.conf. There are a couple of MCA variables that need to be set in order to enable mpirun on Cray systems.

-Nathan

On Thu, Apr 16, 2015 at 04:29:21PM -0400, Aurélien Bouteiller wrote:
> I can compile on our Cray XC30, but not run with the options I used
> previously with trunk. Is there some secret sauce I am missing here?
> I get an error with OOB on the node daemons. ESS PMI, RAS and PLM ALPS.
> [...]
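A minimal sketch of two ways to pick up the settings from that file, assuming the paths exist in your source tree as named in the reply (the actual variable names and values live in the .conf file and are not reproduced here):

    # 1) Build with the matching platform file so its companion .conf is
    #    installed as the default MCA params file:
    ./configure --with-platform=contrib/platform/lanl/cray_xe6/optimized-lustre ...

    # 2) Or, with an already-installed Open MPI, copy the variables you need
    #    into a per-user params file that mpirun reads at startup:
    mkdir -p $HOME/.openmpi
    cp contrib/platform/lanl/cray_xe6/optimized-lustre.conf $HOME/.openmpi/mca-params.conf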
[OMPI devel] interaction with slurm 14.11
Our site effectively runs all slurm jobs with

    sbatch --export=NONE ...

and creates the necessary environment inside the batch script. After upgrading to 14.11, Open MPI mpirun jobs hit

    2015-04-15T08:53:54+08:00 nod0138 slurmstepd[3122]: error: execve(): orted: No such file or directory

The issue appears to be that, as of 14.11, srun now recognizes --export=NONE and, more importantly, the SLURM_EXPORT_ENV=NONE that is set in the job's environment when you submit with sbatch --export=NONE. The simple workaround is to unset SLURM_EXPORT_ENV before mpirun. Possibly mpirun should add --export=ALL to its srun commands.

Cheers
David
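A sketch of the workaround described above, as it might look inside a batch script (the application name ./my_app and the environment-setup lines are illustrative, not from the thread):

    #!/bin/bash
    #SBATCH --export=NONE
    # ... set up the environment the job needs here (modules, PATH, LD_LIBRARY_PATH) ...

    # With slurm 14.11, sbatch --export=NONE leaves SLURM_EXPORT_ENV=NONE in the
    # job environment; mpirun's internal srun then honors it and launches orted
    # without the job's environment, producing the "orted: No such file or
    # directory" error.  Clear it before calling mpirun.
    unset SLURM_EXPORT_ENV
    mpirun ./my_app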