[OMPI devel] 1.8.5rc1 and OOB on Cray XC30

2015-04-16 Thread Aurélien Bouteiller
> 
> - Improved support for Cray
> 
> Cray's compilers, networks or the programming environment in general?

I can compile on our Cray XC30, but I cannot run with the options I used previously 
with trunk. Is there some secret sauce I am missing here?
I get an OOB error on the node daemons (ESS: PMI; RAS and PLM: ALPS).


/lustre/medusa/bouteill/openmpi-1.8.5rc1/bin/mpirun -np 1   -mca btl 
ugni,sm,self -mca coll tuned,basic,self -mca orte_tmpdir_base /var/tmp -mca 
plm_base_strip_prefix_from_node_names 1 -nolocal -novm  --debug-daemons -mca 
oob_base_verbose 1000 -mca ras_alps_apstat_cmd $(which apstat) -mca ras alps  
-mca oob_tcp_if_include ipogif0  -map-by node hostname
[aprun6-darter:16915] mca: base: components_register: registering oob components
[aprun6-darter:16915] mca: base: components_register: found loaded component tcp
[aprun6-darter:16915] mca: base: components_register: component tcp register 
function successful
[aprun6-darter:16915] mca: base: components_open: opening oob components
[aprun6-darter:16915] mca: base: components_open: found loaded component tcp
[aprun6-darter:16915] mca: base: components_open: component tcp open function 
successful
[aprun6-darter:16915] mca:oob:select: checking available component tcp
[aprun6-darter:16915] mca:oob:select: Querying component [tcp]
[aprun6-darter:16915] oob:tcp: component_available called
[aprun6-darter:16915] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo (not in 
include list)
[aprun6-darter:16915] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo (not in 
include list)
[aprun6-darter:16915] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init adding 10.128.2.134 to our 
list of V4 connections
[aprun6-darter:16915] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface eth1 (not 
in include list)
[aprun6-darter:16915] [[54804,0],0] TCP STARTUP
[aprun6-darter:16915] [[54804,0],0] attempting to bind to IPv4 port 0
[aprun6-darter:16915] [[54804,0],0] assigned IPv4 port 57286
[aprun6-darter:16915] mca:oob:select: Adding component to end
[aprun6-darter:16915] mca:oob:select: Found 1 active transports
[nid00414:32573] mca: base: components_register: registering oob components
[nid00414:32573] mca: base: components_register: found loaded component tcp
[nid00414:32573] mca: base: components_register: component tcp register 
function successful
[nid00414:32573] mca: base: components_open: opening oob components
[nid00414:32573] mca: base: components_open: found loaded component tcp
[nid00414:32573] mca: base: components_open: component tcp open function 
successful
[nid00414:32573] mca:oob:select: checking available component tcp
[nid00414:32573] mca:oob:select: Querying component [tcp]
[nid00414:32573] oob:tcp: component_available called
[nid00414:32573] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[nid00414:32573] [[54804,0],1] oob:tcp:init rejecting interface lo (not in 
include list)
[nid00414:32573] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[nid00414:32573] [[54804,0],1] oob:tcp:init adding 10.128.1.161 to our list of 
V4 connections
[nid00414:32573] [[54804,0],1] TCP STARTUP
[nid00414:32573] [[54804,0],1] attempting to bind to IPv4 port 0
[nid00414:32573] [[54804,0],1] assigned IPv4 port 57372
[nid00414:32573] mca:oob:select: Adding component to end
[nid00414:32573] mca:oob:select: Found 1 active transports
Daemon [[54804,0],1] checking in as pid 32573 on host nid00414
[nid00414:32573] [[54804,0],1] orted: up and running - waiting for commands!
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1]: set_addr to uri 
3591634944.0;tcp://10.128.2.134:57286
[nid00414:32573] [[54804,0],1]:set_addr checking if peer [[54804,0],0] is 
reachable via component tcp
[nid00414:32573] [[54804,0],1] oob:tcp: working peer [[54804,0],0] address 
tcp://10.128.2.134:57286
[nid00414:32573] [[54804,0],1] PASSING ADDR 10.128.2.134 TO MODULE
[nid00414:32573] [[54804,0],1]:tcp set addr for peer [[54804,0],0]
[nid00414:32573] [[54804,0],1]: peer [[54804,0],0] is reachable via component 
tcp
[nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
[nid00414:32573] [[54804,0],1] oob:base:send to target [[INVALID],INVALID]
[nid00414:32573] [[54804,0],1] oob:base:send unknown peer [[INVALID],INVALID]
[nid00414:32573] [[54804,0],1] is NOT reachable by TCP
Application 1329706 exit codes: 1
Application 1329706 resources: utime ~0s, stime ~0s, Rss ~5304, inblocks ~6404, 
outblocks ~28
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection 

Re: [OMPI devel] 1.8.5rc1 and OOB on Cray XC30

2015-04-16 Thread Nathan Hjelm

Take a look at contrib/platform/lanl/cray_xe6/optimized-lustre.conf. There are a
couple of MCA variables that need to be set in order to enable mpirun on Cray
systems.

-Nathan
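
For the general mechanism (not the exact values), those MCA settings can also be
collected in a per-user parameter file instead of a long command line. The sketch
below only restates the parameters already used on the mpirun command line in this
thread; it is not the contents of optimized-lustre.conf, so check that file for what
is actually required on Cray.

# $HOME/.openmpi/mca-params.conf -- sketch only; values copied from the
# mpirun command line quoted in this thread, not from the LANL platform file
ras = alps
# site-specific path to apstat (the command line used $(which apstat))
ras_alps_apstat_cmd = /usr/bin/apstat
plm_base_strip_prefix_from_node_names = 1
oob_tcp_if_include = ipogif0
orte_tmpdir_base = /var/tmp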

On Thu, Apr 16, 2015 at 04:29:21PM -0400, Aurélien Bouteiller wrote:
> > - Improved support for Cray
> > 
> > Cray's compilers, networks or the programming environment in general?
> 
> I can compile on our Cray XC30, but not run with the options I used
> previously with trunk. Is there some secret sauce I am missing here?
> I get an OOB error on the node daemons (ESS: PMI; RAS and PLM: ALPS).

[OMPI devel] interaction with slurm 14.11

2015-04-16 Thread David Singleton
Our site effectively runs all Slurm jobs with sbatch --export=NONE ... and
creates the necessary environment inside the batch script. After upgrading
to 14.11, Open MPI mpirun jobs hit

2015-04-15T08:53:54+08:00 nod0138 slurmstepd[3122]: error: execve(): orted:
No such file or directory

The issue appears to be that, as of 14.11, srun now recognizes
--export=NONE and, more importantly, the SLURM_EXPORT_ENV=NONE set in the
job's environment if you submit with sbatch --export=NONE. The simple
workaround is to unset SLURM_EXPORT_ENV before calling mpirun. Possibly
mpirun should add --export=ALL to its srun commands.
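
In batch-script form that workaround would look roughly like the sketch below;
the application name is a placeholder and any site-specific module/setup lines
are omitted.

#!/bin/bash
#SBATCH --export=NONE
# sbatch --export=NONE places SLURM_EXPORT_ENV=NONE in the job environment;
# clearing it lets the srun that mpirun invokes internally export the
# environment again, so the remote nodes can find orted.
unset SLURM_EXPORT_ENV
mpirun ./my_mpi_app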

Cheers
David