Take a look at contrib/platform/lanl/cray_xe6/optimized-lustre.conf. There are a
couple of MCA variables that need to be set in order to enable mpirun on Cray
systems.
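
For illustration only, here is a minimal Python sketch of how settings in such a file relate to what you would otherwise pass to mpirun by hand. It assumes the usual MCA parameter file format (one `name = value` per line, `#` starting a comment). The sample values are copied from the command line quoted in this thread, not from optimized-lustre.conf, and the function names are hypothetical helpers made up for this example.

```python
def parse_mca_params(text):
    """Parse 'name = value' lines from MCA parameter file text into a dict."""
    params = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment that runs to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def to_mpirun_args(params):
    """Render the parameters as '-mca name value' command-line arguments."""
    args = []
    for name, value in params.items():
        args += ["-mca", name, value]
    return args


def to_environment(params):
    """Render the parameters as OMPI_MCA_<name> environment variables."""
    return {"OMPI_MCA_" + name: value for name, value in params.items()}


# Hypothetical sample: values taken from the quoted mpirun command,
# NOT the actual contents of optimized-lustre.conf.
SAMPLE = """\
# interface and temporary directory used in the quoted run
oob_tcp_if_include = ipogif0
orte_tmpdir_base = /var/tmp
"""

if __name__ == "__main__":
    params = parse_mca_params(SAMPLE)
    print(" ".join(to_mpirun_args(params)))
    for key, value in to_environment(params).items():
        print(f"export {key}={value}")
```

The same parameter can be given as `-mca name value` on the command line, as an `OMPI_MCA_name` environment variable, or in a parameter file such as $HOME/.openmpi/mca-params.conf, so the values from the platform file can be tried out without rebuilding.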

-Nathan

On Thu, Apr 16, 2015 at 04:29:21PM -0400, Aurélien Bouteiller wrote:
>       
> 
>        - Improved support for Cray
> 
>      Cray's compilers, networks or the programming environment in general? 
> 
>    I can compile on our Cray XC30, but not run with the options I used
>    previously with trunk. Is there some secret sauce I am missing here?
>    I get an error with OOB on the node daemons. ESS PMI, RAS and PLM ALPS. 
>    /lustre/medusa/bouteill/openmpi-1.8.5rc1/bin/mpirun -np 1 \
>      -mca btl ugni,sm,self -mca coll tuned,basic,self \
>      -mca orte_tmpdir_base /var/tmp \
>      -mca plm_base_strip_prefix_from_node_names 1 -nolocal -novm \
>      --debug-daemons -mca oob_base_verbose 1000 \
>      -mca ras_alps_apstat_cmd $(which apstat) -mca ras alps \
>      -mca oob_tcp_if_include ipogif0 -map-by node hostname
>    [aprun6-darter:16915] mca: base: components_register: registering oob
>    components
>    [aprun6-darter:16915] mca: base: components_register: found loaded
>    component tcp
>    [aprun6-darter:16915] mca: base: components_register: component tcp
>    register function successful
>    [aprun6-darter:16915] mca: base: components_open: opening oob components
>    [aprun6-darter:16915] mca: base: components_open: found loaded component
>    tcp
>    [aprun6-darter:16915] mca: base: components_open: component tcp open
>    function successful
>    [aprun6-darter:16915] mca:oob:select: checking available component tcp
>    [aprun6-darter:16915] mca:oob:select: Querying component [tcp]
>    [aprun6-darter:16915] oob:tcp: component_available called
>    [aprun6-darter:16915] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>    [aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo
>    (not in include list)
>    [aprun6-darter:16915] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>    [aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface lo
>    (not in include list)
>    [aprun6-darter:16915] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
>    [aprun6-darter:16915] [[54804,0],0] oob:tcp:init adding 10.128.2.134 to
>    our list of V4 connections
>    [aprun6-darter:16915] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>    [aprun6-darter:16915] [[54804,0],0] oob:tcp:init rejecting interface eth1
>    (not in include list)
>    [aprun6-darter:16915] [[54804,0],0] TCP STARTUP
>    [aprun6-darter:16915] [[54804,0],0] attempting to bind to IPv4 port 0
>    [aprun6-darter:16915] [[54804,0],0] assigned IPv4 port 57286
>    [aprun6-darter:16915] mca:oob:select: Adding component to end
>    [aprun6-darter:16915] mca:oob:select: Found 1 active transports
>    [nid00414:32573] mca: base: components_register: registering oob
>    components
>    [nid00414:32573] mca: base: components_register: found loaded component
>    tcp
>    [nid00414:32573] mca: base: components_register: component tcp register
>    function successful
>    [nid00414:32573] mca: base: components_open: opening oob components
>    [nid00414:32573] mca: base: components_open: found loaded component tcp
>    [nid00414:32573] mca: base: components_open: component tcp open function
>    successful
>    [nid00414:32573] mca:oob:select: checking available component tcp
>    [nid00414:32573] mca:oob:select: Querying component [tcp]
>    [nid00414:32573] oob:tcp: component_available called
>    [nid00414:32573] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>    [nid00414:32573] [[54804,0],1] oob:tcp:init rejecting interface lo (not in
>    include list)
>    [nid00414:32573] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
>    [nid00414:32573] [[54804,0],1] oob:tcp:init adding 10.128.1.161 to our
>    list of V4 connections
>    [nid00414:32573] [[54804,0],1] TCP STARTUP
>    [nid00414:32573] [[54804,0],1] attempting to bind to IPv4 port 0
>    [nid00414:32573] [[54804,0],1] assigned IPv4 port 57372
>    [nid00414:32573] mca:oob:select: Adding component to end
>    [nid00414:32573] mca:oob:select: Found 1 active transports
>    Daemon [[54804,0],1] checking in as pid 32573 on host nid00414
>    [nid00414:32573] [[54804,0],1] orted: up and running - waiting for
>    commands!
>    [nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
>    [nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
>    [nid00414:32573] [[54804,0],1]: set_addr to uri
>    3591634944.0;tcp://10.128.2.134:57286
>    [nid00414:32573] [[54804,0],1]:set_addr checking if peer [[54804,0],0] is
>    reachable via component tcp
>    [nid00414:32573] [[54804,0],1] oob:tcp: working peer [[54804,0],0] address
>    tcp://10.128.2.134:57286
>    [nid00414:32573] [[54804,0],1] PASSING ADDR 10.128.2.134 TO MODULE
>    [nid00414:32573] [[54804,0],1]:tcp set addr for peer [[54804,0],0]
>    [nid00414:32573] [[54804,0],1]: peer [[54804,0],0] is reachable via
>    component tcp
>    [nid00414:32573] [[54804,0],1] OOB_SEND: rml_oob_send.c:199
>    [nid00414:32573] [[54804,0],1] oob:base:send to target [[INVALID],INVALID]
>    [nid00414:32573] [[54804,0],1] oob:base:send unknown peer
>    [[INVALID],INVALID]
>    [nid00414:32573] [[54804,0],1] is NOT reachable by TCP
>    Application 1329706 exit codes: 1
>    Application 1329706 resources: utime ~0s, stime ~0s, Rss ~5304, inblocks
>    ~6404, outblocks ~28
>    --------------------------------------------------------------------------
>    An ORTE daemon has unexpectedly failed after launch and before
>    communicating back to mpirun. This could be caused by a number
>    of factors, including an inability to create a connection back
>    to mpirun due to a lack of common network interfaces and/or no
>    route found between them. Please check network connectivity
>    (including firewalls and network routing requirements).
>    --------------------------------------------------------------------------
>    [aprun6-darter:16915] [[54804,0],0] TCP SHUTDOWN
>    [aprun6-darter:16915] mca: base: close: component tcp closed
>    [aprun6-darter:16915] mca: base: close: unloading component tcp
>    --  
>    Aurelien Bouteiller ~ https://icl.cs.utk.edu/~bouteill/



> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17234.php
