Andy,

What about reconfiguring Open MPI with LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"?

IIRC, another option is LDFLAGS="-static-intel".
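For example, a sketch based on the configure line Andy posted below (either LDFLAGS variant is appended the same way):

    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
         --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
         AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
         --enable-orterun-prefix-by-default --enable-debug \
         LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"

With the rpath baked in, orted should find libimf.so without any LD_LIBRARY_PATH help.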

Last but not least, you can always replace orted with a simple script that sets LD_LIBRARY_PATH and then execs the original orted; a sketch follows.
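A minimal sketch of such a wrapper (it assumes the real daemon is first renamed, e.g. to orted.bin -- that name is only for illustration):

    #!/bin/sh
    # hypothetical wrapper installed as .../bin/orted:
    # prepend the Intel MIC runtime directory, then exec the real daemon
    LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH
    exec /home/ariebs/mic/mpi-nightly/bin/orted.bin "$@"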

Do you have the same behaviour on non-MIC hardware when Open MPI is compiled with the Intel compilers? If it works on non-MIC hardware, the root cause could be the sshd_config on the MIC, which may not
accept a client-supplied LD_LIBRARY_PATH.
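A quick way to check that (assuming a stock OpenSSH sshd on the MIC: AcceptEnv is the server-side whitelist of client-supplied variables, SendEnv the client-side one):

    $ ssh mic1 grep -iE '^(AcceptEnv|PermitUserEnvironment)' /etc/ssh/sshd_config
    $ grep -i '^SendEnv' /etc/ssh/ssh_config ~/.ssh/config 2>/dev/null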

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:
Hmmm…certainly looks that way. I’ll investigate.

On Apr 14, 2015, at 6:06 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Still no happiness... It looks like my LD_LIBRARY_PATH just isn't getting propagated?

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
        linux-vdso.so.1 =>  (0x00007fffa1d3b000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002ab6ce464000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002ab6ce7d3000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ab6cebbd000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ab6ceded000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ab6ceff1000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ab6cf1f9000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab6cf3fc000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab6cf60f000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ab6cf82c000)
        libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002ab6cfb84000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002ab6cffd6000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002ab6d086f000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002ab6d0a82000)
        /lib64/ld-linux-k1om.so.2 (0x00002ab6ce243000)

$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose 5 --mca memheap_base_verbose 100 --leave-session-attached --mca mca_component_show_load_errors 1 $PWD/mic.out
--------------------------------------------------------------------------
A deprecated MCA variable value was specified in the environment or
on the command line.  Deprecated MCA variables should be avoided;
they may disappear in future releases.

  Deprecated variable: mca_component_show_load_errors
  New variable: mca_base_component_show_load_errors
--------------------------------------------------------------------------
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [rsh]
[atl1-02-mic0:16183] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [isolated]
[atl1-02-mic0:16183] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[atl1-02-mic0:16183] mca:base:select:(  plm) Querying component [slurm]
[atl1-02-mic0:16183] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[atl1-02-mic0:16183] mca:base:select:(  plm) Selected component [rsh]
[atl1-02-mic0:16183] plm:base:set_hnp_name: initial bias 16183 nodename hash 4238360777
[atl1-02-mic0:16183] plm:base:set_hnp_name: final jobfam 33630
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive start comm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_job
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm creating map
[atl1-02-mic0:16183] [[33630,0],0] setup:vm: working unmanaged allocation
[atl1-02-mic0:16183] [[33630,0],0] using dash_host
[atl1-02-mic0:16183] [[33630,0],0] checking node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm add new daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:base:setup_vm assigning new daemon [[33630,0],1] to node mic1
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: launching vm
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: local shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: assuming same remote shell as local shell
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: remote shell: 0 (bash)
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a child of mine
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to launch list
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh mic1 PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/ariebs/mic/mpi-nightly/bin/orted -mca orte_leave_session_attached "1" --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid "2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca orte_hnp_uri "2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1" --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca plm_base_verbose "5" --mca memheap_base_verbose "100" --mca mca_component_show_load_errors "1" -mca plm "rsh" -mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
[atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127
[atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are
  required (e.g., on Cray). Please check your configure cmd line and
  consider using one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm


On 04/13/2015 07:47 PM, Ralph Castain wrote:
Weird. I’m not sure what to try at that point - IIRC, building static won’t resolve this problem (but you could try and see). You could add the following to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and see where it is looking for libimf, since it (and not mic.out) is the one complaining.
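For example (plain ssh + ldd, nothing Open MPI specific), running ldd through a non-interactive ssh session shows the environment orted actually gets at launch time:

    $ ssh mic1 ldd /home/ariebs/mic/mpi-nightly/bin/orted | grep libimf
    $ ssh mic1 'echo $LD_LIBRARY_PATH'

If libimf.so resolves in an interactive login but comes back "not found" here, the Intel paths are only being set by login-shell startup files.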


On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com> wrote:

Ralph and Nathan,

The problem may be something trivial, as I don't typically use "shmemrun" to start jobs. With the following, I *think* I've demonstrated that the problem library is where it belongs on the remote system:

$ ldd mic.out
        linux-vdso.so.1 => (0x00007fffb83ff000)
        liboshmem.so.0 => /home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x00002b059cfbb000)
        libmpi.so.0 => /home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x00002b059d35a000)
        libopen-rte.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0 (0x00002b059d7e3000)
        libopen-pal.so.0 => /home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0 (0x00002b059db53000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b059df3d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b059e16c000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002b059e371000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b059e574000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b059e786000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b059e9a4000)
        librt.so.1 => /lib64/librt.so.1 (0x00002b059ecfc000)
        libimf.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so (0x00002b059ef04000)
        libsvml.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so (0x00002b059f356000)
        libirng.so => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so (0x00002b059fbef000)
        libintlc.so.5 => /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5 (0x00002b059fe02000)
        /lib64/ld-linux-k1om.so.2 (0x00002b059cd9a000)
$ echo $LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
...


On 04/13/2015 04:25 PM, Nathan Hjelm wrote:
For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.
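For example, the failing command line quoted below with just the btl list swapped:

    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda --mca btl scif,self $PWD/mic.out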

That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:
    Progress!  I can run my trivial program on the local PHI, but not on the
    other PHI in the system. Here are the interesting parts:

    A pretty good recipe with last night's nightly master:

    $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
         CC="icc -mmic" CXX="icpc -mmic" \
         --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
         AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
         --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
         --enable-orterun-prefix-by-default \
         --enable-debug
    $ make && make install
    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
    yoda --mca btl sm,self,tcp $PWD/mic.out
    Hello World from process 0 of 2
    Hello World from process 1 of 2
    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml
    yoda --mca btl openib,sm,self $PWD/mic.out
    Hello World from process 0 of 2
    Hello World from process 1 of 2
    $

    However, I can't seem to cross the fabric, even though I can ssh freely back
    and forth between mic0 and mic1. Running the next two tests from mic0, it
    certainly seems like the second one should work, too:

    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda
    --mca btl sm,self,tcp $PWD/mic.out
    Hello World from process 0 of 2
    Hello World from process 1 of 2
    $ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic1 -N 2 --mca spml yoda
    --mca btl sm,self,tcp $PWD/mic.out
    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
    libraries: libimf.so: cannot open shared object file: No such file or
    directory
    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
    (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to
    use.

    *  compilation of the orted with dynamic libraries when static are
    required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
     ...
    $

    (Note that I get the same results with "--mca btl openib,sm,self"....)

    $ ssh mic1 file
    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
    /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so: ELF
    64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om), version 1
    (SYSV), dynamically linked, not stripped
    $ shmemrun -x
    LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
    libraries: libimf.so: cannot open shared object file: No such file or
    directory
    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
    (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to
    use.

    *  compilation of the orted with dynamic libraries when static are
    required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).

    Following here is
    - IB information
    - Running the failing case with lots of debugging information. (As you
    might imagine, I've tried 17 ways from Sunday to try to ensure that
    libimf.so is found.)

    $ ibv_devices
        device                 node GUID
        ------              ----------------
        mlx4_0              24be05ffffa57160
        scif0               4c79bafffe4402b6
    $ ibv_devinfo
    hca_id: mlx4_0
            transport:                      InfiniBand (0)
            fw_ver:                         2.11.1250
            node_guid:                      24be:05ff:ffa5:7160
            sys_image_guid:                 24be:05ff:ffa5:7163
            vendor_id:                      0x02c9
            vendor_part_id:                 4099
            hw_ver:                         0x0
            phys_port_cnt:                  2
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                2048 (4)
                            active_mtu:             2048 (4)
                            sm_lid:                 8
                            port_lid:               86
                            port_lmc:               0x00
                            link_layer:             InfiniBand

                    port:   2
                            state:                  PORT_DOWN (1)
                            max_mtu:                2048 (4)
                            active_mtu:             2048 (4)
                            sm_lid:                 0
                            port_lid:               0
                            port_lmc:               0x00
                            link_layer:             InfiniBand

    hca_id: scif0
            transport:                      SCIF (2)
            fw_ver:                         0.0.1
            node_guid:                      4c79:baff:fe44:02b6
            sys_image_guid:                 4c79:baff:fe44:02b6
            vendor_id:                      0x8086
            vendor_part_id:                 0
            hw_ver:                         0x1
            phys_port_cnt:                  1
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                4096 (5)
                            active_mtu:             4096 (5)
                            sm_lid:                 1
                            port_lid:               1001
                            port_lmc:               0x00
                            link_layer:             SCIF

    $ shmemrun -x
    LD_PRELOAD=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
    -H mic1 -N 2 --mca spml yoda --mca btl sm,self,tcp --mca plm_base_verbose
    5 --mca memheap_base_verbose 100 $PWD/mic.out
    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [rsh]
    [atl1-01-mic0:191024] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
    rsh path NULL
    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component [rsh] set
    priority to 10
    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component
    [isolated]
    [atl1-01-mic0:191024] mca:base:select:(  plm) Query of component
    [isolated] set priority to 0
    [atl1-01-mic0:191024] mca:base:select:(  plm) Querying component [slurm]
    [atl1-01-mic0:191024] mca:base:select:(  plm) Skipping component [slurm].
    Query failed to return a module
    [atl1-01-mic0:191024] mca:base:select:(  plm) Selected component [rsh]
    [atl1-01-mic0:191024] plm:base:set_hnp_name: initial bias 191024 nodename
    hash 4121194178
    [atl1-01-mic0:191024] plm:base:set_hnp_name: final jobfam 29012
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh_setup on agent ssh : rsh path
    NULL
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive start comm
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_job
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm creating map
    [atl1-01-mic0:191024] [[29012,0],0] setup:vm: working unmanaged allocation
    [atl1-01-mic0:191024] [[29012,0],0] using dash_host
    [atl1-01-mic0:191024] [[29012,0],0] checking node mic1
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm add new daemon
    [[29012,0],1]
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:setup_vm assigning new daemon
    [[29012,0],1] to node mic1
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: launching vm
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: local shell: 0 (bash)
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: assuming same remote shell as
    local shell
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: remote shell: 0 (bash)
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: final template argv:
            /usr/bin/ssh <template>
    PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
    LD_LIBRARY_PATH ;
    DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
    orte_ess_jobid "1901330432" -mca orte_ess_vpid "<template>" -mca
    orte_ess_num_procs "2" -mca orte_hnp_uri
    
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
    rmaps_ppr_n_pernode "2"
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a child of
    mine
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to launch
    list
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of daemon
    [[29012,0],1]
    [atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing: (/usr/bin/ssh)
    [/usr/bin/ssh mic1     PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ;
    export PATH ;
    LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ; export
    LD_LIBRARY_PATH ;
    DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH ;
    export DYLD_LIBRARY_PATH ;   /home/ariebs/mic/mpi-nightly/bin/orted
    --hnp-topo-sig 0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca
    orte_ess_jobid "1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs
    "2" -mca orte_hnp_uri
    
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
    --tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
    plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh" -mca
    rmaps_ppr_n_pernode "2"]
    /home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
    libraries: libimf.so: cannot open shared object file: No such file or
    directory
    [atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending orted_exit
    commands
    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp
    (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to
    use.

    *  compilation of the orted with dynamic libraries when static are
    required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
    --------------------------------------------------------------------------
    [atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm

    On 04/13/2015 08:50 AM, Andy Riebs wrote:

      Hi Ralph,

      Here are the results with last night's "master" nightly,
      openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose
      option (yes, it looks like the "ERROR_LOG" problem has gone away):

      $ cat /proc/sys/kernel/shmmax
      33554432
      $ cat /proc/sys/kernel/shmall
      2097152
      $ cat /proc/sys/kernel/shmmni
      4096
      $ export SHMEM_SYMMETRIC_HEAP=1M
      $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca plm_base_verbose 5
      --mca memheap_base_verbose 100 $PWD/mic.out
      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [rsh]
      [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
      rsh path NULL
      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component [rsh]
      set priority to 10
      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
      [isolated]
      [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
      [isolated] set priority to 0
      [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component [slurm]
      [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
      [slurm]. Query failed to return a module
      [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component [rsh]
      [atl1-01-mic0:190439] plm:base:set_hnp_name: initial bias 190439
      nodename hash 4121194178
      [atl1-01-mic0:190439] plm:base:set_hnp_name: final jobfam 31875
      [atl1-01-mic0:190439] [[31875,0],0] plm:rsh_setup on agent ssh : rsh
      path NULL
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive start comm
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_job
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm creating map
      [atl1-01-mic0:190439] [[31875,0],0] setup:vm: working unmanaged
      allocation
      [atl1-01-mic0:190439] [[31875,0],0] using dash_host
      [atl1-01-mic0:190439] [[31875,0],0] checking node atl1-01-mic0
      [atl1-01-mic0:190439] [[31875,0],0] ignoring myself
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:setup_vm only HNP in
      allocation
      [atl1-01-mic0:190439] [[31875,0],0] complete_setup on job [31875,1]
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch_apps for job
      [31875,1]
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch wiring up iof for
      job [31875,1]
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch [31875,1] registered
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:launch job [31875,1] is not
      a dynamic spawn
      [atl1-01-mic0:190441] mca: base: components_register: registering
      memheap components
      [atl1-01-mic0:190441] mca: base: components_register: found loaded
      component buddy
      [atl1-01-mic0:190441] mca: base: components_register: component buddy
      has no register or open function
      [atl1-01-mic0:190442] mca: base: components_register: registering
      memheap components
      [atl1-01-mic0:190442] mca: base: components_register: found loaded
      component buddy
      [atl1-01-mic0:190442] mca: base: components_register: component buddy
      has no register or open function
      [atl1-01-mic0:190442] mca: base: components_register: found loaded
      component ptmalloc
      [atl1-01-mic0:190442] mca: base: components_register: component ptmalloc
      has no register or open function
      [atl1-01-mic0:190441] mca: base: components_register: found loaded
      component ptmalloc
      [atl1-01-mic0:190441] mca: base: components_register: component ptmalloc
      has no register or open function
      [atl1-01-mic0:190441] mca: base: components_open: opening memheap
      components
      [atl1-01-mic0:190441] mca: base: components_open: found loaded component
      buddy
      [atl1-01-mic0:190441] mca: base: components_open: component buddy open
      function successful
      [atl1-01-mic0:190441] mca: base: components_open: found loaded component
      ptmalloc
      [atl1-01-mic0:190441] mca: base: components_open: component ptmalloc
      open function successful
      [atl1-01-mic0:190442] mca: base: components_open: opening memheap
      components
      [atl1-01-mic0:190442] mca: base: components_open: found loaded component
      buddy
      [atl1-01-mic0:190442] mca: base: components_open: component buddy open
      function successful
      [atl1-01-mic0:190442] mca: base: components_open: found loaded component
      ptmalloc
      [atl1-01-mic0:190442] mca: base: components_open: component ptmalloc
      open function successful
      [atl1-01-mic0:190442] base/memheap_base_alloc.c:38 -
      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
      segments by method: 1
      [atl1-01-mic0:190441] base/memheap_base_alloc.c:38 -
      mca_memheap_base_alloc_init() Memheap alloc memory: 270532608 byte(s), 1
      segments by method: 1
      [atl1-01-mic0:190442] base/memheap_base_static.c:205 - _load_segments()
      add: 00600000-00601000 rw-p 00000000 00:11
      6029314                            /home/ariebs/bench/hello/mic.out
      [atl1-01-mic0:190441] base/memheap_base_static.c:205 - _load_segments()
      add: 00600000-00601000 rw-p 00000000 00:11
      6029314                            /home/ariebs/bench/hello/mic.out
      [atl1-01-mic0:190442] base/memheap_base_static.c:75 -
      mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
      segments
      [atl1-01-mic0:190442] base/memheap_base_register.c:39 -
      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
      270532608 bytes type=0x1 id=0xFFFFFFFF
      [atl1-01-mic0:190441] base/memheap_base_static.c:75 -
      mca_memheap_base_static_init() Memheap static memory: 3824 byte(s), 2
      segments
      [atl1-01-mic0:190441] base/memheap_base_register.c:39 -
      mca_memheap_base_reg() register seg#00: 0x0xff000000 - 0x0x10f200000
      270532608 bytes type=0x1 id=0xFFFFFFFF
      [atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
      _reg_segment() Failed to register segment
      [atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
      _reg_segment() Failed to register segment
      [atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
      failed to initialize - aborting
      [atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
      failed to initialize - aborting
      --------------------------------------------------------------------------
      It looks like SHMEM_INIT failed for some reason; your parallel process
      is
      likely to abort.  There are many reasons that a parallel process can
      fail during SHMEM_INIT; some of which are due to configuration or
      environment
      problems.  This failure appears to be an internal failure; here's some
      additional information (which may only be relevant to an Open SHMEM
      developer):

        mca_memheap_base_select() failed
        --> Returned "Error" (-1) instead of "Success" (0)
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0) with
      errorcode -1.
      --------------------------------------------------------------------------
      --------------------------------------------------------------------------
      A SHMEM process is aborting at a time when it cannot guarantee that all
      of its peer processes in the job will be killed properly.  You should
      double check that everything has shut down cleanly.

      Local host: atl1-01-mic0
      PID:        190441
      --------------------------------------------------------------------------
      -------------------------------------------------------
      Primary job  terminated normally, but 1 process returned
      a non-zero exit code.. Per user-direction, the job has been aborted.
      -------------------------------------------------------
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
      orted_exit commands
      --------------------------------------------------------------------------
      shmemrun detected that one or more processes exited with non-zero
      status, thus causing
      the job to be terminated. The first process to do so was:

        Process name: [[31875,1],0]
        Exit code:    255
      --------------------------------------------------------------------------
      [atl1-01-mic0:190439] 1 more process has sent help message
      help-shmem-runtime.txt / shmem_init:startup:internal-failure
      [atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate" to 0
      to see all help / error messages
      [atl1-01-mic0:190439] 1 more process has sent help message
      help-shmem-api.txt / shmem-abort
      [atl1-01-mic0:190439] 1 more process has sent help message
      help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all killed
      [atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm

      On 04/12/2015 03:09 PM, Ralph Castain wrote:

        Sorry about that - I hadn't brought it over to the 1.8 branch yet.
        I've done so now, which means the ERROR_LOG shouldn't show up any
        more. It won't fix the memheap problem, though.
        You might try adding "--mca memheap_base_verbose 100" to your cmd line
        so we can see why none of the memheap components are being selected.

          On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:
          Hi Ralph,

          Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

          $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
          plm_base_verbose 5 $PWD/mic.out
          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
          [rsh]
          [atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup on agent
          ssh : rsh path NULL
          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
          [rsh] set priority to 10
          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
          [isolated]
          [atl1-01-mic0:190189] mca:base:select:(  plm) Query of component
          [isolated] set priority to 0
          [atl1-01-mic0:190189] mca:base:select:(  plm) Querying component
          [slurm]
          [atl1-01-mic0:190189] mca:base:select:(  plm) Skipping component
          [slurm]. Query failed to return a module
          [atl1-01-mic0:190189] mca:base:select:(  plm) Selected component
          [rsh]
          [atl1-01-mic0:190189] plm:base:set_hnp_name: initial bias 190189
          nodename hash 4121194178
          [atl1-01-mic0:190189] plm:base:set_hnp_name: final jobfam 32137
          [atl1-01-mic0:190189] [[32137,0],0] plm:rsh_setup on agent ssh : rsh
          path NULL
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive start comm
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_job
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm creating map
          [atl1-01-mic0:190189] [[32137,0],0] setup:vm: working unmanaged
          allocation
          [atl1-01-mic0:190189] [[32137,0],0] using dash_host
          [atl1-01-mic0:190189] [[32137,0],0] checking node atl1-01-mic0
          [atl1-01-mic0:190189] [[32137,0],0] ignoring myself
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:setup_vm only HNP in
          allocation
          [atl1-01-mic0:190189] [[32137,0],0] complete_setup on job [32137,1]
          [atl1-01-mic0:190189] [[32137,0],0] ORTE_ERROR_LOG: Not found in
          file base/plm_base_launch_support.c at line 440
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch_apps for job
          [32137,1]
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch wiring up iof
          for job [32137,1]
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch [32137,1]
          registered
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:launch job [32137,1] is
          not a dynamic spawn
          
--------------------------------------------------------------------------
          It looks like SHMEM_INIT failed for some reason; your parallel
          process is
          likely to abort.  There are many reasons that a parallel process can
          fail during SHMEM_INIT; some of which are due to configuration or
          environment
          problems.  This failure appears to be an internal failure; here's
          some
          additional information (which may only be relevant to an Open SHMEM
          developer):

            mca_memheap_base_select() failed
            --> Returned "Error" (-1) instead of "Success" (0)
          
--------------------------------------------------------------------------
          [atl1-01-mic0:190191] Error: pshmem_init.c:61 - shmem_init() SHMEM
          failed to initialize - aborting
          [atl1-01-mic0:190192] Error: pshmem_init.c:61 - shmem_init() SHMEM
          failed to initialize - aborting
          
--------------------------------------------------------------------------
          SHMEM_ABORT was invoked on rank 1 (pid 190192, host=atl1-01-mic0)
          with errorcode -1.
          
--------------------------------------------------------------------------
          
--------------------------------------------------------------------------
          A SHMEM process is aborting at a time when it cannot guarantee that
          all
          of its peer processes in the job will be killed properly.  You
          should
          double check that everything has shut down cleanly.

          Local host: atl1-01-mic0
          PID:        190192
          
--------------------------------------------------------------------------
          -------------------------------------------------------
          Primary job  terminated normally, but 1 process returned
          a non-zero exit code.. Per user-direction, the job has been aborted.
          -------------------------------------------------------
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:orted_cmd sending
          orted_exit commands
          
--------------------------------------------------------------------------
          shmemrun detected that one or more processes exited with non-zero
          status, thus causing
          the job to be terminated. The first process to do so was:

            Process name: [[32137,1],0]
            Exit code:    255
          
--------------------------------------------------------------------------
          [atl1-01-mic0:190189] 1 more process has sent help message
          help-shmem-runtime.txt / shmem_init:startup:internal-failure
          [atl1-01-mic0:190189] Set MCA parameter "orte_base_help_aggregate"
          to 0 to see all help / error messages
          [atl1-01-mic0:190189] 1 more process has sent help message
          help-shmem-api.txt / shmem-abort
          [atl1-01-mic0:190189] 1 more process has sent help message
          help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
          killed
          [atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm

          On 04/11/2015 07:41 PM, Ralph Castain wrote:

            Got it - thanks. I fixed that ERROR_LOG issue (I think - please
            verify). I suspect the memheap issue relates to something else,
            but I probably need to let the OSHMEM folks comment on it.

              On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:
              Everything is built on the Xeon side, with the icc "-mmic"
              switch. I then ssh into one of the PHIs, and run shmemrun from
              there.

              On 04/11/2015 12:00 PM, Ralph Castain wrote:

                Let me try to understand the setup a little better. Are you
                running shmemrun on the PHI itself? Or is it running on the
                host processor, and you are trying to spawn a process onto the
                Phi?

                  On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
                  Hi Ralph,

                  Yes, this is attempting to get OSHMEM to run on the Phi.

                  I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured
                  it with

                   $ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
                       CC="icc -mmic" CXX="icpc -mmic" \
                       --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
                       AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
                       LD=x86_64-k1om-linux-ld \
                       --enable-mpirun-prefix-by-default --disable-io-romio \
                       --disable-mpi-fortran \
                       --enable-debug \
                       --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

                  (Note that I had to add "oob-ud" to the
                  "--enable-mca-no-build" option, as the build complained that
                  mca oob/ud needed mca common-verbs.)

                  With that configuration, here is what I am seeing now...

                  $ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
                  $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
                  plm_base_verbose 5 $PWD/mic.out
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                  component [rsh]
                  [atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on
                  agent ssh : rsh path NULL
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
                  component [rsh] set priority to 10
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                  component [isolated]
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Query of
                  component [isolated] set priority to 0
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Querying
                  component [slurm]
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
                  component [slurm]. Query failed to return a module
                  [atl1-01-mic0:189895] mca:base:select:(  plm) Selected
                  component [rsh]
                  [atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias
                  189895 nodename hash 4121194178
                  [atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam
                  32419
                  [atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent
                  ssh : rsh path NULL
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive start
                  comm
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
                  creating map
                  [atl1-01-mic0:189895] [[32419,0],0] setup:vm: working
                  unmanaged allocation
                  [atl1-01-mic0:189895] [[32419,0],0] using dash_host
                  [atl1-01-mic0:189895] [[32419,0],0] checking node
                  atl1-01-mic0
                  [atl1-01-mic0:189895] [[32419,0],0] ignoring myself
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm only
                  HNP in allocation
                  [atl1-01-mic0:189895] [[32419,0],0] complete_setup on job
                  [32419,1]
                  [atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not
                  found in file base/plm_base_launch_support.c at line 440
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps for
                  job [32419,1]
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch wiring
                  up iof for job [32419,1]
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
                  [32419,1] registered
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job
                  [32419,1] is not a dynamic spawn
                  [atl1-01-mic0:189899] Error: pshmem_init.c:61 - shmem_init()
                  SHMEM failed to initialize - aborting
                  [atl1-01-mic0:189898] Error: pshmem_init.c:61 - shmem_init()
                  SHMEM failed to initialize - aborting
                  
--------------------------------------------------------------------------
                  It looks like SHMEM_INIT failed for some reason; your
                  parallel process is
                  likely to abort.  There are many reasons that a parallel
                  process can
                  fail during SHMEM_INIT; some of which are due to
                  configuration or environment
                  problems.  This failure appears to be an internal failure;
                  here's some
                  additional information (which may only be relevant to an
                  Open SHMEM
                  developer):

                    mca_memheap_base_select() failed
                    --> Returned "Error" (-1) instead of "Success" (0)
                  
--------------------------------------------------------------------------
                  
--------------------------------------------------------------------------
                  SHMEM_ABORT was invoked on rank 1 (pid 189899,
                  host=atl1-01-mic0) with errorcode -1.
                  
--------------------------------------------------------------------------
                  
--------------------------------------------------------------------------
                  A SHMEM process is aborting at a time when it cannot
                  guarantee that all
                  of its peer processes in the job will be killed properly.
                  You should
                  double check that everything has shut down cleanly.

                  Local host: atl1-01-mic0
                  PID:        189899
                  
--------------------------------------------------------------------------
                  -------------------------------------------------------
                  Primary job  terminated normally, but 1 process returned
                  a non-zero exit code.. Per user-direction, the job has been
                  aborted.
                  -------------------------------------------------------
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:orted_cmd
                  sending orted_exit commands
                  
--------------------------------------------------------------------------
                  shmemrun detected that one or more processes exited with
                  non-zero status, thus causing
                  the job to be terminated. The first process to do so was:

                    Process name: [[32419,1],1]
                    Exit code:    255
                  
--------------------------------------------------------------------------
                  [atl1-01-mic0:189895] 1 more process has sent help message
                  help-shmem-runtime.txt / shmem_init:startup:internal-failure
                  [atl1-01-mic0:189895] Set MCA parameter
                  "orte_base_help_aggregate" to 0 to see all help / error
                  messages
                  [atl1-01-mic0:189895] 1 more process has sent help message
                  help-shmem-api.txt / shmem-abort
                  [atl1-01-mic0:189895] 1 more process has sent help message
                  help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee
                  all killed
                  [atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop
                  comm

                  On 04/10/2015 06:37 PM, Ralph Castain wrote:

                    Andy - could you please try the current 1.8.5 nightly
                    tarball and see if it helps? The error log indicates that
                    it is failing to get the topology from some daemon, I'm
                    assuming the one on the Phi?
                    You might also add --enable-debug to that configure line
                    and then put -mca plm_base_verbose on the shmemrun cmd to
                    get more help.

                      On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:
                      Summary: MPI jobs work fine, SHMEM jobs work just often
                      enough to be tantalizing, on an Intel Xeon Phi/MIC
                      system.

                      Longer version

                      Thanks to the excellent write-up last June
                      
(<https://www.open-mpi.org/community/lists/users/2014/06/24711.php>),
                      I have been able to build a version of Open MPI for the
                      Xeon Phi coprocessor that runs MPI jobs on the Phi
                      coprocessor with no problem, but not SHMEM jobs.  Just
                      at the point where I was about to document the problems
                      I was having with SHMEM, my trivial SHMEM job worked.
                      And then failed when I tried to run it again,
                      immediately afterwards. I have a feeling I may be in
                      uncharted  territory here.

                      Environment
                        * RHEL 6.5
                        * Intel Composer XE 2015
                        * Xeon Phi/MIC
                      ----------------

                      Configuration

                      $ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
                      $ source
                      /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
                      intel64
                       $ ./configure --prefix=/home/ariebs/mic/mpi \
                          CC="icc -mmic" CXX="icpc -mmic" \
                          --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
                          AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
                          LD=x86_64-k1om-linux-ld \
                          --enable-mpirun-prefix-by-default --disable-io-romio \
                          --disable-vt --disable-mpi-fortran \
                          --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
                      $ make
                      $ make install

                      ----------------

                      Test program

                      #include <stdio.h>
                      #include <stdlib.h>
                      #include <shmem.h>
                      int main(int argc, char **argv)
                      {
                              int me, num_pe;
                              shmem_init();
                              num_pe = num_pes();
                              me = my_pe();
                              printf("Hello World from process %ld of %ld\n",
                      me, num_pe);
                              exit(0);
                      }

                      ----------------

                      Building the program

                      export PATH=/home/ariebs/mic/mpi/bin:$PATH
                      export PATH=/usr/linux-k1om-4.7/bin/:$PATH
                      source
                      /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh
                      intel64
                       export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

                       icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
                               -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
                               -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
                               -lm -ldl -lutil \
                               -Wl,-rpath -Wl,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
                               -L/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic \
                               -o mic.out shmem_hello.c

                      ----------------

                      Running the program

                      (Note that the program had been consistently failing.
                      Then, when I logged back into the system to capture the
                      results, it worked once,  and then immediately failed
                      when I tried again, as shown below. Logging in and out
                      isn't sufficient to correct the problem. Overall, I
                      think I had 3 successful runs in 30-40 attempts.)

                      $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
                      [atl1-01-mic0:189372] [[30936,0],0] ORTE_ERROR_LOG: Not
                      found in file base/plm_base_launch_support.c at line 426
                      Hello World from process 0 of 2
                      Hello World from process 1 of 2
                      $ shmemrun -H localhost -N 2 --mca sshmem mmap ./mic.out
                      [atl1-01-mic0:189381] [[30881,0],0] ORTE_ERROR_LOG: Not
                      found in file base/plm_base_launch_support.c at line 426
                      [atl1-01-mic0:189383] Error: pshmem_init.c:61 -
                      shmem_init() SHMEM failed to initialize - aborting
                      
--------------------------------------------------------------------------
                      It looks like SHMEM_INIT failed for some reason; your
                      parallel process is
                      likely to abort.  There are many reasons that a parallel
                      process can
                      fail during SHMEM_INIT; some of which are due to
                      configuration or environment
                      problems.  This failure appears to be an internal
                      failure; here's some
                      additional information (which may only be relevant to an
                      Open SHMEM
                      developer):

                        mca_memheap_base_select() failed
                        --> Returned "Error" (-1) instead of "Success" (0)
                      
--------------------------------------------------------------------------
                      
--------------------------------------------------------------------------
                      SHMEM_ABORT was invoked on rank 0 (pid 189383,
                      host=atl1-01-mic0) with errorcode -1.
                      
--------------------------------------------------------------------------
                      
--------------------------------------------------------------------------
                      A SHMEM process is aborting at a time when it cannot
                      guarantee that all
                      of its peer processes in the job will be killed
                      properly.  You should
                      double check that everything has shut down cleanly.

                      Local host: atl1-01-mic0
                      PID:        189383
                      
--------------------------------------------------------------------------
                      -------------------------------------------------------
                      Primary job  terminated normally, but 1 process returned
                      a non-zero exit code.. Per user-direction, the job has
                      been aborted.
                      -------------------------------------------------------
                      
--------------------------------------------------------------------------
                      shmemrun detected that one or more processes exited with
                      non-zero status, thus causing
                      the job to be terminated. The first process to do so
                      was:

                        Process name: [[30881,1],0]
                        Exit code:    255
                      
--------------------------------------------------------------------------

                      Any thoughts about where to go from here?

                      Andy

  --
  Andy Riebs
  Hewlett-Packard Company
  High Performance Computing
  +1 404 648 9024
  My opinions are not necessarily those of HP
