[OMPI users] Using PLFS with Open MPI 1.8

2014-07-28 Thread Andy Riebs
Has anyone found the magic to apply the traditional PLFS 
ompi-1.7.x-plfs-prep.patch to the current version of Open MPI? It looks 
like it shouldn't take too much effort to update the patch, but it would 
be even better to learn that someone else has already made that available!
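
A rough sketch of how one might test the old patch against a 1.8 tree
(the -p level and relative paths are assumptions, not a verified
recipe; hunks that fail land in *.rej files and need hand-editing):

    cd openmpi-1.8.x
    # Dry run first to see which hunks still apply cleanly
    patch -p1 --dry-run < ../ompi-1.7.x-plfs-prep.patch
    # Apply for real, then look for rejected hunks to update by hand
    patch -p1 < ../ompi-1.7.x-plfs-prep.patch
    find . -name '*.rej'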


Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP



[OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-10 Thread Andy Riebs
n-zero exit code.. Per user-direction, the job has been aborted.
---
--
shmemrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30881,1],0]
  Exit code:    255
--

Any thoughts about where to go from here?

Andy

-- 
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP

  



Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-11 Thread Andy Riebs
,0],0] plm:base:orted_cmd sending
orted_exit commands
--
shmemrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32419,1],1]
  Exit code:    255
--
[atl1-01-mic0:189895] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:189895] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[atl1-01-mic0:189895] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[atl1-01-mic0:189895] 1 more process has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
killed
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive stop comm




On 04/10/2015 06:37 PM, Ralph Castain wrote:

Andy - could you please try the current 1.8.5 nightly tarball and see
if it helps? The error log indicates that it is failing to get the
topology from some daemon, I’m assuming the one on the Phi?

You might also add --enable-debug to that configure line and then put
-mca plm_base_verbose on the shmemrun cmd to get more help.
On Apr 10, 2015, at 11:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Summary: MPI jobs work fine, SHMEM jobs work just often enough to be
tantalizing, on an Intel Xeon Phi/MIC system.

Longer version

Thanks to the excellent write-up last June (), I have been able to
build a version of Open MPI for the Xeon Phi coprocessor that runs MPI
jobs on the Phi coprocessor with no problem, but not SHMEM jobs. Just
at the point where I was about to document the problems I was having
with SHMEM, my trivial SHMEM job worked. And then failed when I tried
to run it again, immediately afterwards. I have a feeling I may be in
uncharted territory here.

Environment

  RHEL 6.5
  Intel Composer XE 2015
  Xeon Phi/MIC




Configuration

$ export PATH=/usr/linux-k1om-4.7/bin/:$PATH
$ source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
$ ./configure --prefix=/home/ariebs/mic/mpi \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib \
    LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio \
    --disable-vt --disable-mpi-fortran \
    --enable-mca-no-build=btl-usnic,btl-openib,common-verbs
$ make
$ make install



Test program

#include <stdio.h>
#include <stdlib.h>
#include <shmem.h>

int main(int argc, char **argv)
{
    int me, num_pe;
    shmem_init();
    num_pe = num_pes();   /* number of PEs in the job */
    me = my_pe();         /* this PE's rank */
    printf("Hello World from process %d of %d\n", me, num_pe);
    exit(0);
}



Building the program

export PATH=/home/ariebs/mic/mpi/bin:$PATH
export PATH=/usr/linux-k1om-4.7/bin/:$PATH
source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH

icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include -pthread \
    -Wl,-rpath -Wl,/home/ariebs/mic/mpi/lib -Wl,--enable-new-dtags \
    -L/home/ariebs/mic/mpi/lib -loshmem -lmpi -lopen-rte -lopen-pal \
    -lm -ldl -lutil \

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-11 Thread Andy Riebs

Everything is built on the Xeon side, with the icc "-mmic" switch. I
then ssh into one of the PHIs, and run shmemrun from there.
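
In outline, that workflow looks something like the following (paths
taken from the build recipes elsewhere in this thread; the card name
"mic0", the source file name "hello.c", and the assumption that the
install prefix is visible from the card are illustrative only):

    # On the Xeon host: cross-compile for the coprocessor
    source /opt/intel/15.0/composer_xe_2015/bin/compilervars.sh intel64
    export PATH=/usr/linux-k1om-4.7/bin/:$PATH
    icc -mmic -std=gnu99 -I/home/ariebs/mic/mpi/include \
        -L/home/ariebs/mic/mpi/lib -Wl,-rpath,/home/ariebs/mic/mpi/lib \
        -o mic.out hello.c -loshmem -lmpi

    # On the card (after "ssh mic0"), with the install prefix on PATH:
    export PATH=/home/ariebs/mic/mpi/bin:$PATH
    shmemrun -H localhost -N 2 $PWD/mic.out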


On 04/11/2015 12:00 PM, Ralph Castain wrote:

Let me try to understand the setup a little better. Are you running
shmemrun on the PHI itself? Or is it running on the host processor,
and you are trying to spawn a process onto the Phi?

On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:


Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.

I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build"
option, as the build complained that mca oob/ud needed mca
common-verbs.)

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying
component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup
on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of
component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying
component [isolated]
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of
component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying
component [slurm]
[atl1-01-mic0:189895] mca:base:select:(  plm) Skipping
component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:(  plm) Selected
component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial
bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final
jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on
agent ssh : rsh path NULL
[atl1-01-mic0:189895] [[32419,0],0] plm:base:receive
start comm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_job
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
creating map
[atl1-01-mic0:189895] [[32419,0],0] setup:vm: working
unmanaged allocation
[atl1-01-mic0:189895] [[32419,0],0] using dash_host
[atl1-01-mic0:189895] [[32419,0],0] checking node
atl1-01-mic0
[atl1-01-mic0:189895] [[32419,0],0] ignoring myself
[atl1-01-mic0:189895] [[32419,0],0] plm:base:setup_vm
only HNP in allocation
[atl1-01-mic0:189895] [[32419,0],0] complete_setup on
job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] ORTE_ERROR_LOG: Not
found in file base/plm_base_launch_support.c at line 440
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch_apps
for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
wiring up iof for job [32419,1]
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch
[32419,1] registered
[atl1-01-mic0:189895] [[32419,0],0] plm:base:launch job
[32419,1] is not a dynamic spawn
[atl1-01-mic0:189899] Error: pshmem_init.c:61 -
shmem_init() SHMEM failed to initialize - aborting
[atl1-01-mic0:189898] Error: pshmem_init.c:61 -
shmem_init() SHMEM failed to initialize - aborting
---

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-12 Thread Andy Riebs
rocess has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
killed
[atl1-01-mic0:190189] [[32137,0],0] plm:base:receive stop comm


On 04/11/2015 07:41 PM, Ralph Castain wrote:

Got it - thanks. I fixed that ERROR_LOG issue (I think - please
verify). I suspect the memheap issue relates to something else, but I
probably need to let the OSHMEM folks comment on it.

On Apr 11, 2015, at 9:52 AM, Andy Riebs <andy.ri...@hp.com> wrote:


Everything is built on the Xeon side, with the icc "-mmic" switch. I
then ssh into one of the PHIs, and run shmemrun from there.

On 04/11/2015 12:00 PM, Ralph Castain wrote:

Let me try to understand the setup a little better. Are you running
shmemrun on the PHI itself? Or is it running on the host processor,
and you are trying to spawn a process onto the Phi?

On Apr 11, 2015, at 7:55 AM, Andy Riebs <andy.ri...@hp.com> wrote:

Hi Ralph,

Yes, this is attempting to get OSHMEM to run on the Phi.

I grabbed openmpi-dev-1484-g033418f.tar.bz2 and configured it with

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly \
    CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-debug --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud

(Note that I had to add "oob-ud" to the "--enable-mca-no-build"
option, as the build complained that mca oob/ud needed mca
common-verbs.)

With that configuration, here is what I am seeing now...

$ export SHMEM_SYMMETRIC_HEAP_SIZE=1G
$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [rsh]
[atl1-01-mic0:189895] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [isolated]
[atl1-01-mic0:189895] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[atl1-01-mic0:189895] mca:base:select:(  plm) Querying component [slurm]
[atl1-01-mic0:189895] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[atl1-01-mic0:189895] mca:base:select:(  plm) Selected component [rsh]
[atl1-01-mic0:189895] plm:base:set_hnp_name: initial bias 189895 nodename hash 4121194178
[atl1-01-mic0:189895] plm:base:set_hnp_name: final jobfam 32419
[atl1-01-mic0:189895] [[32419,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[atl1-

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Andy Riebs
riebs/bench/hello/mic.out
[atl1-01-mic0:190441] base/memheap_base_static.c:205 -
_load_segments() add: 0060-00601000 rw-p  00:11
6029314    /home/ariebs/bench/hello/mic.out
[atl1-01-mic0:190442] base/memheap_base_static.c:75 -
mca_memheap_base_static_init() Memheap static memory: 3824 byte(s),
2 segments
[atl1-01-mic0:190442] base/memheap_base_register.c:39 -
mca_memheap_base_reg() register seg#00: 0x0xff00 - 0x0x10f20
270532608 bytes type=0x1 id=0x
[atl1-01-mic0:190441] base/memheap_base_static.c:75 -
mca_memheap_base_static_init() Memheap static memory: 3824 byte(s),
2 segments
[atl1-01-mic0:190441] base/memheap_base_register.c:39 -
mca_memheap_base_reg() register seg#00: 0x0xff00 - 0x0x10f20
270532608 bytes type=0x1 id=0x
[atl1-01-mic0:190442] Error base/memheap_base_register.c:130 -
_reg_segment() Failed to register segment
[atl1-01-mic0:190441] Error base/memheap_base_register.c:130 -
_reg_segment() Failed to register segment
[atl1-01-mic0:190442] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
[atl1-01-mic0:190441] Error: pshmem_init.c:61 - shmem_init() SHMEM
failed to initialize - aborting
--
It looks like SHMEM_INIT failed for some reason; your parallel
process is
likely to abort.  There are many reasons that a parallel process can
fail during SHMEM_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's
some
additional information (which may only be relevant to an Open SHMEM
developer):

  mca_memheap_base_select() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--
--
SHMEM_ABORT was invoked on rank 0 (pid 190441, host=atl1-01-mic0)
with errorcode -1.
--
--
A SHMEM process is aborting at a time when it cannot guarantee that
all
of its peer processes in the job will be killed properly.  You
should
double check that everything has shut down cleanly.

Local host: atl1-01-mic0
PID:    190441
--
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
[atl1-01-mic0:190439] [[31875,0],0] plm:base:orted_cmd sending
orted_exit commands
--
shmemrun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31875,1],0]
  Exit code:    255
--
[atl1-01-mic0:190439] 1 more process has sent help message
help-shmem-runtime.txt / shmem_init:startup:internal-failure
[atl1-01-mic0:190439] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages
[atl1-01-mic0:190439] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[atl1-01-mic0:190439] 1 more process has sent help message
help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
killed
[atl1-01-mic0:190439] [[31875,0],0] plm:base:receive stop comm



On 04/12/2015 03:09 PM, Ralph Castain wrote:

Sorry about that - I hadn’t brought it over to the 1.8 branch yet.
I’ve done so now, which means the ERROR_LOG shouldn’t show up any
more. It won’t fix the memheap problem, though.

You might try adding “--mca memheap_base_verbose 100” to your cmd line
so we can see why none of the memheap components are being selected.

On Apr 12, 2015, at 11:30 AM, Andy Riebs <andy.ri...@hp.com> wrote:


Hi Ralph,

Here's the output with openmpi-v1.8.4-202-gc2da6a5.tar.bz2:

$ shmemrun -H localhost -N 2 --mca sshmem mmap --mca plm_base_verbose 5 $PWD/mic.out
[atl1-01-mic0:190189] mca:base:select:(  plm) Querying component [rsh]
[atl1-01-mic0:190189] [[INVALID],INVALID] plm:rsh_lookup

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Andy Riebs
s/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;  
/home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig
0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid
"1901330432" -mca orte_ess_vpid "" -mca
orte_ess_num_procs "2" -mca orte_hnp_uri
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
--tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "2"
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh:launch daemon 0 not a
child of mine
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: adding node mic1 to
launch list
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: activating launch event
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: recording launch of
daemon [[29012,0],1]
[atl1-01-mic0:191024] [[29012,0],0] plm:rsh: executing:
(/usr/bin/ssh) [/usr/bin/ssh mic1
PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;  
/home/ariebs/mic/mpi-nightly/bin/orted --hnp-topo-sig
0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid
"1901330432" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca
orte_hnp_uri
"1901330432.0;usock;tcp://16.113.180.125,192.0.0.121:34249;ud://2359370.86.1"
--tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
plm_base_verbose "5" --mca memheap_base_verbose "100" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file
or directory
[atl1-01-mic0:191024] [[29012,0],0] daemon 1 failed with status 127
[atl1-01-mic0:191024] [[29012,0],0] plm:base:orted_cmd sending
orted_exit commands
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with
--enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location
to use.

*  compilation of the orted with dynamic libraries when static are
required
  (e.g., on Cray). Please check your configure cmd line and consider
using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--
[atl1-01-mic0:191024] [[29012,0],0] plm:base:receive stop comm



On 04/13/2015 08:50 AM, Andy Riebs wrote:

Hi Ralph,

Here are the results with last night's "master" nightly,
openmpi-dev-1487-g9c6d452.tar.bz2, and adding the memheap_base_verbose
option (yes, it looks like the "ERROR_LOG" problem has gone away):
  
  $ cat /proc/sys/kernel/shmmax
  33554432
  $ cat /proc/sys/kernel/shmall
  2097152
  $ cat /proc/sys/kernel/shmmni
  4096
  $ export SHMEM_SYMMETRIC_HEAP=1M
  $ shmemrun -H localhost -N 2 --mca sshmem mmap  --mca
  plm_base_verbose 5 --mca memheap_base_verbose 100 $PWD/mic.out
  [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
  [rsh]
  [atl1-01-mic0:190439] [[INVALID],INVALID] plm:rsh_lookup on agent
  ssh : rsh path NULL
  [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
  [rsh] set priority to 10
  [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
  [isolated]
  [atl1-01-mic0:190439] mca:base:select:(  plm) Query of component
  [isolated] set priority to 0
  [atl1-01-mic0:190439] mca:base:select:(  plm) Querying component
  [slurm]
  [atl1-01-mic0:190439] mca:base:select:(  plm) Skipping component
  [slurm]. Query failed to return a module
  [atl1-01-mic0:190439] mca:base:select:(  plm) Selected component
  

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-13 Thread Andy Riebs

  
  
Ralph and Nathan,

The problem may be something trivial, as I don't typically use
"shmemrun" to start jobs. With the following, I *think* I've 
demonstrated that the problem library is where it belongs on the
remote system:

$ ldd mic.out
    linux-vdso.so.1 =>  (0x7fffb83ff000)
    liboshmem.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/liboshmem.so.0 (0x2b059cfbb000)
    libmpi.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libmpi.so.0 (0x2b059d35a000)
    libopen-rte.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0
(0x2b059d7e3000)
    libopen-pal.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0
(0x2b059db53000)
    libm.so.6 => /lib64/libm.so.6 (0x2b059df3d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x2b059e16c000)
    libutil.so.1 => /lib64/libutil.so.1 (0x2b059e371000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1
(0x2b059e574000)
    libpthread.so.0 => /lib64/libpthread.so.0
(0x2b059e786000)
    libc.so.6 => /lib64/libc.so.6 (0x2b059e9a4000)
    librt.so.1 => /lib64/librt.so.1 (0x2b059ecfc000)
    libimf.so =>
  /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
(0x2b059ef04000)
    libsvml.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libsvml.so
(0x2b059f356000)
    libirng.so =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libirng.so
(0x2b059fbef000)
    libintlc.so.5 =>
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libintlc.so.5
(0x2b059fe02000)
    /lib64/ld-linux-k1om.so.2 (0x2b059cd9a000)
$ echo $LD_LIBRARY_PATH 
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/../compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/ipp/tools/intel64/perfsys:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/intel64:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/intel64/gcc4.1:/opt/intel/15.0/composer_xe_2015.2.164/debugger/ipt/ia32/lib
$ ssh mic1 file /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so:
ELF 64-bit LSB shared object, Intel Xeon Phi coprocessor (k1om),
version 1 (SYSV), dynamically linked, not stripped
$ shmemrun -H mic1 -N 2 --mca btl scif,self $PWD/mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading
  shared libraries: libimf.so: cannot open shared object file:
No such file or directory
...


On 04/13/2015 04:25 PM, Nathan Hjelm wrote:

For talking between PHIs on the same system I recommend using the scif
BTL NOT tcp.

That said, it looks like the LD_LIBRARY_PATH is wrong on the remote
system. It looks like it can't find the intel compiler libraries.

-Nathan Hjelm
HPC-5, LANL

On Mon, Apr 13, 2015 at 04:06:21PM -0400, Andy Riebs wrote:

Progress!  I can run my trivial program on the local PHI, but not the
other PHI, on the system. Here are the interesting parts:

A pretty good recipe with last night's nightly master:

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic" CXX="icpc -mmic" \
    --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
    AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld \
    --enable-mpirun-prefix-by-default --disable-io-romio --disable-mpi-fortran \
    --enable-orterun-prefix-by-default \
    --enable-debug
$ make && make install
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H localhost -N 2 --mca spml yoda --mca btl openib,sm,self $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$

However, I can't seem to cross the fabric. I can ssh freely back and
forth between mic0 and mic1. However, running the next 2 tests from
mic0, it certainly seems like the second one should work, too:

$ shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -H mic0 -N 2 --mca spml yoda --mca btl sm,self,tcp $PWD/mic.out
Hello World from process 0 of 2
Hello World from process 1 of 2
$ shmemrun -x 

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-14 Thread Andy Riebs
/usr/bin/ssh 
PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;  
/home/ariebs/mic/mpi-nightly/bin/orted -mca
orte_leave_session_attached "1" --hnp-topo-sig
0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid
"2203975680" -mca orte_ess_vpid "" -mca
orte_ess_num_procs "2" -mca orte_hnp_uri
"2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1"
--tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
plm_base_verbose "5" --mca memheap_base_verbose "100" --mca
mca_component_show_load_errors "1" -mca plm "rsh" -mca
rmaps_ppr_n_pernode "2"
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh:launch daemon 0 not a
child of mine
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: adding node mic1 to
launch list
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: activating launch event
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: recording launch of
daemon [[33630,0],1]
[atl1-02-mic0:16183] [[33630,0],0] plm:rsh: executing:
(/usr/bin/ssh) [/usr/bin/ssh mic1
PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;  
/home/ariebs/mic/mpi-nightly/bin/orted -mca
orte_leave_session_attached "1" --hnp-topo-sig
0N:1S:0L3:61L2:61L1:61C:244H:k1om -mca ess "env" -mca orte_ess_jobid
"2203975680" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "2" -mca
orte_hnp_uri
"2203975680.0;usock;tcp://16.113.180.127,192.0.0.122:34640;ud://2883658.78.1"
--tree-spawn --mca spml "yoda" --mca btl "sm,self,tcp" --mca
plm_base_verbose "5" --mca memheap_base_verbose "100" --mca
mca_component_show_load_errors "1" -mca plm "rsh" -mca
rmaps_ppr_n_pernode "2"]
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such file
or directory
[atl1-02-mic0:16183] [[33630,0],0] daemon 1 failed with status 127
[atl1-02-mic0:16183] [[33630,0],0] plm:base:orted_cmd sending
orted_exit commands
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with
--enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location
to use.

*  compilation of the orted with dynamic libraries when static are
required
  (e.g., on Cray). Please check your configure cmd line and consider
using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--
[atl1-02-mic0:16183] [[33630,0],0] plm:base:receive stop comm


On 04/13/2015 07:47 PM, Ralph Castain wrote:

Weird. I’m not sure what to try at that point - IIRC, building static
won’t resolve this problem (but you could try and see). You could add
the following to the cmd line and see if it tells us anything useful:

--leave-session-attached --mca mca_component_show_load_errors 1

You might also do an ldd on /home/ariebs/mic/mpi-nightly/bin/orted and
see where it is looking for libimf since it (and not mic.out) is the
one complaining.

On Apr 13, 2015, at 1:58 PM, Andy Riebs <andy.ri...@hp.com> wrote:

Ralph and Nathan,

The problem may be something trivial, as I don't typically use
"shmemrun" to start jobs. With the following, I *think* I

Re: [OMPI users] One-sided communication, a missing/non-existing API call

2015-04-14 Thread Andy Riebs

  
  
Nick,

You may have more luck looking into the OSHMEM layer of Open MPI;
SHMEM is designed for one-sided communications.

BR,
Andy

On 04/14/2015 02:36 PM, Nick Papior Andersen wrote:


  
Dear all,

I am trying to implement some features using a one-sided communication
scheme.

The problem is that I understand the different one-sided communication
schemes as this (basic words):

MPI_Get)
fetches remote window memory to a local memory space

MPI_Get_Accumulate)
1. fetches remote window memory to a local memory space
2. sends a local memory space (different from that used in 1.) to the
remote window and does OP on those two quantities

MPI_Put)
sends local memory space to remote window memory

MPI_Accumulate)
sends a local memory space to the remote window and does OP on those
two quantities

(surprisingly the documentation says that this only works with windows
within the same node; note that MPI_Get_Accumulate does not state this
constraint)

?)
Where is the function that fetches remotely and does the operation in
a local memory space?

Do I really have to do MPI_Get to local memory, then do the operation
manually? (no it is not difficult, but... ;) )
I would like this to exist:
MPI_Get_Reduce(origin,...,target,...,MPI_OP,...)

When I just looked at the API names I thought Get_Accumulate did this,
but to my surprise that was not the case at all. :)

--
Kind regards Nick

  

  
  
  
  
  ___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26723.php


  



Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-15 Thread Andy Riebs

  
  
Gilles and Ralph, thanks!

$ shmemrun -H mic0,mic1 -n 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=1M
$PWD/mic.out
[atl1-01-mic0:192474] [[29886,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 440
Hello World from process 0 of 2
Hello World from process 1 of 2
$

This was built with the openmpi-dev-1487-g9c6d452.tar.bz2 nightly
master. Oddly, -static-intel didn't work. Fortunately, -rpath did.
I'll follow-up in the next day or so with the winning build recipes
for both MPI and the user app to wrap up this note and, one hopes,
save others from some frustration in the future.

Andy


On 04/14/2015 11:10 PM, Ralph Castain wrote:

I think Gilles may be correct here. In reviewing the code, it appears
we have never (going back to the 1.6 series, at least) forwarded the
local LD_LIBRARY_PATH to the remote node when exec’ing the orted. The
only thing we have done is to set the PATH and LD_LIBRARY_PATH to
support the OMPI prefix - not any supporting libs.

What we have required, therefore, is that your path be setup properly
in the remote .bashrc (or pick your shell) to handle the libraries.

As I indicated, the -x option only forwards envars to the application
procs themselves, not the orted. I could try to add another cmd line
option to forward things for the orted, but the concern we’ve had in
the past (and still harbor) is that the ssh cmd line is limited in
length. Thus, adding some potentially long paths to support this
option could overwhelm it and cause failures.

I’d try the static method first, or perhaps the LDFLAGS Gilles
suggested.
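
For the .bashrc approach Ralph describes, a minimal sketch on the MIC
side might look like this (paths are the ones used earlier in this
thread; adjust to the actual install locations):

    # Append to ~/.bashrc on each card so that orted, started over ssh,
    # can find both Open MPI and the Intel runtime (libimf.so et al.)
    export PATH=/home/ariebs/mic/mpi-nightly/bin:$PATH
    export LD_LIBRARY_PATH=/home/ariebs/mic/mpi-nightly/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH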
  
  
  

  
On Apr 14, 2015, at 5:11 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:


   Andy,

what about reconfiguring Open MPI with
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"

?

IIRC, an other option is : LDFLAGS="-static-intel"

last but not least, you can always replace orted with a
simple script that sets the LD_LIBRARY_PATH and exec the
original orted
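
A minimal sketch of that wrapper idea (illustrative only: it assumes
the real daemon has been renamed to orted.real in the same directory):

    #!/bin/bash
    # Stand-in "orted" that sets LD_LIBRARY_PATH for the Intel runtime
    # and then execs the real daemon, passing all arguments through.
    export LD_LIBRARY_PATH=/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:$LD_LIBRARY_PATH
    exec /home/ariebs/mic/mpi-nightly/bin/orted.real "$@"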

do you have the same behaviour on non MIC hardware when Open MPI is
compiled with the Intel compilers?
if it works on non MIC hardware, the root cause could be that the
sshd_config on the MIC does not accept LD_LIBRARY_PATH from the client

my 0.02 US$

Gilles

On 4/14/2015 11:20 PM, Ralph Castain wrote:

Hmmm…certainly looks that way. I’ll investigate.

On Apr 14, 2015, at 6:06 AM, Andy Riebs <andy.ri...@hp.com> wrote:


  
Hi Ralph,

Still no happiness... It looks like my
LD_LIBRARY_PATH just isn't getting
propagated?

$ ldd /home/ariebs/mic/mpi-nightly/bin/orted
    linux-vdso.so.1 => 
(0x7fffa1d3b000)
    libopen-rte.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-rte.so.0
(0x2ab6ce464000)
    libopen-pal.so.0 =>
/home/ariebs/mic/mpi-nightly/lib/libopen-pal.so.0
(0x2ab6ce7d3000)
    libm.so.6 => /lib64/libm.so.6
(0x2ab6cebbd000)
    libdl.so.2 => /lib64/libdl.so.2
(0x2ab6ceded000)
    librt.so.1 => /lib64/librt.so.1
(0x2ab6ceff1000)
    libutil.so.1 =>
/lib64/libutil.so.1 (0x2ab6cf1f9000)
    libgcc_s.so.1 =>
/lib64/libgcc_s.so.1 (0x2ab6cf3fc000)
    libpthread.so.0 =>
/lib64/libpthre

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-16 Thread Andy Riebs

  
  
Hi Ralph,

If I did this right (NEVER a good bet :-) ), it didn't work...

Using last night's master nightly,
openmpi-dev-1515-gc869490.tar.bz2, I built with the same script as
yesterday, but removing the LDFLAGS=-Wl, stuff:

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly CC="icc -mmic"
CXX="icpc -mmic" \
  --build=x86_64-unknown-linux-gnu --host=x86_64-k1om-linux \
   AR=x86_64-k1om-linux-ar RANLIB=x86_64-k1om-linux-ranlib
LD=x86_64-k1om-linux-ld \
   --enable-mpirun-prefix-by-default --disable-io-romio
--disable-mpi-fortran \
   --enable-debug
--enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
$ make
$ make install
 ...
make[1]: Leaving directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
make[1]: Entering directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[2]: Entering directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490'
make  install-exec-hook
make[3]: Entering directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[3]: ./config/find_common_syms: Command not found
make[3]: [install-exec-hook] Error 127 (ignored)
make[3]: Leaving directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490'
make[1]: Leaving directory
`/home/ariebs/mic/openmpi-dev-1515-gc869490'
$

But it seems to finish the install.

I then tried to run, adding the new mca arguments:

$  shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca plm_rsh_pass_path
$PATH -mca plm_rsh_pass_libpath $MIC_LD_LIBRARY_PATH -H
mic0,mic1 -n 2 ./mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while loading shared
libraries: libimf.so: cannot open shared object file: No such
file or directory
 ...
$ echo $MIC_LD_LIBRARY_PATH
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
$ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
$


On 04/16/2015 07:22 AM, Ralph Castain wrote:

FWIW: I just added (last night) a pair of new MCA params for this
purpose:

plm_rsh_pass_path      prepends the designated path to the remote
                       shell's PATH prior to executing orted
plm_rsh_pass_libpath   same thing for LD_LIBRARY_PATH

I believe that will resolve the problem for Andy regardless of
compiler used. In the master now, waiting for someone to verify it
before adding to 1.8.5. Sadly, I am away from any cluster for the rest
of this week, so I'd welcome anyone having a chance to test it.

On Thu, Apr 16, 2015 at 2:57 AM, Thomas Jahns wrote:
  
Hello,

On Apr 15, 2015, at 02:11, Gilles Gouaillardet wrote:

what about reconfiguring Open MPI with
LDFLAGS="-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic"
?

IIRC, another option is: LDFLAGS="-static-intel"

let me first state that I have no experience developing for MIC. But
regarding the Intel runtime libraries, the only sane option in my
opinion is to use the icc.cfg/ifort.cfg/icpc.cfg files that get put in
the same directory as the corresponding compiler binaries and add a
line like

-Wl,-rpath,/path/to/composerxe/lib/intel??

to that file.

Regards, Thomas

--
Thomas Jahns
DKRZ GmbH, Department: Application
s
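
As an illustration of Thomas's suggestion (the cfg file sits next to
the compiler binary; the exact paths here are assumptions based on the
install locations mentioned in this thread):

    # Append an rpath line to icc.cfg so every icc link embeds the
    # location of the Intel runtime libraries for the MIC target.
    echo '-Wl,-rpath,/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic' \
        >> /opt/intel/15.0/composer_xe_2015.2.164/bin/intel64/icc.cfg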

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-26 Thread Andy Riebs

  
  
Hi Ralph,

Did you solve this problem in a more general way? I finally sat down
this morning to try this with the openmpi-dev-1567-g11e8c20.tar.bz2
nightly kit from last week, and can't reproduce the problem at all.

Andy

On 04/16/2015 12:15 PM, Ralph Castain wrote:

Sorry - I had to revert the commit due to a reported MTT problem. I'll
reinsert it after I get home and can debug the problem this weekend.

On Thu, Apr 16, 2015 at 9:41 AM, Andy Riebs <andy.ri...@hp.com> wrote:
  
 Hi Ralph,
  
  If I did this right (NEVER a good bet :-) ), it didn't
  work...
  
  Using last night's master nightly,
  openmpi-dev-1515-gc869490.tar.bz2, I built with the same
  script as yesterday, but removing the LDFLAGS=-Wl, stuff:

$ ./configure --prefix=/home/ariebs/mic/mpi-nightly
CC="icc -mmic" CXX="icpc -mmic" \
  --build=x86_64-unknown-linux-gnu
--host=x86_64-k1om-linux \
   AR=x86_64-k1om-linux-ar
RANLIB=x86_64-k1om-linux-ranlib LD=x86_64-k1om-linux-ld
\
   --enable-mpirun-prefix-by-default --disable-io-romio
--disable-mpi-fortran \
      --enable-debug
  --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
$ make
$ make install
    ...
  make[1]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
  make[1]: Entering directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[2]: Entering directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make  install-exec-hook
  make[3]: Entering directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[3]: ./config/find_common_syms: Command not found
  make[3]: [install-exec-hook] Error 127 (ignored)
  make[3]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[2]: Nothing to be done for `install-data-am'.
  make[2]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[1]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  $
  
  But it seems to finish the install.
  
  I then tried to run, adding the new mca arguments:
  
  $  shmemrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1M -mca
  plm_rsh_pass_path $PATH -mca plm_rsh_pass_libpath
$MIC_LD_LIBRARY_PATH -H mic0,mic1 -n 2 ./mic.out
/home/ariebs/mic/mpi-nightly/bin/orted: error while
loading shared libraries: libimf.so: cannot open
  shared object file: No such file or directory
 ...
   $ echo $MIC_LD_LIBRARY_PATH
  /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mpirt/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/ipp/lib/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/mkl/lib/mic:/opt/intel/15.0/composer_xe_2015.2.164/tbb/lib/mic
  $ ls /opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.*
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.a
/opt/intel/15.0/composer_xe_2015.2.164/compiler/lib/mic/libimf.so
  $
  

  
  
  On 04/16/2015 07:22 AM, Ralph Castain wrote:
  

  
  

  
FWIW: I just added (last night) a
  pair of new MCA params for this purpose:
  
  
  plm_rsh_pass_path    prepends the
designated path to the remote shell's PATH prior
to executing orted
  plm_rsh_pass_libpath   same thing
for LD_LIBRARY_PATH
  
  
  I believe that will resolve the problem for
Andy regardless of compiler used. In the master
now, waiting for someone to verify it before
adding to 1.8.5. Sadly, I am away from any
cluster for t

Re: [OMPI users] Problems using Open MPI 1.8.4 OSHMEM on Intel Xeon Phi/MIC

2015-04-26 Thread Andy Riebs

  
  
Yes, it just worked -- I took the old command line, just to ensure
that I was testing the correct problem, and it worked. Then I
remembered that I had set OMPI_MCA_plm_rsh_pass_path and
OMPI_MCA_plm_rsh_pass_libpath in my test setup, so I removed those
from my environment, ran again, and it still worked!

Whatever it is that you're doing Ralph, keep it up :-)

Regardless of the cause or result, thanks $$ for poking at this!

Andy

On 04/26/2015 10:35 AM, Ralph Castain wrote:

Not intentionally - I did add that new MCA param as we discussed, but
don’t recall making any other changes in this area.

There have been some other build system changes made as a result of
more extensive testing of the 1.8 release candidate - it is possible
that something in that area had an impact here.

Are you saying it just works, even without passing the new param?

On Apr 26, 2015, at 6:39 AM, Andy Riebs <andy.ri...@hp.com> wrote:


   Hi Ralph,

Did you solve this problem in a more general way? I
finally sat down this morning to try this with the
openmpi-dev-1567-g11e8c20.tar.bz2 nightly kit from last
week, and can't reproduce the problem at all.

Andy

On 04/16/2015 12:15 PM,
  Ralph Castain wrote:


  Sorry - I had to revert the
commit due to a reported MTT problem. I'll reinsert
it after I get home and can debug the problem this
weekend.
  
On Thu, Apr 16, 2015 at
      9:41 AM, Andy Riebs <andy.ri...@hp.com>
  wrote:
  

  Hi Ralph,
  
  If I did this right (NEVER a good bet :-) ),
  it didn't work...
  
  Using last night's master nightly,
  openmpi-dev-1515-gc869490.tar.bz2, I built
  with the same script as yesterday, but
  removing the LDFLAGS=-Wl, stuff:

$ ./configure
--prefix=/home/ariebs/mic/mpi-nightly
CC="icc -mmic" CXX="icpc -mmic" \
  --build=x86_64-unknown-linux-gnu
--host=x86_64-k1om-linux \
   AR=x86_64-k1om-linux-ar
RANLIB=x86_64-k1om-linux-ranlib
LD=x86_64-k1om-linux-ld \
   --enable-mpirun-prefix-by-default
--disable-io-romio --disable-mpi-fortran \
      --enable-debug
  --enable-mca-no-build=btl-usnic,btl-openib,common-verbs,oob-ud
$ make
$ make install
    ...
  make[1]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490/test'
  make[1]: Entering directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[2]: Entering directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make  install-exec-hook
  make[3]: Entering directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[3]:
./config/find_common_syms: Command not found
  make[3]: [install-exec-hook] Error 127
  (ignored)
  make[3]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[2]: Nothing to be done for
  `install-data-am'.
  make[2]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  make[1]: Leaving directory
  `/home/ariebs/mic/openmpi-dev-1515-gc869490'
  $
  
  But it seems to finish the install.
  
  I then tried to run, adding the new mca
  arguments:

[OMPI users] Using Open MPI with PBS Pro

2016-08-23 Thread Andy Riebs
I gleaned from the web that I need to comment out 
"opal_event_include=epoll" in /etc/openmpi-mca-params.conf 
in order to use Open MPI with PBS Pro.
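
For reference, the edit being described amounts to something like this
(a sketch; the file only contains that line if someone added it to
your installation):

    # Comment out the opal_event_include setting in the system-wide
    # MCA parameters file (needs root)
    sudo sed -i 's/^opal_event_include=epoll/#&/' /etc/openmpi-mca-params.conf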


Can we also disable that in other cases, like Slurm, or is this 
something specific to PBS Pro?


Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Using Open MPI with PBS Pro

2016-08-24 Thread Andy Riebs

Hi Ralph,

I think I found that information at 
<https://github.com/open-mpi/ompi/issues/341> :-)


In any case, thanks for the information about the default params file -- 
I won't worry too much about modifying it then.


Andy

On 08/23/2016 08:08 PM, r...@open-mpi.org wrote:

I’ve never heard of that, and cannot imagine what it has to do with the 
resource manager. Can you point to where you heard that one?

FWIW: we don’t ship OMPI with anything in the default mca params file, so 
somebody must have put it in there for you.



On Aug 23, 2016, at 4:48 PM, Andy Riebs  wrote:

I gleaned from the web that I need to comment out "opal_event_include=epoll" in 
/etc/openmpi-mca-params.conf in order to use Open MPI with PBS Pro.

Can we also disable that in other cases, like Slurm, or is this something 
specific to PBS Pro?

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] Slurm binding not propagated to MPI jobs

2016-10-27 Thread Andy Riebs

Hi All,

We are running Open MPI version 1.10.2, built with support for Slurm 
version 16.05.0. When a user specifies "--cpu_bind=none", MPI tries to 
bind by core, which segv's if there are more processes than cores.


The user reports:

What I found is that

% srun --ntasks-per-node=8 --cpu_bind=none  \
 env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0

will have the problem, but:

% srun --ntasks-per-node=8 --cpu_bind=none  \
 env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh bin/all2all.shmem.exe 0

Will run as expected and print out the usage message because I didn’t 
provide the right arguments to the code.


So, it appears that the binding has something to do with the issue. My 
binding script is as follows:


% cat bindit.sh
#!/bin/bash

#echo SLURM_LOCALID=$SLURM_LOCALID

stride=1

if [ ! -z "$SLURM_LOCALID" ]; then
   let bindCPU=$SLURM_LOCALID*$stride
   exec numactl --membind=0 --physcpubind=$bindCPU $*
fi

$*

%


--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Slurm binding not propagated to MPI jobs

2016-10-27 Thread Andy Riebs

Hi Ralph,

I think I've found the magic keys...

$ srun --ntasks-per-node=2 -N1 --cpu_bind=none env | grep BIND
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_TYPE=none
SLURM_CPU_BIND_LIST=
SLURM_CPU_BIND=quiet,none
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_TYPE=none
SLURM_CPU_BIND_LIST=
SLURM_CPU_BIND=quiet,none
$ srun --ntasks-per-node=2 -N1 --cpu_bind=core env | grep BIND
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_LIST=0x,0x
SLURM_CPU_BIND=quiet,mask_cpu:0x,0x
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_LIST=0x,0x
SLURM_CPU_BIND=quiet,mask_cpu:0x,0x

Andy

On 10/27/2016 11:57 AM, r...@open-mpi.org wrote:

Hey Andy

Is there a SLURM envar that would tell us the binding option from the srun cmd 
line? We automatically bind when direct launched due to user complaints of poor 
performance if we don’t. If the user specifies a binding option, then we 
detect that we were already bound and don’t do it.

However, if the user specifies that they not be bound, then we think they 
simply didn’t specify anything - and that isn’t the case. If we can see 
something that tells us “they explicitly said not to do it”, then we can 
avoid the situation.

Ralph


On Oct 27, 2016, at 8:48 AM, Andy Riebs  wrote:

Hi All,

We are running Open MPI version 1.10.2, built with support for Slurm version 16.05.0. 
When a user specifies "--cpu_bind=none", MPI tries to bind by core, which 
segv's if there are more processes than cores.

The user reports:

What I found is that

% srun --ntasks-per-node=8 --cpu_bind=none  \
 env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0

will have the problem, but:

% srun --ntasks-per-node=8 --cpu_bind=none  \
 env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh bin/all2all.shmem.exe 0

Will run as expected and print out the usage message because I didn’t provide 
the right arguments to the code.

So, it appears that the binding has something to do with the issue. My binding 
script is as follows:

% cat bindit.sh
#!/bin/bash

#echo SLURM_LOCALID=$SLURM_LOCALID

stride=1

if [ ! -z "$SLURM_LOCALID" ]; then
   let bindCPU=$SLURM_LOCALID*$stride
   exec numactl --membind=0 --physcpubind=$bindCPU $*
fi

$*

%


--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Slurm binding not propagated to MPI jobs

2016-10-27 Thread Andy Riebs

  
  
Yes, they still exist:

$ srun --ntasks-per-node=2 -N1 env | grep BIND | sort -u
SLURM_CPU_BIND_LIST=0x
SLURM_CPU_BIND=quiet,mask_cpu:0x
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_VERBOSE=quiet

Here are the relevant Slurm configuration options that could
conceivably change the behavior from system to system:

SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU

On 10/27/2016 01:17 PM, r...@open-mpi.org wrote:

And if there is no --cpu_bind on the cmd line? Do these not exist?

On Oct 27, 2016, at 10:14 AM, Andy Riebs <andy.ri...@hpe.com> wrote:

Hi
Ralph,
  
  I
think I've found the magic keys...
  
  $
srun --ntasks-per-node=2 -N1 --cpu_bind=none env | grep
BIND
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=none
  SLURM_CPU_BIND_LIST=
  SLURM_CPU_BIND=quiet,none
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=none
  SLURM_CPU_BIND_LIST=
  SLURM_CPU_BIND=quiet,none
  $
srun --ntasks-per-node=2 -N1 --cpu_bind=core env | grep
BIND
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=mask_cpu:
  SLURM_CPU_BIND_LIST=0x,0x
  SLURM_CPU_BIND=quiet,mask_cpu:0x,0x
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=mask_cpu:
  SLURM_CPU_BIND_LIST=0x,0x
  SLURM_CPU_BIND=quiet,mask_cpu:0x,0x
  
  Andy
  
  On
10/27/2016 11:57 AM, r...@open-mpi.org wrote:
  Hey Andy

Is there a SLURM envar that would tell us the binding
option from the srun cmd line? We automatically bind
when direct launched due to user complaints of poor
performance if we don’t. If the user specifies a
binding option, then we detect that we were already
bound and don’t do it.

However, if the user specifies that they not be bound,
then we think they simply didn’t specify anything
- and that isn’t the case. If we can see
something that tells us “they explicitly said not
to do it”, then we can avoid the situation.

Ralph

On Oct 27, 2016, at
  8:48 AM, Andy Riebs <andy.ri...@hpe.com>
  wrote:
  
  Hi All,
  
  We are running Open MPI version 1.10.2, built with
  support for Slurm version 16.05.0. When a user
  specifies "--cpu_bind=none", MPI tries to bind by
  core, which segv's if there are more processes than
  cores.
  
  The user reports:
  
  What I found is that
  
  % srun --ntasks-per-node=8 --cpu_bind=none  \
  env SHMEM_SYMMETRIC_HEAP_SIZE=1024M
  bin/all2all.shmem.exe 0
  
  will have the problem, but:
  
  % srun --ntasks-per-node=8 --cpu_bind=none  \
  env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh
  bin/all2all.shmem.exe 0
  
  Will run as expected and print out the usage message
  because I didn’t provide the right arguments to
  the code.
  
  So, it appears that the binding has something to do
  with the issue. My binding script is as follows:
  
  % cat bindit.sh
  #!/bin/bash
  
  #echo SLURM_LOCALID=$SLURM_LOCALID
  
  stride=1
  
  if [ ! -z "$SLURM_LOCALID" ]; then
    let bindCPU=$SLURM_LOCALID*$stride
    exec numactl --membind=0 --physcpubind=$bindCPU $*
  fi
  
  $*
  
  %
  
  
      -- 
  Andy Riebs
  andy.ri...@hpe.com
  Hewlett-Packard Enterprise
  High Performance Computing Software Engineering
  +1 404 648 9024
  My

Re: [OMPI users] Slurm binding not propagated to MPI jobs

2016-10-27 Thread Andy Riebs

  
  
Hi Ralph,

I haven't played around in this code, so I'll flip the question over
to the Slurm list, and report back here when I learn anything.

Cheers
Andy

On 10/27/2016 01:44 PM, r...@open-mpi.org wrote:

Sigh - of course it wouldn’t be simple :-(

All right, let’s suppose we look for SLURM_CPU_BIND:

* if it includes the word “none”, then we know the user specified
  that they don’t want us to bind

* if it includes the word mask_cpu, then we have to check the value
  of that option.

  * If it is all F’s, then they didn’t specify a binding and we
    should do our thing.

  * If it is anything else, then we assume they _did_ specify a
    binding, and we leave it alone

Would that make sense? Is there anything else that could be in that
envar which would trip us up?
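
A sketch of that decision logic (shell pseudocode, not the OMPI
source, using the SLURM_CPU_BIND values shown earlier in the thread;
as the later follow-up notes, this misses the map_cpu/rank/*_ldom
variants):

    case "$SLURM_CPU_BIND" in
      *none*)
        echo "user explicitly asked for no binding -> do not bind" ;;
      *mask_cpu:*)
        masks=${SLURM_CPU_BIND##*mask_cpu:}            # e.g. 0xffff,0xffff
        masks=$(echo "$masks" | sed -e 's/0x//g' -e 's/,//g')
        if [ -n "$masks" ] && [ -z "$(echo "$masks" | tr -d 'fF')" ]; then
          echo "mask is all F's -> no user binding, apply OMPI default"
        else
          echo "user specified a binding -> leave it alone"
        fi ;;
      *)
        echo "no binding directive seen -> apply OMPI default binding" ;;
    esac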
  
  
  

  
On Oct 27, 2016, at 10:37 AM, Andy Riebs <andy.ri...@hpe.com> wrote:


  
Yes, they still exist:
$ srun --ntasks-per-node=2 -N1 env | grep
  BIND | sort -u
  SLURM_CPU_BIND_LIST=0x
  SLURM_CPU_BIND=quiet,mask_cpu:0x
  SLURM_CPU_BIND_TYPE=mask_cpu:
  SLURM_CPU_BIND_VERBOSE=quiet

Here are the relevant Slurm configuration
  options that could conceivably change the behavior
  from system to system:

SelectType  = select/cons_res
  SelectTypeParameters    = CR_CPU
  


On 10/27/2016 01:17 PM, r...@open-mpi.org
  wrote:

 And if there is no --cpu_bind on
  the cmd line? Do these not exist?
  

  
On Oct 27, 2016, at 10:14 AM, Andy
      Riebs <andy.ri...@hpe.com>
  wrote:

Hi
Ralph,
  
  I
think I've found the magic keys...
  
  $ srun
--ntasks-per-node=2 -N1 --cpu_bind=none env
| grep BIND
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=none
  SLURM_CPU_BIND_LIST=
  SLURM_CPU_BIND=quiet,none
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=none
  SLURM_CPU_BIND_LIST=
  SLURM_CPU_BIND=quiet,none
  $ srun
--ntasks-per-node=2 -N1 --cpu_bind=core env
| grep BIND
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=mask_cpu:
  SLURM_CPU_BIND_LIST=0x,0x
  SLURM_CPU_BIND=quiet,mask_cpu:0x,0x
  SLURM_CPU_BIND_VERBOSE=quiet
  SLURM_CPU_BIND_TYPE=mask_cpu:
  SLURM_CPU_BIND_LIST=0x,0x
  SLURM_CPU_BIND=quiet,mask_cpu:0x,0x
  
  Andy
  
  On
10/27/2016 11:57 AM, r...@open-mpi.org wrote:
  Hey Andy

Is there a SLURM envar that would tell us
the binding option from the srun cmd line?
We automatically bind when direct launched
due to user complaints of poor performance
if we don’t. If the user specifies a
binding option, then we detect that we were
already bound and don’t do it.

However, if the user specifies that they not
be bound, then we think they simply
didn’t specify anything - and that
isn’t the case. If we can see
something that tells us “they
explicitly said not to do it”, then we
can avoid the situation.

  

Re: [OMPI users] Slurm binding not propagated to MPI jobs

2016-11-03 Thread Andy Riebs

  
  
Getting that support into 2.1 would be terrific -- and might save
  us from having to write some Slurm prolog scripts to effect that.

Thanks Ralph!

On 11/01/2016 11:36 PM, r...@open-mpi.org wrote:

Ah crumby!! We already solved this on master, but it cannot be
backported to the 1.10 series without considerable pain. For some
reason, the support for it has been removed from the 2.x series as
well. I’ll try to resolve that issue and get the support reinstated
there (probably not until 2.1).

Can you manage until then? I think the v2 RM’s are thinking Dec/Jan
for 2.1.

Ralph

On Nov 1, 2016, at 11:38 AM, Riebs, Andy <andy.ri...@hpe.com> wrote:

To close the thread here… I got the following information:

Looking at SLURM_CPU_BIND is the right idea, but there are quite a few
more options. It misses map_cpu, rank, plus the NUMA-based options:
rank_ldom, map_ldom, and mask_ldom. See the srun man pages for
documentation.

From: Riebs, Andy
Sent: Thursday, October 27, 2016 1:53 PM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] Slurm binding not propagated to MPI jobs

Hi Ralph,

I haven't played around in this code, so I'll flip the question over
to the Slurm list, and report back here when I learn anything.

Cheers
Andy

On 10/27/2016 01:44 PM, r...@open-mpi.org wrote:


  Sigh
- of course it wouldn’t be simple :-( 
  
 
  
  
All right, let’s suppose we look for
  SLURM_CPU_BIND:
  
  
 
  
  
* if it includes the word “none”,
  then we know the user specified that they don’t
  want us to bind
  
  
 
  
  
* if it includes the word mask_cpu, then
  we have to check the value of that option.
  
  
 
  
  

  * If it is all F’s, then they
didn’t specify a binding and we should do our
thing.


   


  * If it is anything else, then we
assume they _did_ specify a binding, and we
leave it alone

  
  
 
  
  
Would that make sense? Is there anything
  else that could be in that envar which would trip
  us up?
  
  
 
  
  
 

  

  On Oct 27, 2016, at
    10:37 AM, Andy Riebs <andy.ri...@hpe.com>
wrote:

 

  
Yes, they still
  exist:
$ srun
  --ntasks-per-node=2 -N1 env | grep BIND |
  sort -u
  SLURM_CPU_BIND_LIST=0x
  SLURM_CPU_BIND=quiet,mask_cpu:0x
  SLURM_CPU_BIND_TYPE=mask_cpu:
  SLURM_CPU_BIND_VERBOSE=quiet
Here are the
  relevant Slurm configuration options that
  could conceivably change the behavior from
  system to system:
SelectType 
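
For anyone who ends up writing the kind of Slurm prolog check mentioned above, here is a minimal bash sketch of the heuristic Ralph describes. It is illustrative only, not Open MPI's actual code, and, as noted in the follow-up, a complete version would also have to recognize map_cpu, rank, and the *_ldom binding types:

#!/bin/bash
# Sketch only: classify the srun binding request from SLURM_CPU_BIND*.
bind="${SLURM_CPU_BIND:-}"
list="${SLURM_CPU_BIND_LIST:-}"

if [ -z "$bind" ]; then
    echo "no srun binding information present"
elif [[ "$bind" == *none* ]]; then
    echo "user explicitly asked not to be bound"
elif [[ "$bind" == *mask_cpu* ]]; then
    # Strip the 0x prefixes and commas; if only f/F digits remain, the
    # mask covers every CPU and is not a real user-specified binding.
    digits=$(printf '%s' "$list" | sed -e 's/0x//g' -e 's/,//g')
    if [ -n "$digits" ] && [ -z "$(printf '%s' "$digits" | tr -d 'fF')" ]; then
        echo "mask is all F's: treat as no explicit binding"
    else
        echo "user supplied an explicit CPU mask"
    fi
else
    echo "unhandled binding spec: $bind"
fi

Run inside an allocation, this prints which of the cases discussed above applies to the current job step.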
 

Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Andy Riebs
Interestingly, I participated in the discussion that led to that workaround, stating that I had no problem compiling Open MPI with PGI v9. I'm assuming the problem now is that I'm specifying --enable-mpi-thread-multiple, which I'm doing because a user requested that feature.

It's been exactly 8 years and 2 days since that workaround was posted to the list. Please tell me there is a better way of dealing with this issue than writing a 'fakepgf90' script. Any suggestions?


--
Prentice

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
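
For context, the 'fakepgf90' workaround referred to above is generally just a tiny wrapper that hides the offending switch from the PGI compiler. A rough sketch, assuming the only problem flag is -pthread and the real compiler is pgf90:

#!/bin/bash
# fakepgf90 (sketch): drop -pthread, pass everything else to the real pgf90.
args=()
for a in "$@"; do
    [ "$a" = "-pthread" ] || args+=("$a")
done
exec pgf90 "${args[@]}"

Pointing configure at the wrapper (for example FC=/path/to/fakepgf90) keeps libtool's -pthread from ever reaching pgf90; the question raised above is whether there is now a cleaner way than this.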

[OMPI users] Build problem

2017-05-24 Thread Andy Riebs

Hi,

I'm trying to build OMPI on RHEL 7.2 with MOFED on an x86_64 system, and 
I'm seeing


   =
   Open MPI gitclone: test/datatype/test-suite.log
   =

   # TOTAL: 9
   # PASS:  8
   # SKIP:  0
   # XFAIL: 0
   # FAIL:  1
   # XPASS: 0
   # ERROR: 0

   .. contents:: :depth: 2

   FAIL: external32
   

   
/data/swstack/packages/shmem-mellanox/openmpi-gitclone/test/datatype/.libs/lt-external32:
   symbol lookup error:
   
/data/swstack/packages/shmem-mellanox/openmpi-gitclone/test/datatype/.libs/lt-external32:
   undefined symbol: ompi_datatype_pack_external_size
   FAIL external32 (exit status: 127)

I'm probably missing an obvious library or package, but 
libc++-devel.i686 and glibc-devel.i686 didn't cover this for me.


Alex, I'd like to buy a clue, please?

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Build problem

2017-05-24 Thread Andy Riebs

Exactly the hint that I needed -- thanks Gilles!

Andy

On 05/24/2017 10:33 PM, Gilles Gouaillardet wrote:

Andy,


it looks like some MPI libraries are being mixed in your environment


from the test/datatype directory, what if you

ldd .libs/lt-external32

does it resolve to the libmpi.so you expect?


Cheers,


Gilles


On 5/25/2017 11:02 AM, Andy Riebs wrote:

Hi,

I'm trying to build OMPI on RHEL 7.2 with MOFED on an x86_64 system, 
and I'm seeing


=
   Open MPI gitclone: test/datatype/test-suite.log
=

# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: external32


/data/swstack/packages/shmem-mellanox/openmpi-gitclone/test/datatype/.libs/lt-external32:
symbol lookup error:
/data/swstack/packages/shmem-mellanox/openmpi-gitclone/test/datatype/.libs/lt-external32:
undefined symbol: ompi_datatype_pack_external_size
FAIL external32 (exit status: 127)

I'm probably missing an obvious library or package, but 
libc++-devel.i686 and glibc-devel.i686 didn't cover this for me.


Alex, I'd like to buy a clue, please?

Andy
--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
 May the source be with you!


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
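
For anyone who hits the same undefined-symbol failure, the check Gilles suggests looks roughly like this (run from the build tree; the relative path to the freshly built libmpi assumes an in-tree build):

cd test/datatype
# Which libmpi.so does the failing test binary actually resolve to?
ldd .libs/lt-external32 | grep -i libmpi
# Compare with the library that was just built in this source tree:
ls ../../ompi/.libs/libmpi.so*
# If ldd points at some other installation, check what is leaking in:
echo "$LD_LIBRARY_PATH"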


[OMPI users] Problem with MPI jobs terminating when using OMPI 3.0.x

2017-10-27 Thread Andy Riebs
We have built a version of Open MPI 3.0.x that works with Slurm (our 
primary use case), but it fails when executed without Slurm.


If I srun an MPI "hello world" program, it works just fine. Likewise, if 
I salloc a couple of nodes and use mpirun from there, life is good. But 
if I just try to mpirun the program without Slurm support, the program 
appears to run to completion, and then segv's. A bit of good news is 
that this can be reproduced with a single process.


Sample output and configuration information below:

[tests]$ cat gdb.cmd
set follow-fork-mode child
r
[tests]$ mpirun -host node04 -np 1 gdb -x gdb.cmd ./mpi_hello
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/riebs/tests/mpi_hello...(no debugging symbols 
found)...done.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x74be8700 (LWP 21386)]
[New Thread 0x73f70700 (LWP 21387)]
[New Thread 0x7fffeacac700 (LWP 21393)]
[Thread 0x7fffeacac700 (LWP 21393) exited]
[New Thread 0x7fffeacac700 (LWP 21394)]
Hello world! I'm 0 of 1 on node04
[Thread 0x7fffeacac700 (LWP 21394) exited]
[Thread 0x73f70700 (LWP 21387) exited]
[Thread 0x74be8700 (LWP 21386) exited]
[Inferior 1 (process 21382) exited normally]
Missing separate debuginfos, use: debuginfo-install 
glibc-2.17-157.el7.x86_64 libevent-2.0.21-4.el7.x86_64 
libgcc-4.8.5-11.el7.x86_6
4 libibcm-1.0.5mlnx2-OFED.3.4.0.0.4.34100.x86_64 
libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.x86_64 
libibverbs-1.2.1mlnx1-OFED
.3.4.2.1.4.34218.x86_64 libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34218.x86_64 
libmlx5-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64 libnl-1.1.4-3.
el7.x86_64 librdmacm-1.1.0mlnx-OFED.3.4.0.0.4.34218.x86_64 
libtool-ltdl-2.4.2-21.el7_2.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 open

sm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34218.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) q
[node04:21373] *** Process received signal ***
[node04:21373] Signal: Segmentation fault (11)
[node04:21373] Signal code:  (128)
[node04:21373] Failing at address: (nil)
[node04:21373] [ 0] /lib64/libpthread.so.0(+0xf370)[0x760c4370]
[node04:21373] [ 1] 
/opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x3a04b)[0x7365104b]
[node04:21373] [ 2] 
/lib64/libevent-2.0.so.5(event_base_loop+0x774)[0x764e4a14]
[node04:21373] [ 3] 
/opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x285cd)[0x7363f5cd]

[node04:21373] [ 4] /lib64/libpthread.so.0(+0x7dc5)[0x760bcdc5]
[node04:21373] [ 5] /lib64/libc.so.6(clone+0x6d)[0x75deb73d]
[node04:21373] *** End of error message ***
bash: line 1: 21373 Segmentation fault 
/opt/local/shmem/3.0.x.4ca1c4d/bin/orted -mca ess "env" -mca ess_base_jobid
"399966208" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca 
orte_node_regex "node[2:73],node[4:0]04@0(2)" -mca orte_hnp_uri "3
99966208.0;tcp://16.95.253.128,10.4.0.6:52307" -mca plm "rsh" -mca 
coll_tuned_use_dynamic_rules "1" -mca scoll "^mpi" -mca pml "ucx"
 -mca coll_tuned_allgatherv_algorithm "2" -mca atomic "ucx" -mca sshmem 
"mmap" -mca spml_ucx_heap_reg_nb "1" -mca coll_tuned_allgath
er_algorithm "2" -mca spml "ucx" -mca coll "^hcoll" -mca pmix 
"^s1,s2,cray,isolated"


[tests]$ env | grep -E -e MPI -e UCX -e SLURM | sort
OMPI_MCA_atomic=ucx
OMPI_MCA_coll=^hcoll
OMPI_MCA_coll_tuned_allgather_algorithm=2
OMPI_MCA_coll_tuned_allgatherv_algorithm=2
OMPI_MCA_coll_tuned_use_dynamic_rules=1
OMPI_MCA_pml=ucx
OMPI_MCA_scoll=^mpi
OMPI_MCA_spml=ucx
OMPI_MCA_spml_ucx_heap_reg_nb=1
OMPI_MCA_sshmem=mmap
OPENMPI_PATH=/opt/local/shmem/3.0.x.4ca1c4d
OPENMPI_VER=3.0.x.4ca1c4d
SLURM_DISTRIBUTION=block:block
SLURM_HINT=nomultithread
SLURM_SRUN_REDUCE_TASK_EXIT=1
SLURM_TEST_EXEC=1
SLURM_UNBUFFEREDIO=1
SLURM_VER=17.11.0-0pre2
UCX_TLS=dc_x
UCX_ZCOPY_THRESH=131072
[tests]$

OS: CentOS 7.3
HW: x86_64 (KNL)
OMPI version: 3.0.x.4ca1c4d
Configuration options:
    --prefix=/opt/local/shmem/3.0.x.4ca1c4d
--with-hcoll=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/hcoll 


    --with-hwloc=/opt/local/hwloc/1.11.4
--with-knem=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/knem
    --with-libevent=/usr
--with-mxm=/opt/mellanox/hpcx-v2.0.0-gcc-MLNX_OFED_LINUX-3.4-2.1.8.0-redhat7.3-x86_64/mxm
    --with-platform=cont

Re: [OMPI users] Problem with MPI jobs terminating when using OMPI 3.0.x

2017-10-31 Thread Andy Riebs

As always, thanks for your help Ralph!

Cutting over to PMIx 1.2.4 solved the problem for me. (Slurm wasn't 
happy building with PMIx v2.)


And yes, I had ssh access to node04.

(And Gilles, thanks for your note, as well.)

Andy

On 10/27/2017 04:31 PM, r...@open-mpi.org wrote:

Two questions:

1. are you running this on node04? Or do you have ssh access to node04?

2. I note you are building this against an old version of PMIx for some reason. 
Does it work okay if you build it with the embedded PMIx (which is 2.0)? Does 
it work okay if you use PMIx v1.2.4, the latest release in that series?



On Oct 27, 2017, at 1:24 PM, Andy Riebs  wrote:

We have built a version of Open MPI 3.0.x that works with Slurm (our primary 
use case), but it fails when executed without Slurm.

If I srun an MPI "hello world" program, it works just fine. Likewise, if I 
salloc a couple of nodes and use mpirun from there, life is good. But if I just try to 
mpirun the program without Slurm support, the program appears to run to completion, and 
then segv's. A bit of good news is that this can be reproduced with a single process.

Sample output and configuration information below:

[tests]$ cat gdb.cmd
set follow-fork-mode child
r
[tests]$ mpirun -host node04 -np 1 gdb -x gdb.cmd ./mpi_hello
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/riebs/tests/mpi_hello...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x74be8700 (LWP 21386)]
[New Thread 0x73f70700 (LWP 21387)]
[New Thread 0x7fffeacac700 (LWP 21393)]
[Thread 0x7fffeacac700 (LWP 21393) exited]
[New Thread 0x7fffeacac700 (LWP 21394)]
Hello world! I'm 0 of 1 on node04
[Thread 0x7fffeacac700 (LWP 21394) exited]
[Thread 0x73f70700 (LWP 21387) exited]
[Thread 0x74be8700 (LWP 21386) exited]
[Inferior 1 (process 21382) exited normally]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.x86_64 
libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-11.el7.x86_6
4 libibcm-1.0.5mlnx2-OFED.3.4.0.0.4.34100.x86_64 
libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.x86_64 
libibverbs-1.2.1mlnx1-OFED
.3.4.2.1.4.34218.x86_64 libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34218.x86_64 
libmlx5-1.2.1mlnx1-OFED.3.4.2.1.4.34218.x86_64 libnl-1.1.4-3.
el7.x86_64 librdmacm-1.1.0mlnx-OFED.3.4.0.0.4.34218.x86_64 
libtool-ltdl-2.4.2-21.el7_2.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 open
sm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34218.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) q
[node04:21373] *** Process received signal ***
[node04:21373] Signal: Segmentation fault (11)
[node04:21373] Signal code:  (128)
[node04:21373] Failing at address: (nil)
[node04:21373] [ 0] /lib64/libpthread.so.0(+0xf370)[0x760c4370]
[node04:21373] [ 1] 
/opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x3a04b)[0x7365104b]
[node04:21373] [ 2] 
/lib64/libevent-2.0.so.5(event_base_loop+0x774)[0x764e4a14]
[node04:21373] [ 3] 
/opt/local/pmix/1.2.1/lib/libpmix.so.2(+0x285cd)[0x7363f5cd]
[node04:21373] [ 4] /lib64/libpthread.so.0(+0x7dc5)[0x760bcdc5]
[node04:21373] [ 5] /lib64/libc.so.6(clone+0x6d)[0x75deb73d]
[node04:21373] *** End of error message ***
bash: line 1: 21373 Segmentation fault /opt/local/shmem/3.0.x.4ca1c4d/bin/orted -mca ess 
"env" -mca ess_base_jobid
"399966208" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex 
"node[2:73],node[4:0]04@0(2)" -mca orte_hnp_uri "3
99966208.0;tcp://16.95.253.128,10.4.0.6:52307" -mca plm "rsh" -mca coll_tuned_use_dynamic_rules "1" 
-mca scoll "^mpi" -mca pml "ucx"
  -mca coll_tuned_allgatherv_algorithm "2" -mca atomic "ucx" -mca sshmem "mmap" -mca 
spml_ucx_heap_reg_nb "1" -mca coll_tuned_allgath
er_algorithm "2" -mca spml "ucx" -mca coll "^hcoll" -mca pmix 
"^s1,s2,cray,isolated"

[tests]$ env | grep -E -e MPI -e UCX -e SLURM | sort
OMPI_MCA_atomic=ucx
OMPI_MCA_coll=^hcoll
OMPI_MCA_coll_tuned_allgather_algorithm=2
OMPI_MCA_coll_tuned_allgatherv_algorithm=2
OMPI_MCA_coll_tuned_use_dynamic_rules=1
OMPI_MCA_pml=ucx
OMPI_MCA_scoll=^mpi
OMPI_MCA_spml=ucx
OMPI_MCA_spml_ucx_heap_reg_nb=1
OMPI_MCA_sshmem=mmap
OPENMPI_PATH=/opt/local/shmem/3.0.x.4ca1c4d
OPENMPI_VER=3.0.x.4ca1c4d
SLURM_DISTRIBUTION=block:block
SLURM_HINT=nomultithread
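
Since the fix came down to matching PMIx versions, two quick checks help confirm what a given build is using (the install prefixes below are just the ones from this thread; substitute your own):

# Which PMIx component did this Open MPI build end up with?
/opt/local/shmem/3.0.x.4ca1c4d/bin/ompi_info | grep -i pmix
# What version is the external PMIx installation?
grep -i version /opt/local/pmix/1.2.1/include/pmix_version.h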

Re: [OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-09 Thread Andy Riebs

Noam,

Start with the FAQ, etc., under "Getting Help/Support" in the 
left-column menu at https://www.open-mpi.org/


Andy


*From:* Noam Bernstein 
*Sent:* Tuesday, October 09, 2018 2:26PM
*To:* Open Mpi Users 
*Cc:*
*Subject:* [OMPI users] no openmpi over IB on new CentOS 7 system

Hi - I’m trying to get OpenMPI working on a newly configured CentOS 7 
system, and I’m not even sure what information would be useful to 
provide.  I’m using the CentOS built in libibverbs and/or libfabric, and 
I configure openmpi with just

--with-verbs --with-ofi --prefix=$DEST
also tried --without-ofi, no change.  Basically, I can run with "--mca btl 
self,vader", but if I try "--mca btl,openib" I get an error from each 
process:


   
[compute-0-0][[24658,1],5][connect/btl_openib_connect_udcm.c:1245:udcm_rc_qp_to_rtr]
   error modifing QP to RTR errno says Invalid argument

If I don’t specify the btl it appears to try to set up openib with the 
same errors, then crashes on some free() related segfault, presumably 
when it tries to actually use vader.


The machine seems to be able to see its IB interface, as reported by 
things like ibstatus or ibv_devinfo.  I’m not sure what else to look 
for.  I also confirmed that “ulimit -l” reports unlimited.


Does anyone have any suggestions as to how to diagnose this issue?

thanks,
Noam


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
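
The FAQ covers most of this, but the usual first pass for an openib failure like the one above is to gather the following (all standard tools; "./a.out" stands in for whatever test program you are running):

# Adapter and port state, link layer
ibv_devinfo
ibstatus
# Locked-memory limit should be unlimited (or very large)
ulimit -l
# Check which fabric components this Open MPI build actually contains
ompi_info | grep -i -E 'openib|ofi'
# Re-run with verbose BTL output to see why openib rejects the port
mpirun --mca btl self,vader,openib --mca btl_base_verbose 30 -np 2 ./a.out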

Re: [OMPI users] MPIRUN SEGMENTATION FAULT

2016-04-23 Thread Andy Riebs

  
  
The challenge for the MPI experts here (of which I am NOT one!) is that the problem appears to be in your program; MPI is simply reporting that your program failed. If you got the program from someone else, you will need to solicit their help. If you wrote it, well, it is never a bad time to learn to use gdb!

Best regards
Andy

On 04/23/2016 10:41 AM, Elio Physics wrote:

I am not really an expert with gdb. What is the core file? and how to use gdb? I have got three files as an output when the executable is used. One is the actual output which stops and the other two are error files (from which I knew about the segmentation fault).

thanks

From: users on behalf of Ralph Castain
Sent: Saturday, April 23, 2016 11:39 AM
To: Open MPI Users
Subject: Re: [OMPI users] MPIRUN SEGMENTATION FAULT

valgrind isn't going to help here - there are multiple reasons why your application could be segfaulting. Take a look at the core file with gdb and find out where it is failing.

On Apr 22, 2016, at 10:20 PM, Elio Physics wrote:

One more thing i forgot to mention in my previous e-mail. In the output file I get the following message:

2 total processes killed (some possibly by mpirun during cleanup)

Thanks

From: users on behalf of Elio Physics
Sent: Saturday, April 23, 2016 3:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] MPIRUN SEGMENTATION FAULT

I have used valgrind and this is what i got:

valgrind mpirun ~/Elie/SPRKKR/bin/kkrscf6.3MPI Fe_SCF.inp > scf-51551.jlborges.fisica.ufmg.br.out
==8135== Memcheck, a memory error detector
==8135== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==8135== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8135== Command: mpirun /home/emoujaes/Elie/SPRKKR/bin/kkrscf6.3MPI Fe_SCF.inp
==8135==
--
mpirun noticed that process rank 0 with PID 8147 on node jlborges.fisica.ufmg.br exited on signal 11 (Segmentation fault).
--
==8135==
==8135== HEAP SUMMARY:
==8135==     in use at exit: 485,683 bytes in 1,899 blocks
==8135==   total heap usage: 7,723 allocs, 5,824 frees, 12,185,660 bytes allocated
==8135==
==8135== LEAK SUMMARY:
==8135==    definitely lost: 34,944 bytes in 34 blocks
==8135==    indirectly lost: 26,613 bytes in 58 blocks
==8135==      possibly lost: 0 bytes in 0 blocks
==8135==    still reachable: 424,126 bytes in 1,807 blocks
==8135==         suppressed: 0 bytes in 0 blocks
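
To answer the "what is the core file?" question in practical terms: when a rank dies on signal 11, the OS can write a core file (a snapshot of the crashed process), and gdb can read it alongside the executable. A typical session, assuming the code was compiled with -g and core dumps are enabled, looks like this (file names are examples only):

# Allow core files to be written, then reproduce the crash
ulimit -c unlimited
mpirun -np 4 ~/Elie/SPRKKR/bin/kkrscf6.3MPI Fe_SCF.inp
# Look for the core file (name/location depends on the system's core_pattern)
ls core*
# Open the core with the matching executable and get a backtrace
gdb ~/Elie/SPRKKR/bin/kkrscf6.3MPI core.8147
(gdb) bt
(gdb) frame 0
(gdb) info locals

bt shows where the crash happened; frame and info locals let you poke at the faulting routine. Compile with -g (ideally -g -O0) so the backtrace has usable symbols.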

[OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1

2016-05-05 Thread Andy Riebs
I've built 1.10.2 with all my favorite configuration options, but I get 
messages such as this (one for each rank with 
orte_base_help_aggregate=0) when I try to run on a MOFED system:


$ shmemrun -H hades02,hades03 $PWD/shmem.out
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   hades03
  Local device: mlx4_0
  Local port:   2
  CPCs attempted:   rdmacm, udcm
--

My configure options:
config_opts="--prefix=${INSTALL_DIR} \
--without-mpi-param-check \
--with-knem=/opt/mellanox/hpcx/knem \
--with-mxm=/opt/mellanox/mxm  \
--with-mxm-libdir=/opt/mellanox/mxm/lib \
--with-fca=/opt/mellanox/fca \
--with-pmi=${INSTALL_ROOT}/slurm \
--without-psm --disable-dlopen \
--disable-vt \
--enable-orterun-prefix-by-default \
--enable-debug-symbols"


There aren't any obvious error messages in the build log -- what am I 
missing?


Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE



Re: [OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1

2016-05-05 Thread Andy Riebs

  
  
For anyone like me who happens to google this in the future, the solution was to set OMPI_MCA_pml=yalla

Many thanks Josh!

On 05/05/2016 12:52 PM, Joshua Ladd wrote:

We are working with Andy offline.

Josh

On Thu, May 5, 2016 at 7:32 AM, Andy Riebs <andy.ri...@hpe.com> wrote:

I've built 1.10.2 with all my favorite configuration options, but I get messages such as this (one for each rank with orte_base_help_aggregate=0) when I try to run on a MOFED system:

$ shmemrun -H hades02,hades03 $PWD/shmem.out
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:       hades03
  Local device:     mlx4_0
  Local port:       2
  CPCs attempted:   rdmacm, udcm
--

My configure options:
config_opts="--prefix=${INSTALL_DIR} \
        --without-mpi-param-check \
        --with-knem=/opt/mellanox/hpcx/knem \
        --with-mxm=/opt/mellanox/mxm  \
        --with-mxm-libdir=/opt/mellanox/mxm/lib \
        --with-fca=/opt/mellanox/fca \
        --with-pmi=${INSTALL_ROOT}/slurm \
        --without-psm --disable-dlopen \
        --disable-vt \
        --enable-orterun-prefix-by-default \
        --enable-debug-symbols"

There aren't any obvious error messages in the build log -- what am I missing?

Andy

-- 
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE

___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29094.php

___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29100.php
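
For completeness, the same selection can be made per-environment or per-command, and ompi_info will confirm the component is present in the build:

# Environment-variable form (what was used here)
export OMPI_MCA_pml=yalla
shmemrun -H hades02,hades03 $PWD/shmem.out

# Equivalent command-line form
shmemrun --mca pml yalla -H hades02,hades03 $PWD/shmem.out

# Verify the MXM-based yalla PML was actually built
ompi_info | grep -i yalla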


  



Re: [OMPI users] Problems using 1.10.2 with MOFED 3.1-1.1.0.1

2016-05-05 Thread Andy Riebs

  
  
Sorry, my output listing was incomplete -- the program did run after
the "No OpenFabrics" message, but (I presume) ran over Ethernet
rather than InfiniBand. So I can't really say what was causing it to
fail.

Andy

On 05/05/2016 06:09 PM, Nathan Hjelm
  wrote:


  
It should work fine with ob1 (the default). Did you determine what was
causing it to fail?

-Nathan

On Thu, May 05, 2016 at 06:04:55PM -0400, Andy Riebs wrote:

  
   For anyone like me who happens to google this in the future, the solution
   was to set OMPI_MCA_pml=yalla

   Many thanks Josh!

   On 05/05/2016 12:52 PM, Joshua Ladd wrote:

 We are working with Andy offline.

 Josh
 On Thu, May 5, 2016 at 7:32 AM, Andy Riebs  wrote:

   I've built 1.10.2 with all my favorite configuration options, but I
   get messages such as this (one for each rank with
   orte_base_help_aggregate=0) when I try to run on a MOFED system:

   $ shmemrun -H hades02,hades03 $PWD/shmem.out
   --
   No OpenFabrics connection schemes reported that they were able to be
   used on a specific port.  As such, the openib BTL (OpenFabrics
   support) will be disabled for this port.

 Local host:   hades03
 Local device: mlx4_0
 Local port:   2
 CPCs attempted:   rdmacm, udcm
   --

   My configure options:
   config_opts="--prefix=${INSTALL_DIR} \
   --without-mpi-param-check \
   --with-knem=/opt/mellanox/hpcx/knem \
   --with-mxm=/opt/mellanox/mxm  \
   --with-mxm-libdir=/opt/mellanox/mxm/lib \
   --with-fca=/opt/mellanox/fca \
   --with-pmi=${INSTALL_ROOT}/slurm \
   --without-psm --disable-dlopen \
   --disable-vt \
   --enable-orterun-prefix-by-default \
   --enable-debug-symbols"

   There aren't any obvious error messages in the build log -- what am I
   missing?

   Andy

   --
   Andy Riebs
   andy.ri...@hpe.com
   Hewlett-Packard Enterprise
   High Performance Computing Software Engineering
   +1 404 648 9024
   My opinions are not necessarily those of HPE

   ___
   users mailing list
   us...@open-mpi.org
   Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
   Link to this post:
   http://www.open-mpi.org/community/lists/users/2016/05/29094.php

 ___
 users mailing list
 us...@open-mpi.org
 Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29100.php

  
  

  
___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29101.php

  
  

  
  
  
  ___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29102.php


  



Re: [OMPI users] Open MPI does not work when MPICH or intel MPI are installed

2016-05-23 Thread Andy Riebs

  
  
Hi,

The short answer: Environment module files are probably the best solution for your problem.

The long answer: See , which pretty much addresses your question.

Andy

On 05/23/2016 07:40 AM, Megdich Islem wrote:

Hi,

I am using two software packages, one called OpenFOAM and the other called EMPIRE, that need to run together at the same time. OpenFOAM uses the Open MPI implementation and EMPIRE uses either MPICH or Intel MPI. The version of Open MPI that comes with OpenFOAM is 1.6.5. I am using Intel(R) MPI Library for Linux OS, version 5.1.3, and MPICH 3.0.4.

My problem is that when I have the environment variables of either MPICH or Intel MPI sourced in bashrc, I fail to run an OpenFOAM case with parallel processing (you will find attached a picture of the error I got).

This is an example of a command line I use to run OpenFOAM:

mpirun -np 4 interFoam -parallel

Once I keep the environment variables of OpenFOAM only, the parallel processing works without any problem, so I won't be able to run EMPIRE.

I am sourcing the environment variables in this way:

For OpenFOAM:
source /opt/openfoam30/etc/bashrc

For MPICH 3.0.4:
export PATH=/home/islem/Desktop/mpich/bin:$PATH
export LD_LIBRARY_PATH="/home/islem/Desktop/mpich/lib/:$LD_LIBRARY_PATH"
export MPICH_F90=gfortran
export MPICH_CC=/opt/intel/bin/icc
export MPICH_CXX=/opt/intel/bin/icpc
export MPICH-LINK_CXX="-L/home/islem/Desktop/mpich/lib/ -Wl,-rpath -Wl,/home/islem/Desktop/mpich/lib -lmpichcxx -lmpich -lopa -lmpl -lrt -lpthread"

For Intel:
export PATH=$PATH:/opt/intel/bin/
LD_LIBRARY_PATH="/opt/intel/lib/intel64:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH
source /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/bin/mpivars.sh intel64

If only OpenFOAM is sourced, mpirun --version gives OPEN MPI (1.6.5).
If OpenFOAM and MPICH are sourced, mpirun --version gives mpich 3.0.1.
If OpenFOAM and Intel MPI are sourced, mpirun --version gives Intel(R) MPI Library for Linux, version 5.1.3.

My question is why I can't have two MPI implementations installed and sourced together. How can I solve the problem?

Regards,
Islem Megdiche

___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29279.php
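
Module files are the cleaner answer, but the same idea can be sketched as a small wrapper that sets up exactly one MPI stack before launching a command. The paths are the ones from the message above, "./empire_solver" is just a placeholder, and the wrapper must be run from a shell whose bashrc does not source any MPI environment itself:

#!/bin/bash
# with-mpi (sketch): run a command under exactly one MPI environment.
# Usage: ./with-mpi openfoam mpirun -np 4 interFoam -parallel
#        ./with-mpi mpich   mpiexec -np 4 ./empire_solver
stack="$1"; shift
case "$stack" in
    openfoam)
        source /opt/openfoam30/etc/bashrc
        ;;
    mpich)
        export PATH=/home/islem/Desktop/mpich/bin:$PATH
        export LD_LIBRARY_PATH=/home/islem/Desktop/mpich/lib:${LD_LIBRARY_PATH:-}
        ;;
    intel)
        source /opt/intel/compilers_and_libraries_2016.3.210/linux/mpi/intel64/bin/mpivars.sh intel64
        ;;
    *)
        echo "unknown MPI stack: $stack" >&2
        exit 1
        ;;
esac
exec "$@"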


  



[OMPI users] Experience with SHMEM with OpenMP

2019-02-27 Thread Andy Riebs
The web suggests that OpenMP should work just fine with OpenMPI/MPI -- 
does this also work with OpenMPI/SHMEM?


Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
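
One way to find out empirically on a given build: compile a tiny hybrid program with the OSHMEM compiler wrapper and OpenMP enabled, then run a couple of PEs with a few threads each. A quick sketch, assuming a gcc-style -fopenmp flag and the oshcc/shmemrun wrappers from the Open MPI install:

cat > hybrid.c <<'EOF'
#include <omp.h>
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int pe = shmem_my_pe();
    /* OpenMP threads inside an OpenSHMEM PE; SHMEM calls stay on the main thread. */
    #pragma omp parallel
    printf("PE %d: thread %d of %d\n",
           pe, omp_get_thread_num(), omp_get_num_threads());
    shmem_finalize();
    return 0;
}
EOF
oshcc -fopenmp -o hybrid hybrid.c
OMP_NUM_THREADS=4 shmemrun -np 2 ./hybrid

Note that this only exercises OpenMP compute threads inside each PE; making OpenSHMEM calls from inside the parallel region is a separate thread-safety question.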


Re: [OMPI users] Building PMIx and Slurm support

2019-03-03 Thread Andy Riebs

Daniel,

I think you need to have "--with-pmix=" point to a specific directory; 
either "/usr" if you installed it in /usr/lib and /usr/include, or the 
specific directory, like "--with-pmix=/usr/local/pmix-3.0.2"


Andy


*From:* Daniel Letai 
*Sent:* Sunday, March 03, 2019 8:54AM
*To:* Users 
*Cc:*
*Subject:* Re: [OMPI users] Building PMIx and Slurm support

Hello,


I have built the following stack :

1. centos 7.5 (gcc 4.8.5-28, libevent 2.0.21-4)
2. MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tgz built with --all
   --without-32bit (this includes ucx 1.5.0)
3. hwloc from centos 7.5 : 1.11.8-4.el7
4. pmix 3.1.2
5. slurm 18.08.5-2 built --with-ucx --with-pmix
6. openmpi 4.0.0 : configure --with-slurm --with-pmix=external
   --with-pmi --with-libevent=external --with-hwloc=external
   --with-knem=/opt/knem-1.1.3.90mlnx1 --with-hcoll=/opt/mellanox/hcoll

The configure part succeeds, however 'make' errors out with:

ext3x.c: In function 'ext3x_value_unload':

ext3x.c:1109:10: error: 'PMIX_MODEX' undeclared (first use in this function)

And the same for 'PMIX_INFO_ARRAY'.


However, both are declared in the 
opal/mca/pmix/pmix3x/pmix/include/pmix_common.h file.


opal/mca/pmix/ext3x/ext3x.c does include pmix_common.h, but as a system include (#include <pmix_common.h>), while ext3x.h includes it as a local include (#include "pmix_common"). Neither seems to pull from the correct path.



Regards,

Dani_L.


On 2/24/19 3:09 AM, Gilles Gouaillardet wrote:

Passant,

you have to manually download and apply
https://github.com/pmix/pmix/commit/2e2f4445b45eac5a3fcbd409c81efe318876e659.patch
to PMIx 2.2.1
that should likely fix your problem.

As a side note,  it is a bad practice to configure --with-FOO=/usr
since it might have some unexpected side effects.
Instead, you can replace

configure --with-slurm --with-pmix=/usr --with-pmi=/usr --with-libevent=/usr

with

configure --with-slurm --with-pmix=external --with-pmi --with-libevent=external

to be on the safe side I also invite you to pass --with-hwloc=external
to the configure command line


Cheers,

Gilles
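
For reference, applying the commit patch mentioned above to a PMIx 2.2.1 source tree is straightforward (the directory name is whatever your unpacked tarball is called):

cd pmix-2.2.1
curl -LO https://github.com/pmix/pmix/commit/2e2f4445b45eac5a3fcbd409c81efe318876e659.patch
patch -p1 < 2e2f4445b45eac5a3fcbd409c81efe318876e659.patch
# then rebuild and reinstall PMIx before rebuilding Slurm and Open MPI against it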

On Sun, Feb 24, 2019 at 1:54 AM Passant A. Hafez
  wrote:

Hello Gilles,

Here are some details:

Slurm 18.08.4

PMIx 2.2.1 (as shown in /usr/include/pmix_version.h)

Libevent 2.0.21

srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
srun: pmix
srun: pmix_v2

Open MPI versions tested: 4.0.0 and 3.1.2


For each installation to be mentioned a different MPI Hello World program was 
compiled.
Jobs were submitted by sbatch, 2 node * 2 tasks per node then srun --mpi=pmix 
program

File 400ext_2x2.out (attached) is for OMPI 4.0.0 installation with configure 
options:
--with-slurm --with-pmix=/usr --with-pmi=/usr --with-libevent=/usr
and configure log:
Libevent support: external
PMIx support: External (2x)

File 400int_2x2.out (attached) is for OMPI 4.0.0 installation with configure 
options:
--with-slurm --with-pmix
and configure log:
Libevent support: internal (external libevent version is less than internal 
version 2.0.22)
PMIx support: Internal

Tested also different installations for 3.1.2 and got errors similar to 
400ext_2x2.out
(NOT-SUPPORTED in file event/pmix_event_registration.c at line 101)





All the best,
--
Passant A. Hafez | HPC Applications Specialist
KAUST Supercomputing Core Laboratory (KSL)
King Abdullah University of Science and Technology
Building 1, Al-Khawarizmi, Room 0123
Mobile : +966 (0) 55-247-9568
Mobile : +20 (0) 106-146-9644
Office  : +966 (0) 12-808-0367


From: users  on behalf of Gilles 
Gouaillardet
Sent: Saturday, February 23, 2019 5:17 PM
To: Open MPI Users
Subject: Re: [OMPI users] Building PMIx and Slurm support

Hi,

PMIx has cross-version compatibility, so as long as the PMIx library
used by SLURM is compatible with the one (internal or external) used
by Open MPI, you should be fine.
If you want to minimize the risk of cross-version incompatibility,
then I encourage you to use the same (and hence external) PMIx that
was used to build SLURM with Open MPI.

Can you tell a bit more than "it didn't work" ?
(Open MPI version, PMIx version used by SLURM, PMIx version used by
Open MPI, error message, ...)

Cheers,

Gilles

On Sat, Feb 23, 2019 at 9:46 PM Passant A. Hafez
  wrote:

Good day everyone,

I've trying to build and use the PMIx support for Open MPI but I tried many 
things that I can list if needed, but with no luck.
I was able to test the PMIx client but when I used OMPI specifying srun 
--mpi=pmix it didn't work.

So if you please advise me with the versions of each PMIx and Open MPI that 
should be working well with Slurm 18.08, it'd be great.

Also, what is the difference between using internal vs external PMIx 
installations?



All the best,

--

Passant A. Hafez | HPC Applications Specialist
KAUST Supercomputing Core Laboratory (KSL)
King Abdullah University of Science and Technology
Build