Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread Åke Sandgren

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile 
flags is broken.


And if hcoll include files are under $HCOLL_HOME/include/hcoll (and 
hcoll/api) then the include directives in the source should be

#include <hcoll/xxx.h>
and
#include <hcoll/api/xxx.h>
respectively.

I.e. one should never add -I$HCOLL_HOME/include/hcoll to CPPFLAGS, only 
-I$HCOLL_HOME/include.
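
As a concrete sketch of that convention (hcoll_api.h is taken from the
configure checks later in this thread; the compile line is illustrative only):

/* use_hcoll.c -- illustrative sketch only.  With the headers installed
 * under $HCOLL_HOME/include/hcoll/api, this compiles with just
 *     cc -I$HCOLL_HOME/include -c use_hcoll.c
 * i.e. no -I$HCOLL_HOME/include/hcoll or .../hcoll/api is ever needed. */
#include <hcoll/api/hcoll_api.h>

int main(void)
{
    return 0;   /* only the include-path convention is being demonstrated */
}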


Doing otherwise is bad design and a big cause of problems with include 
files from different packages having the same name...


My opinion at least...

On 08/11/2015 01:57 AM, Gilles Gouaillardet wrote:

David,

the configure help is misleading about hcoll ...

  --with-hcoll(=DIR)  Build hcoll (Mellanox Hierarchical Collectives)
   support, searching for libraries in DIR

the =DIR is not really optional ...
configure will not complain if you configure with --with-hcoll or
--with-hcoll=yes
but hcoll will be disabled in this case

FWIW, here is a snippet of my config.status generated with
--with-hcoll=$HCOLL_HOME
/* i manually 'unexpanded' $HCOLL_HOME */
S["coll_hcoll_LIBS"]="-lhcoll "
S["coll_hcoll_LDFLAGS"]=" -L$HCOLL_HOME/lib"
S["coll_hcoll_CPPFLAGS"]=" -I$HCOLL_HOME/include"
S["coll_hcoll_CFLAGS"]=""
S["coll_hcoll_HOME"]="$HCOLL_HOME"
S["coll_hcoll_extra_CPPFLAGS"]="-I$HCOLL_HOME/include/hcoll
-I$HCOLL_HOME/include/hcoll/api"

bottom line, if you configure with --with-hcoll=/usr it will add some
useless flags such as -L/usr/lib (or -L/usr/lib64, i am not sure about
that) and -I/usr/include
but it will also add the required -I/usr/include/hcoll
-I/usr/include/hcoll/api flags

if you believe this is an issue, i can revamp the hcoll detection (e.g.
configure --with-hcoll) but you might
need to manually set CPPFLAGS='-I/usr/include/hcoll
-I/usr/include/hcoll/api'
if not, i guess i will simply update the configure help message ...

Cheers,

Gilles

On 8/11/2015 7:39 AM, David Shrader wrote:

Hello All,

I'm having some trouble getting Open MPI 1.8.8 to configure correctly
when hcoll is installed in system space. That is, hcoll is installed
to /usr/lib64 and /usr/include/hcoll. I get an error during configure:

$> ./configure --with-hcoll
...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: static
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

I have also tried using "--with-hcoll=yes" and gotten the same
behavior. Has anyone else gotten the hcoll component to build when
hcoll itself is in system space? I am using hcoll-3.2.748.

I did take a look at configure, and it looks like there is a test on
"with_hcoll" to see if it is not empty and not yes on line 220072. In
my case, this test fails, so the else clause gets invoked. The else
clause is several hundred lines below on line 220822 and simply sets
ompi_check_hcoll_happy="no". Configure doesn't try to
do anything to figure out if hcoll is usable, but it does quit soon
after with the above error because ompi_check_hcoll_happy isn't "yes."

In case it helps, here is the output from config.log for that area:

...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: dso
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

##  ##
## Cache variables. ##
##  ##
...output snipped...

Have I missed something in specifying --with-hcoll? I would prefer not
to use "--with-hcoll=/usr" as I am pretty sure that spurious linker
flags to that area will work their way in when they shouldn't.

Thanks,
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov





--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread Gilles Gouaillardet

i do not know the context, so i should not jump to any conclusion ...
if xxx.h is in $HCOLL_HOME/include/hcoll in hcoll version Y, but in 
$HCOLL_HOME/include/hcoll/api in hcoll version Z, then the relative path 
to $HCOLL_HOME/include cannot be hard coded.


anyway, let's assume it is ok to hard code the relative paths ...
i made PR 796 for that https://github.com/open-mpi/ompi/pull/796

if hcoll/mxm/fca is in /usr, then you can simply run
./configure --with-mxm --with-fca --with-hcoll

could you please give it a try?
(since this is in git, you will need the right autotools to invoke 
autogen.pl)


Cheers,

Gilles

On 8/11/2015 2:39 PM, Åke Sandgren wrote:

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any 
compile flags is broken.


And if hcoll include files are under $HCOLL_HOME/include/hcoll (and 
hcoll/api) then the include directives in the source should be

#include <hcoll/xxx.h>
and
#include <hcoll/api/xxx.h>
respectively.

I.e. one should never add -I$HCOLL_HOME/include/hcoll to CPPFLAGS, 
only -I$HCOLL_HOME/include.


Doing otherwise is bad design and a big cause of problems with 
include files from different packages having the same name...


My opinion at least...

On 08/11/2015 01:57 AM, Gilles Gouaillardet wrote:

David,

the configure help is misleading about hcoll ...

  --with-hcoll(=DIR)  Build hcoll (Mellanox Hierarchical Collectives)
                      support, searching for libraries in DIR

the =DIR is not really optional ...
configure will not complain if you configure with --with-hcoll or
--with-hcoll=yes
but hcoll will be disabled in this case

FWIW, here is a snippet of my config.status generated with
--with-hcoll=$HCOLL_HOME
/* i manually 'unexpanded' $HCOLL_HOME */
S["coll_hcoll_LIBS"]="-lhcoll "
S["coll_hcoll_LDFLAGS"]=" -L$HCOLL_HOME/lib"
S["coll_hcoll_CPPFLAGS"]=" -I$HCOLL_HOME/include"
S["coll_hcoll_CFLAGS"]=""
S["coll_hcoll_HOME"]="$HCOLL_HOME"
S["coll_hcoll_extra_CPPFLAGS"]="-I$HCOLL_HOME/include/hcoll
-I$HCOLL_HOME/include/hcoll/api"

bottom line, if you configure with --with-hcoll=/usr it will add some
useless flags such as -L/usr/lib (or -L/usr/lib64, i am not sure about
that) and -I/usr/include
but it will also add the required -I/usr/include/hcoll
-I/usr/include/hcoll/api flags

if you believe this is an issue, i can revamp the hcoll detection (e.g.
configure --with-hcoll) but you might
need to manually set CPPFLAGS='-I/usr/include/hcoll
-I/usr/include/hcoll/api'
if not, i guess i will simply update the configure help message ...

Cheers,

Gilles

On 8/11/2015 7:39 AM, David Shrader wrote:

Hello All,

I'm having some trouble getting Open MPI 1.8.8 to configure correctly
when hcoll is installed in system space. That is, hcoll is installed
to /usr/lib64 and /usr/include/hcoll. I get an error during configure:

$> ./configure --with-hcoll
...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: static
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. 
Aborting


I have also tried using "--with-hcoll=yes" and gotten the same
behavior. Has anyone else gotten the hcoll component to build when
hcoll itself is in system space? I am using hcoll-3.2.748.

I did take a look at configure, and it looks like there is a test on
"with_hcoll" to see if it is not empty and not yes on line 220072. In
my case, this test fails, so the else clause gets invoked. The else
clause is several hundred lines below on line 220822 and simply sets
ompi_check_hcoll_happy="no". Configure doesn't try to
do anything to figure out if hcoll is usable, but it does quit soon
after with the above error because ompi_check_hcoll_happy isn't "yes."

In case it helps, here is the output from config.log for that area:

...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: dso
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. 
Aborting


##  ##
## Cache variables. ##
##  ##
...output snipped...

Have I missed something in specifying --with-hcoll? I would prefer not
to use "--with-hcoll=/usr" as I am pretty sure that spurious linker
flags to that area will work their way in when they shouldn't.

Thanks,
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread Åke Sandgren



On 08/11/2015 10:22 AM, Gilles Gouaillardet wrote:

i do not know the context, so i should not jump to any conclusion ...
if xxx.h is in $HCOLL_HOME/include/hcoll in hcoll version Y, but in
$HCOLL_HOME/include/hcoll/api in hcoll version Z, then the relative path
to $HCOLL_HOME/include cannot be hard coded.


It can be done by using version detection of hcoll and #if/#else 
around the includes. But the risk of files moving in or out of an "api" 
include dir (relative to another include dir in the same package) should 
be fairly small, I think, regardless of whether it is hcoll or some other package.
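
For illustration, a minimal sketch of what that version gating could look
like (the HCOLL_VERSION_MAJOR macro and the exact header names are
assumptions for the example, not taken from an actual hcoll release):

/* Sketch only: the version macro and header locations are hypothetical.
 * The point is that the include prefix can be selected at compile time
 * once the hcoll version is known, instead of adjusting -I flags. */
#if defined(HCOLL_VERSION_MAJOR) && (HCOLL_VERSION_MAJOR >= 3)
#include <hcoll/api/hcoll_api.h>   /* layout with an api/ subdirectory */
#else
#include <hcoll/hcoll_api.h>       /* layout without the api/ subdirectory */
#endif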


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread Jeff Squyres (jsquyres)
On Aug 11, 2015, at 1:39 AM, Åke Sandgren  wrote:
> 
> Please fix the hcoll test (and code) to be correct.
> 
> Any configure test that adds /usr/lib and/or /usr/include to any compile 
> flags is broken.

+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
comments to it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-11 Thread Jeremia Bär

Hi!

In my current application, MPI_Send/MPI_Recv hangs when using buffers in 
GPU device memory of an Nvidia GPU. I realized this is due to the fact 
that OpenMPI uses the synchronous cuMemcpy rather than the asynchronous 
cuMemcpyAsync (see stacktrace at the bottom). However, in my 
application, synchronous copies cannot be used.
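
For context, a minimal sketch (mine, not from the original post) of the
pattern in question: a device pointer is handed straight to
MPI_Send/MPI_Recv, which requires a CUDA-aware Open MPI build.

/* gpu_send.c -- illustrative sketch, assuming a CUDA-aware Open MPI.
 * Build with something like:  mpicc gpu_send.c -o gpu_send -lcudart
 * Run with two ranks:         mpirun -np 2 ./gpu_send                 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1024;
    double *dbuf;                      /* pointer into GPU device memory */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&dbuf, n * sizeof(double));

    if (rank == 0) {
        cudaMemset(dbuf, 0, n * sizeof(double));
        /* the device pointer itself is passed to MPI */
        MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles into device memory\n", n);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}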


I scanned through the source and saw that support for async memcpy's is 
available. It's controlled by 'mca_common_cuda_cumemcpy_async' in

./ompi/mca/common/cuda/common_cuda.c
However, I can't find a way to enable it. It's not exposed in 
'ompi_info' (but registered?). How can I enforce the use of 
cuMemcpyAsync in OpenMPI? Version used is OpenMPI 1.8.5.


Thank you,
Jeremia

(gdb) bt
#0  0x2aaaba11 in clock_gettime ()
#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from 
/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_cuda.so.1
#11 0x2c992544 in opal_cuda_memcpy () from 
/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
#12 0x2c98adf7 in opal_convertor_pack () from 
/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from 
/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pml_ob1.so
#14 0x2aaab167353f in mca_pml_ob1_send () from 
/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pml_ob1.so
#15 0x2bf4f322 in PMPI_Send () from 
/users/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmpi.so.1




Re: [OMPI users] SGE problems w/OpenMPI 1.8.7

2015-08-11 Thread Dave Love
"Lane, William"  writes:

> I'm running a mixed cluster of Blades (HS21 and HS22 chassis), x3550-M3 and 
> X3550-M4 systems, some of which support hyperthreading, while others
> don't (specifically the HS21 blades) all on CentOS 6.3 w/SGE.

Do you mean jobs are split across nodes which have hyperthreading on,
and ones which don't (and you're trying to use the threads where they're
on)?  That doesn't seem a good idea.  (You could turn off threads
per-job in a root-privileged prolog, pe_starter, or shepherd_cmd; or it
would probably work to set the slot count to the core count and bind to
cores.)

> I have no problems running my simple OpenMPI 1.8.7 test code outside of SGE 
> (with or without the --bind-to core switch, but can only run the jobs within
> SGE via qrsh on a limited number of slots (4 at most) successfully. The 
> errors are very similar to the ones I was getting running OpenMPI 1.8.5 - 
> 1.8.6 outside of SGE
> on this same cluster.
>
> Strangely, when running the test code outside of SGE w/the --bind-to core 
> switch, mpirun still binds processes to hyperthreading cores. Additionally,
> the --bind-to core switch prevents the OpenMPI 1.8.7 test code from running 
> at all within SGE (it outputs warnings about missing NUMA libraries reducing 
> performance
> then exits).

Are you doing SGE core binding?

> We would rather run out OpenMPI jobs from within SGE so that we can get 
> accounting data on OpenMPI jobs for administrative purposes.
>
> The orte PE I'm been using seems to meet all the requirements for previous 
> versions of OpenMPI:
> the allocation rule is fill-up, rather than round-robin (I'm not sure if this 
> makes a difference or not)

If you're really going to have heterogeneous threading, I'd guess you'd
best allocate only whole nodes and let openmpi do the binding.

[procenv is recommended for comparing the job's generalized environment
with the environment outside the resource manager.]


Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 1.8.7

2015-08-11 Thread Dave Love
"Lane, William"  writes:

> I read @
>
> https://www.open-mpi.org/faq/?category=sge
>
> that for OpenMPI Parallel Environments there's
> a special consideration for Son of Grid Engine:
>
>'"qsort_args" is necessary with the Son of Grid Engine distribution,
>version 8.1.1 and later, and probably only applicable to it.  For
>very old versions of SGE, omit "accounting_summary" too.'
>
> Does this requirement still hold true for OpenMPI 1.8.7? Because
> the webpage above only refers to much older versions of OpenMPI.

That's actually unrelated to OMPI, and the current distribution contains
an "mpi" PE for tight integration which should work with OMPI and modern
MPICH-y startup (hydra?), at least.


Re: [OMPI users] What Red Hat Enterprise/CentOS NUMA libraries are recommended/required for OpenMPI?

2015-08-11 Thread Dave Love
Ralph Castain  writes:

> Hi Bill
>
> You need numactl-devel on the nodes. Not having them means we cannot ensure
> memory is bound local to the procs, which will hurt performance but not
> much else. There is an MCA param to turn off the warnings if you choose not
> to install the libs: hwloc_base_mem_bind_failure_action=silent

Why should you need the -devel package on the compute nodes?  (It only
contains the .h and .so files.)  The RHEL and Fedora packages don't
require it and work.

[For an up-to-date OMPI, you can rebuild the package against the current
tarball, at least after the chaos caused by RHEL 6.6 updating
incompatibly to 1.8.  Otherwise use the Fedora packaging, which is kept
quite current.]


Re: [OMPI users] What Red Hat Enterprise/CentOS NUMA libraries are recommended/required for OpenMPI?

2015-08-11 Thread Ralph Castain
Because only the devel package includes the necessary pieces to set memory
affinity.


On Tue, Aug 11, 2015 at 9:37 AM, Dave Love  wrote:

> Ralph Castain  writes:
>
> > Hi Bill
> >
> > You need numactl-devel on the nodes. Not having them means we cannot
> ensure
> > memory is bound local to the procs, which will hurt performance but not
> > much else. There is an MCA param to turn off the warnings if you choose
> not
> > to install the libs: hwloc_base_mem_bind_failure_action=silent
>
> Why should you need the -devel package on the compute nodes?  (It only
> contains the .h and .so files.)  The RHEL and Fedora packages don't
> require it and work.
>
> [For an up-to-date OMPI, you can rebuild the package against the current
> tarball, at least after the chaos caused by RHEL 6.6 updating
> incompatibly to 1.8.  Otherwise use the Fedora packaging, which is kept
> quite current.]


Re: [OMPI users] What Red Hat Enterprise/CentOS NUMA libraries are recommended/required for OpenMPI?

2015-08-11 Thread Jeff Squyres (jsquyres)
I think Dave's point is that numactl-devel (and numactl) is only needed for 
*building* Open MPI.  Users only need numactl to *run* Open MPI.

Specifically, numactl-devel contains the .h files we need to compile OMPI 
against libnuma:

$ rpm -ql numactl-devel
/usr/include/numa.h
/usr/include/numacompat1.h
/usr/include/numaif.h
/usr/lib64/libnuma.a
/usr/lib64/libnuma.so
/usr/share/man/man3/numa.3.gz

Note that the .so is a sym link to .so.1, in the main numactl package:

$ rpm -ql numactl
/usr/bin/memhog
/usr/bin/migratepages
/usr/bin/migspeed
/usr/bin/numactl
/usr/bin/numademo
/usr/bin/numastat
/usr/lib64/libnuma.so.1
/usr/share/man/man8/migratepages.8.gz
/usr/share/man/man8/migspeed.8.gz
/usr/share/man/man8/numactl.8.gz
/usr/share/man/man8/numastat.8.gz
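
As a small illustration of that split (this example is mine, not from the
thread): compiling the snippet below needs numa.h and the libnuma.so symlink
from numactl-devel, while running the resulting binary only needs
libnuma.so.1 from numactl.

/* check_numa.c -- illustration only.
 * Build:  cc check_numa.c -o check_numa -lnuma   (needs numactl-devel)
 * Run:    ./check_numa                           (needs only numactl)  */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }
    printf("NUMA is available, highest node: %d\n", numa_max_node());
    return 0;
}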


> On Aug 11, 2015, at 12:42 PM, Ralph Castain  wrote:
> 
> Because only the devel package includes the necessary pieces to set memory 
> affinity.
> 
> 
> On Tue, Aug 11, 2015 at 9:37 AM, Dave Love  wrote:
> Ralph Castain  writes:
> 
> > Hi Bill
> >
> > You need numactl-devel on the nodes. Not having them means we cannot ensure
> > memory is bound local to the procs, which will hurt performance but not
> > much else. There is an MCA param to turn off the warnings if you choose not
> > to install the libs: hwloc_base_mem_bind_failure_action=silent
> 
> Why should you need the -devel package on the compute nodes?  (It only
> contains the .h and .so files.)  The RHEL and Fedora packages don't
> require it and work.
> 
> [For an up-to-date OMPI, you can rebuild the package against the current
> tarball, at least after the chaos caused by RHEL 6.6 updating
> incompatibly to 1.8.  Otherwise use the Fedora packaging, which is kept
> quite current.]


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] Problem in using openmpi-1.8.7

2015-08-11 Thread Amos Leffler
Dear Users,
I have run into a problem with openmpi-1.8.7.  It configures and 
installs properly but when I tested it using examples it gave me numerous 
errors with mpicc as shown in the output below.  Have I made an error in the 
process?

Amoss-MacBook-Pro:openmpi-1.8.7 amosleff$ cd examples
Amoss-MacBook-Pro:examples amosleff$ mpicc hello_c.c -o hello_c -g
Amoss-MacBook-Pro:examples amosleff$ mpiexec hello_c
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_ess_slurmd: 
dlopen(/usr/local/lib/openmpi/mca_ess_slurmd.so, 9): Symbol not found: 
_orte_jmap_t_class
  Referenced from: /usr/local/lib/openmpi/mca_ess_slurmd.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_ess_slurmd.so (ignored)
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_errmgr_default: 
dlopen(/usr/local/lib/openmpi/mca_errmgr_default.so, 9): Symbol not found: 
_orte_errmgr_base_error_abort
  Referenced from: /usr/local/lib/openmpi/mca_errmgr_default.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_errmgr_default.so (ignored)
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_routed_cm: 
dlopen(/usr/local/lib/openmpi/mca_routed_cm.so, 9): Symbol not found: 
_orte_message_event_t_class
  Referenced from: /usr/local/lib/openmpi/mca_routed_cm.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_routed_cm.so (ignored)
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_routed_linear: 
dlopen(/usr/local/lib/openmpi/mca_routed_linear.so, 9): Symbol not found: 
_orte_message_event_t_class
  Referenced from: /usr/local/lib/openmpi/mca_routed_linear.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_routed_linear.so (ignored)
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_grpcomm_basic: 
dlopen(/usr/local/lib/openmpi/mca_grpcomm_basic.so, 9): Symbol not found: 
_opal_profile
  Referenced from: /usr/local/lib/openmpi/mca_grpcomm_basic.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_grpcomm_basic.so (ignored)
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_grpcomm_hier: 
dlopen(/usr/local/lib/openmpi/mca_grpcomm_hier.so, 9): Symbol not found: 
_orte_daemon_cmd_processor
  Referenced from: /usr/local/lib/openmpi/mca_grpcomm_hier.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_grpcomm_hier.so (ignored)
[Amoss-MacBook-Pro.local:61027] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_filem_rsh: 
dlopen(/usr/local/lib/openmpi/mca_filem_rsh.so, 9): Symbol not found: 
_opal_uses_threads
  Referenced from: /usr/local/lib/openmpi/mca_filem_rsh.so
  Expected in: flat namespace
 in /usr/local/lib/openmpi/mca_filem_rsh.so (ignored)
[Amoss-MacBook-Pro:61027] *** Process received signal ***
[Amoss-MacBook-Pro:61027] Signal: Segmentation fault: 11 (11)
[Amoss-MacBook-Pro:61027] Signal code: Address not mapped (1)
[Amoss-MacBook-Pro:61027] Failing at address: 0x10013
[Amoss-MacBook-Pro:61027] [ 0] 0   libsystem_platform.dylib
0x7fff92aebf1a _sigtramp + 26
[Amoss-MacBook-Pro:61027] [ 1] 0   ??? 
0x7fff508ce0af 0x0 + 140734544797871
[Amoss-MacBook-Pro:61027] [ 2] 0   libopen-rte.7.dylib 
0x00010f386e45 orte_rmaps_base_map_job + 2789
[Amoss-MacBook-Pro:61027] [ 3] 0   libopen-pal.6.dylib 
0x00010f3ffaed opal_libevent2021_event_base_loop + 2333
[Amoss-MacBook-Pro:61027] [ 4] 0   mpiexec 
0x00010f333288 orterun + 6440
[Amoss-MacBook-Pro:61027] [ 5] 0   mpiexec 
0x00010f331942 main + 34
[Amoss-MacBook-Pro:61027] [ 6] 0   libdyld.dylib   
0x7fff94d455c9 start + 1
[Amoss-MacBook-Pro:61027] [ 7] 0   ??? 
0x0002 0x0 + 2
[Amoss-MacBook-Pro:61027] *** End of error message ***
Segmentation fault: 11

Your help would be much appreciated.
Amos Leffler

Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-11 Thread Rolf vandeVaart
I talked with Jeremia off list and we figured out what was going on.  There is 
the ability to use the cuMemcpyAsync/cuStreamSynchronize rather than the 
cuMemcpy but it was never made the default for Open MPI 1.8 series.  So, to get 
that behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior 
in 1.10 and all future versions.  In addition, he is right about not being able 
to see these variables in the Open MPI 1.8 series.  This was a bug and it has 
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that 
back into 1.10.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of an Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMemcpy rather than the asynchronous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw that support for async memcpy's is
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_c
>uda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#15 0x2bf4f322 in PMPI_Send () from
>/users/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmpi.so.1
>
---
This email message is for the sole use of the intended recipient(s) and may
contain confidential information.  Any unauthorized review, use, disclosure
or distribution is prohibited.  If you are not the intended recipient,
please contact the sender by reply email and destroy all copies of the
original message.
---


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread David Shrader
I have cloned Gilles' topic/hcoll_config branch and, after running 
autogen.pl, have found that './configure --with-hcoll' does indeed work 
now. I used Gilles' branch as I wasn't sure how best to get the pull 
request changes into my own clone of master. It looks like the proper 
checks are happening, too:


--- MCA component coll:hcoll (m4 configuration macro)
checking for MCA component coll:hcoll compile mode... dso
checking --with-hcoll value... simple ok (unspecified)
checking hcoll/api/hcoll_api.h usability... yes
checking hcoll/api/hcoll_api.h presence... yes
checking for hcoll/api/hcoll_api.h... yes
looking for library without search path
checking for library containing hcoll_get_version... -lhcoll
checking if MCA component coll:hcoll can compile... yes

I haven't checked whether or not Open MPI builds successfully as I don't 
have much experience running off of the latest source. For now, I think 
I will try to generate a patch to the 1.8.8 configure script and see if 
that works as expected.


Thanks,
David

On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:

On Aug 11, 2015, at 1:39 AM, Åke Sandgren  wrote:

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile flags 
is broken.

+1

Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
comments to it.



--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



Re: [OMPI users] Son of Grid Engine, Parallel Environments and OpenMPI 1.8.7

2015-08-11 Thread Lane, William
I can successfully run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine but 
not via qrsh. We're
using CentOS 6.3 and a heterogeneous cluster of hyperthreaded and 
non-hyperthreaded blades
and x3550 chassis. OpenMPI 1.8.7 has been built w/the debug switch as well.

Here's my latest errors:
qrsh -V -now yes -pe mpi 209 mpirun -np 209 -display-devel-map --prefix 
/hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes --bind-to core 
/hpc/home/lanew/mpi/openmpi/ProcessColors3
error: executing task of job 211298 failed: execution daemon on host 
"csclprd3-0-4" didn't accept task
error: executing task of job 211298 failed: execution daemon on host 
"csclprd3-4-1" didn't accept task
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--

NOTE: the hosts that "didn't accept task" were different in two different runs 
but the errors were the same.

Here's the definition of the mpi Parallel Environment on our Son-of-Gridengine 
cluster:

pe_name            mpi
slots              
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

Qsort_args is set to NONE, but it's supposed to be set to TRUE, right?

-Bill L.

If I can run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine with no 
issues, it has to be Son-of-Gridengine that's the issue, right?

-Bill L.

From: users [users-boun...@open-mpi.org] on behalf of Dave Love 
[d.l...@liverpool.ac.uk]
Sent: Tuesday, August 11, 2015 9:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] Son of Grid Engine, Parallel Environments and 
OpenMPI 1.8.7

"Lane, William"  writes:

> I read @
>
> https://www.open-mpi.org/faq/?category=sge
>
> that for OpenMPI Parallel Environments there's
> a special consideration for Son of Grid Engine:
>
>'"qsort_args" is necessary with the Son of Grid Engine distribution,
>version 8.1.1 and later, and probably only applicable to it.  For
>very old versions of SGE, omit "accounting_summary" too.'
>
> Does this requirement still hold true for OpenMPI 1.8.7? Because
> the webpage above only refers to much older versions of OpenMPI.

That's actually unrelated to OMPI, and the current distribution contains
an "mpi" PE for tight integration which should work with OMPI and modern
MPICH-y startup (hydra?), at least.
IMPORTANT WARNING: This message is intended for the use of the person or entity 
to which it is addressed and may contain information that is privileged and 
confidential, the disclosure of which is governed by applicable law. If the 
reader of this message is not the intended recipient, or the employee or agent 
responsible for delivering it to the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this information is 
strictly prohibited. Thank you for your cooperation.