Re: [OMPI users] static OpenMPI with GNU

2015-11-13 Thread Rolf vandeVaart
A workaround is to add --disable-vt to your configure line if you do not care 
about having VampirTrace support.
Not a solution, but might help you make progress.
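For example, the configure line quoted below would just gain that one extra flag; something like this (same paths and options, shown only as an illustration):

./configure --prefix=/home/milias/bin/openmpi-1.10.1-gnu-static CXX=g++ CC=gcc F77=gfortran FC=gfortran LDFLAGS="--static" LIBS="-ldl -lrt" --disable-shared --enable-static --disable-vt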

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ilias Miroslav
Sent: Friday, November 13, 2015 11:30 AM
To: us...@open-mpi.org
Subject: [OMPI users] static OpenMPI with GNU


Greeting,



I am trying to compile the static version of OpenMPI, with GNU.



The configuration command :


mil...@login.grid.umb.sk:~/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/../configure
 --prefix=/home/milias/bin/openmpi-1.10.1-gnu-static  CXX=g++ CC=gcc 
F77=gfortran FC=gfortran  LDFLAGS="--static" LIBS="-ldl -lrt" --disable-shared 
--enable-static

But the compilation ends with the error below.

I thought that the -lrt would fix it (/usr/lib64/librt.a), but no luck. Any help,
please?
Miro

make[10]: Entering directory 
`/home/milias/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/ompi/contrib/vt/vt/extlib/otf/tools/otfmerge/mpi'
  CC   otfmerge_mpi-handler.o
  CC   otfmerge_mpi-otfmerge.o
  CCLD otfmerge-mpi
/home/milias/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/opal/.libs/libopen-pal.a(memory_linux_munmap.o):
 In function `opal_memory_linux_free_ptmalloc2_munmap':
memory_linux_munmap.c:(.text+0x3d): undefined reference to `__munmap'
/home/milias/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/opal/.libs/libopen-pal.a(memory_linux_munmap.o):
 In function `munmap':
memory_linux_munmap.c:(.text+0x87): undefined reference to `__munmap'
collect2: ld returned 1 exit status
make[10]: *** [otfmerge-mpi] Error 1









Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread Rolf vandeVaart
Ed, how large are the messages that you are sending and receiving?
Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ed Blosch
Sent: Thursday, June 27, 2013 9:01 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked.  All matching sends are posted 1:1 
with posted recvs, so it is a delivery issue of some kind.  I'm running a 
debug-compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
while waiting for recvs to be posted.
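A minimal sketch of that irecv/barrier/isend/waitall ordering (hypothetical buffer, neighbor, and request names; error checking omitted):

#include <mpi.h>
#include <stdlib.h>

/* Post every receive, barrier, then post the sends, then wait on everything.
 * The barrier guarantees all receives are posted before any send starts, so
 * sends cannot sit buffered waiting for a matching receive to appear. */
void exchange(MPI_Comm comm, int nneighbors, const int *neighbors,
              double **recvbuf, double **sendbuf, int count)
{
    MPI_Request *reqs = malloc(2 * nneighbors * sizeof(MPI_Request));

    for (int i = 0; i < nneighbors; i++)
        MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, neighbors[i], 0, comm, &reqs[i]);

    MPI_Barrier(comm);

    for (int i = 0; i < nneighbors; i++)
        MPI_Isend(sendbuf[i], count, MPI_DOUBLE, neighbors[i], 0, comm,
                  &reqs[nneighbors + i]);

    MPI_Waitall(2 * nneighbors, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}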


Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone



 Original message 
From: George Bosilca <bosi...@icl.utk.edu>
Date:
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Application hangs on mpi_waitall


Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.
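A rough sketch of that warm-up step (names and the per-peer size are illustrative; error checking omitted):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* One all-to-all forces every pairwise connection to be established before
 * the real non-blocking traffic starts; roughly 100k per peer as suggested. */
void warmup_connections(MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);
    const int count = 100 * 1024;                 /* bytes sent to each peer */
    char *sbuf = malloc((size_t)count * size);
    char *rbuf = malloc((size_t)count * size);
    memset(sbuf, 0, (size_t)count * size);        /* contents do not matter */
    MPI_Alltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE, comm);
    free(sbuf);
    free(rbuf);
}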

  George.


On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>>
>> Thanks,
>>
>> Ed
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 and 1.7.2

2013-07-08 Thread Rolf vandeVaart
With respect to the CUDA-aware support, Ralph is correct.  The ability to send 
and receive GPU buffers is in the Open MPI 1.7 series, and incremental 
improvements will continue to be added throughout that series.  CUDA 5.0 is supported.
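As an illustration of what CUDA-aware means here, a device pointer can be handed straight to the MPI API (a minimal sketch; error checking omitted and at least two ranks assumed):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *d_buf;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

    /* The GPU buffer is passed directly; no explicit staging copy to host
     * memory is required when the library is built with CUDA support. */
    if (size >= 2) {
        if (rank == 0)
            MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}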



From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Saturday, July 06, 2013 5:14 PM
To: Open MPI Users
Subject: Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 and 1.7.2

There was discussion of this on a prior email thread on the OMPI devel mailing 
list:

http://www.open-mpi.org/community/lists/devel/2013/05/12354.php


On Jul 6, 2013, at 2:01 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:


thanks,
Do you guys have any plan to support Intel Phi in the future? That is, running 
MPI code on the Phi cards or across the multicore and Phi, as Intel MPI does?
thanks...
Michael

On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
Rolf will have to answer the question on level of support. The CUDA code is not 
in the 1.6 series as it was developed after that series went "stable". It is in 
the 1.7 series, although the level of support will likely be incrementally 
increasing as that "feature" series continues to evolve.


On Jul 6, 2013, at 12:06 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on 
> OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it 
> seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without 
> copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do 
> you support SDK 5.0 and above?
>
> Cheers ...
> Michael
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35

2013-08-14 Thread Rolf vandeVaart
It is looking for the libcuda.so file, not the libcudart.so file.  So, maybe try 
--with-cuda-libdir=/usr/lib64.
You need to be on a machine with the CUDA driver installed.  What was your 
configure command?

http://www.open-mpi.org/faq/?category=building#build-cuda

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ray
>Sheppard
>Sent: Wednesday, August 14, 2013 2:49 PM
>To: Open MPI Users
>Subject: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35
>
>Hello,
>   When I try to run my configure script, it dies with the following.
>Below it are the actual libraries in the directory. Could the solution be as
>simple as adding "rt" somewhere in the configure script?  Thanks.
>  Ray
>
>checking if --with-cuda-libdir is set... not found
>configure: WARNING: Expected file
>/opt/nvidia/cudatoolkit/5.0.35/lib64/libcuda.* not found
>configure: error: Cannot continue
>rsheppar@login1:/N/dc/projects/ray/br2/openmpi-1.7.2> ls -l
>/opt/nvidia/cudatoolkit/5.0.35/lib64/
>total 356284
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libcublas.so ->
>libcublas.so.5.0
>lrwxrwxrwx 1 root root19 Mar 18 14:35 libcublas.so.5.0 ->
>libcublas.so.5.0.35
>-rwxr-xr-x 1 root root  58852880 Sep 26  2012 libcublas.so.5.0.35
>-rw-r--r-- 1 root root  21255400 Sep 26  2012 libcublas_device.a
>-rw-r--r-- 1 root root456070 Sep 26  2012 libcudadevrt.a
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libcudart.so ->
>libcudart.so.5.0
>lrwxrwxrwx 1 root root19 Mar 18 14:35 libcudart.so.5.0 ->
>libcudart.so.5.0.35
>-rwxr-xr-x 1 root root375752 Sep 26  2012 libcudart.so.5.0.35
>lrwxrwxrwx 1 root root15 Mar 18 14:35 libcufft.so -> libcufft.so.5.0
>lrwxrwxrwx 1 root root18 Mar 18 14:35 libcufft.so.5.0 ->
>libcufft.so.5.0.35
>-rwxr-xr-x 1 root root  30787712 Sep 26  2012 libcufft.so.5.0.35
>lrwxrwxrwx 1 root root17 Mar 18 14:35 libcuinj64.so ->
>libcuinj64.so.5.0
>lrwxrwxrwx 1 root root20 Mar 18 14:35 libcuinj64.so.5.0 ->
>libcuinj64.so.5.0.35
>-rwxr-xr-x 1 root root   1306496 Sep 26  2012 libcuinj64.so.5.0.35
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libcurand.so ->
>libcurand.so.5.0
>lrwxrwxrwx 1 root root19 Mar 18 14:35 libcurand.so.5.0 ->
>libcurand.so.5.0.35
>-rwxr-xr-x 1 root root  25281224 Sep 26  2012 libcurand.so.5.0.35
>lrwxrwxrwx 1 root root18 Mar 18 14:35 libcusparse.so ->
>libcusparse.so.5.0
>lrwxrwxrwx 1 root root21 Mar 18 14:35 libcusparse.so.5.0 ->
>libcusparse.so.5.0.35
>-rwxr-xr-x 1 root root 132455240 Sep 26  2012 libcusparse.so.5.0.35
>lrwxrwxrwx 1 root root13 Mar 18 14:35 libnpp.so -> libnpp.so.5.0
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libnpp.so.5.0 ->
>libnpp.so.5.0.35
>-rwxr-xr-x 1 root root  93602912 Sep 26  2012 libnpp.so.5.0.35
>lrwxrwxrwx 1 root root20 Mar 18 14:35 libnvToolsExt.so ->
>libnvToolsExt.so.5.0
>lrwxrwxrwx 1 root root23 Mar 18 14:35 libnvToolsExt.so.5.0 ->
>libnvToolsExt.so.5.0.35
>-rwxr-xr-x 1 root root 31280 Sep 26  2012 libnvToolsExt.so.5.0.35
>
>
>
>--
>  Respectfully,
>Ray Sheppard
>rshep...@iu.edu
>http://pti.iu.edu/sciapt
>317-274-0016
>
>Principal Analyst
>Senior Technical Lead
>Scientific Applications and Performance Tuning
>Research Technologies
>University Information Technological Services
>IUPUI campus
>Indiana University
>
>My "pithy" saying:  Science is the art of translating the world
>into language. Unfortunately, that language is mathematics.
>Bumper sticker wisdom: Make it idiot-proof and they will make a
>better idiot.
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35

2013-08-14 Thread Rolf vandeVaart
Check to see if you have libcuda.so in /usr/lib64.  If so, then this should 
work:

--with-cuda=/opt/nvidia/cudatoolkit/5.0.35 

The configure will find the libcuda.so in /usr/lib64.
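For example, the configure command quoted below would become something like this (same options, just without the --with-cuda-libdir flag; shown only as an illustration):

./configure CC=gcc FC=gfortran CFLAGS="-O2" F77=gfortran FCFLAGS="-O2" --enable-static --disable-shared --disable-vt --with-threads=posix --with-gnu-ld --with-alps --with-cuda=/opt/nvidia/cudatoolkit/5.0.35 --prefix=/N/soft/cle4/openmpi/gnu/1.7.2/cuda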

>-Original Message-
>From: Ray Sheppard [mailto:rshep...@iu.edu]
>Sent: Wednesday, August 14, 2013 2:59 PM
>To: Open MPI Users
>Cc: Rolf vandeVaart
>Subject: Re: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35
>
>Thank you for the quick reply Rolf,
>   I personally don't know the Cuda libraries. I was hoping there had been a
>name change.  I am on a Cray XT-7.
>Here is my configure command:
>
>./configure CC=gcc FC=gfortran CFLAGS="-O2" F77=gfortran FCFLAGS="-O2"
>--enable-static --disable-shared  --disable-vt --with-threads=posix --with-gnu-
>ld --with-alps --with-cuda=/opt/nvidia/cudatoolkit/5.0.35
>--with-cuda-libdir=/opt/nvidia/cudatoolkit/5.0.35/lib64
>--prefix=/N/soft/cle4/openmpi/gnu/1.7.2/cuda
>
>Ray
>
>On 8/14/2013 2:50 PM, Rolf vandeVaart wrote:
>> It is looking for the libcuda.so file, not the libcudart.so file.  So, maybe try
>> --with-cuda-libdir=/usr/lib64.
>> You need to be on a machine with the CUDA driver installed.  What was your
>configure command?
>>
>> http://www.open-mpi.org/faq/?category=building#build-cuda
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ray
>>> Sheppard
>>> Sent: Wednesday, August 14, 2013 2:49 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35
>>>
>>> Hello,
>>>When I try to run my configure script, it dies with the following.
>>> Below it are the actual libraries in the directory. Could the
>>> solution be as simple as adding "rt" somewhere in the configure script?
>Thanks.
>>>   Ray
>>>
>>> checking if --with-cuda-libdir is set... not found
>>> configure: WARNING: Expected file
>>> /opt/nvidia/cudatoolkit/5.0.35/lib64/libcuda.* not found
>>> configure: error: Cannot continue
>>> rsheppar@login1:/N/dc/projects/ray/br2/openmpi-1.7.2> ls -l
>>> /opt/nvidia/cudatoolkit/5.0.35/lib64/
>>> total 356284
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libcublas.so ->
>>> libcublas.so.5.0
>>> lrwxrwxrwx 1 root root19 Mar 18 14:35 libcublas.so.5.0 ->
>>> libcublas.so.5.0.35
>>> -rwxr-xr-x 1 root root  58852880 Sep 26  2012 libcublas.so.5.0.35
>>> -rw-r--r-- 1 root root  21255400 Sep 26  2012 libcublas_device.a
>>> -rw-r--r-- 1 root root456070 Sep 26  2012 libcudadevrt.a
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libcudart.so ->
>>> libcudart.so.5.0
>>> lrwxrwxrwx 1 root root19 Mar 18 14:35 libcudart.so.5.0 ->
>>> libcudart.so.5.0.35
>>> -rwxr-xr-x 1 root root375752 Sep 26  2012 libcudart.so.5.0.35
>>> lrwxrwxrwx 1 root root15 Mar 18 14:35 libcufft.so -> libcufft.so.5.0
>>> lrwxrwxrwx 1 root root18 Mar 18 14:35 libcufft.so.5.0 ->
>>> libcufft.so.5.0.35
>>> -rwxr-xr-x 1 root root  30787712 Sep 26  2012 libcufft.so.5.0.35
>>> lrwxrwxrwx 1 root root17 Mar 18 14:35 libcuinj64.so ->
>>> libcuinj64.so.5.0
>>> lrwxrwxrwx 1 root root20 Mar 18 14:35 libcuinj64.so.5.0 ->
>>> libcuinj64.so.5.0.35
>>> -rwxr-xr-x 1 root root   1306496 Sep 26  2012 libcuinj64.so.5.0.35
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libcurand.so ->
>>> libcurand.so.5.0
>>> lrwxrwxrwx 1 root root19 Mar 18 14:35 libcurand.so.5.0 ->
>>> libcurand.so.5.0.35
>>> -rwxr-xr-x 1 root root  25281224 Sep 26  2012 libcurand.so.5.0.35
>>> lrwxrwxrwx 1 root root18 Mar 18 14:35 libcusparse.so ->
>>> libcusparse.so.5.0
>>> lrwxrwxrwx 1 root root21 Mar 18 14:35 libcusparse.so.5.0 ->
>>> libcusparse.so.5.0.35
>>> -rwxr-xr-x 1 root root 132455240 Sep 26  2012 libcusparse.so.5.0.35
>>> lrwxrwxrwx 1 root root13 Mar 18 14:35 libnpp.so -> libnpp.so.5.0
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libnpp.so.5.0 ->
>>> libnpp.so.5.0.35
>>> -rwxr-xr-x 1 root root  93602912 Sep 26  2012 libnpp.so.5.0.35
>>> lrwxrwxrwx 1 root root20 Mar 18 14:35 libnvToolsExt.so ->
>>> libnvToolsExt.so.5.0
>>> lrwxrwxrwx 1 root root23 Mar 18 14:35 libnvToolsExt.so.5.0 ->
>>> libnvToolsExt.so.5.0.35
>>> -rwxr-xr-x 1 root root 31280 Sep 26  2012 libnvToolsE

[OMPI users] CUDA-aware usage

2013-10-01 Thread Rolf vandeVaart
We have done some work over the last year or two to add some CUDA-aware support 
into the Open MPI library.  Details on building and using the feature are here.

http://www.open-mpi.org/faq/?category=building#build-cuda
http://www.open-mpi.org/faq/?category=running#mpi-cuda-support

I am looking for any feedback on this feature from anyone who has taken 
advantage of it.  You can just send the response to me if you want and I 
will compile the feedback.

Rolf



Re: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11

2013-10-07 Thread Rolf vandeVaart
That might be a bug.  While I am checking, you could try configuring with this 
additional flag:

--enable-mca-no-build=pml-bfo

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Hammond,
>Simon David (-EXP)
>Sent: Monday, October 07, 2013 3:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11
>
>Hey everyone,
>
>I am trying to build OpenMPI 1.7.2 with CUDA enabled. OpenMPI will
>configure successfully, but I am seeing a build error relating to the inclusion of
>the CUDA options (at least I think so). Do you guys know if this is a bug, or
>whether something is wrong with how we are configuring OpenMPI for our
>cluster?
>
>Configure Line: ./configure
>--prefix=/home/projects/openmpi/1.7.2/gnu/4.7.2 --enable-shared --enable-
>static --disable-vt --with-cuda=/home/projects/cuda/5.5.11
>CC=`which gcc` CXX=`which g++` FC=`which gfortran`
>
>Running make V=1 gives:
>
>make[2]: Entering directory `/tmp/openmpi-1.7.2/ompi/tools/ompi_info'
>/bin/sh ../../../libtool  --tag=CC   --mode=link
>/home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>DOPAL_CONFIGURE_USER="\"\"" -
>DOPAL_CONFIGURE_HOST="\"k20-0007\""
>-DOPAL_CONFIGURE_DATE="\"Mon Oct  7 13:16:12 MDT 2013\""
>-DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\""
>-DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -
>DNDEBUG -finline-functions -fno-strict-aliasing -pthread\""
>-DOMPI_BUILD_CPPFLAGS="\"-I../../..
>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>-I/usr/include/infiniband -I/usr/include/infiniband -I/usr/include/infiniband -
>I/usr/include/infiniband -I/usr/include/infiniband\"" -
>DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -pthread\"" -
>DOMPI_BUILD_CXXCPPFLAGS="\"-I../../..  \""
>-DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\""
>-DOMPI_BUILD_LDFLAGS="\"-export-dynamic  \"" -DOMPI_BUILD_LIBS="\"-
>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE="\"\""
>-DOMPI_CXX_ABSOLUTE="\"none\"" -O3 -DNDEBUG -finline-functions
>-fno-strict-aliasing -pthread  -export-dynamic   -o ompi_info ompi_info.o
>param.o components.o version.o ../../../ompi/libmpi.la -lrt -lnsl  -lutil -lm
>libtool: link: /home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>DOPAL_CONFIGURE_USER=\"\" -
>DOPAL_CONFIGURE_HOST=\"k20-0007\"
>"-DOPAL_CONFIGURE_DATE=\"Mon Oct  7 13:16:12 MDT 2013\""
>-DOMPI_BUILD_USER=\"\" -DOMPI_BUILD_HOST=\"k20-0007\"
>"-DOMPI_BUILD_DATE=\"Mon Oct  7 13:26:23 MDT 2013\""
>"-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -fno-strict-
>aliasing -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../..
>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>-I/usr/include/infiniband -I/usr/include/infiniband -I/usr/include/infiniband -
>I/usr/include/infiniband -I/usr/include/infiniband\"" "-
>DOMPI_BUILD_CXXFLAGS=\"-O3 -DNDEBUG -finline-functions -pthread\"" "-
>DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \""
>-DOMPI_BUILD_FFLAGS=\"\" -DOMPI_BUILD_FCFLAGS=\"\"
>"-DOMPI_BUILD_LDFLAGS=\"-export-dynamic  \"" "-DOMPI_BUILD_LIBS=\"-
>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE=\"\" -
>DOMPI_CXX_ABSOLUTE=\"none\"
>-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -o
>.libs/ompi_info ompi_info.o param.o components.o version.o -Wl,--export-
>dynamic  ../../../ompi/.libs/libmpi.so -L/usr/lib64 -lrdmacm -losmcomp -
>libverbs /tmp/openmpi-1.7.2/orte/.libs/libopen-rte.so
>/tmp/openmpi-1.7.2/opal/.libs/libopen-pal.so -lcuda -lnuma -ldl -lrt -lnsl 
>-lutil -
>lm -pthread -Wl,-rpath -Wl,/home/projects/openmpi/1.7.2/gnu/4.7.2/lib
>../../../ompi/.libs/libmpi.so: undefined reference to
>`mca_pml_bfo_send_request_start_cuda'
>../../../ompi/.libs/libmpi.so: undefined reference to
>`mca_pml_bfo_cuda_need_buffers'
>collect2: error: ld returned 1 exit status
>
>
>
>Thanks.
>
>S.
>
>--
>Simon Hammond
>Scalable Computer Architectures (CSRI/146, 01422) Sandia National
>Laboratories, NM, USA
>
>
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] [EXTERNAL] Re: Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11

2013-10-07 Thread Rolf vandeVaart
Good.  This is fixed in Open MPI 1.7.3, by the way.  I will add a note to the FAQ on 
building Open MPI 1.7.2.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Hammond,
>Simon David (-EXP)
>Sent: Monday, October 07, 2013 4:17 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] [EXTERNAL] Re: Build Failing for OpenMPI 1.7.2 and
>CUDA 5.5.11
>
>Thanks Rolf, that seems to have made the code compile and make
>successfully.
>
>S.
>
>--
>Simon Hammond
>Scalable Computer Architectures (CSRI/146, 01422) Sandia National
>Laboratories, NM, USA
>
>
>
>
>
>
>On 10/7/13 1:47 PM, "Rolf vandeVaart"  wrote:
>
>>That might be a bug.  While I am checking, you could try configuring with
>>this additional flag:
>>
>>--enable-mca-no-build=pml-bfo
>>
>>Rolf
>>
>>>-Original Message-
>>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>Hammond,
>>>Simon David (-EXP)
>>>Sent: Monday, October 07, 2013 3:30 PM
>>>To: us...@open-mpi.org
>>>Subject: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11
>>>
>>>Hey everyone,
>>>
>>>I am trying to build OpenMPI 1.7.2 with CUDA enabled, OpenMPI will
>>>configure successfully but I am seeing a build error relating to the
>>>inclusion of
>>>the CUDA options (at least I think so). Do you guys know if this is a
>>>bug or
>>>whether something is wrong with how we are configuring OpenMPI for our
>>>cluster.
>>>
>>>Configure Line: ./configure
>>>--prefix=/home/projects/openmpi/1.7.2/gnu/4.7.2 --enable-shared --
>enable-
>>>static --disable-vt --with-cuda=/home/projects/cuda/5.5.11
>>>CC=`which gcc` CXX=`which g++` FC=`which gfortran`
>>>
>>>Running make V=1 gives:
>>>
>>>make[2]: Entering directory `/tmp/openmpi-1.7.2/ompi/tools/ompi_info'
>>>/bin/sh ../../../libtool  --tag=CC   --mode=link
>>>/home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>>>DOPAL_CONFIGURE_USER="\"\"" -
>>>DOPAL_CONFIGURE_HOST="\"k20-0007\""
>>>-DOPAL_CONFIGURE_DATE="\"Mon Oct  7 13:16:12 MDT 2013\""
>>>-DOMPI_BUILD_USER="\"$USER\"" -
>DOMPI_BUILD_HOST="\"`hostname`\""
>>>-DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -
>>>DNDEBUG -finline-functions -fno-strict-aliasing -pthread\""
>>>-DOMPI_BUILD_CPPFLAGS="\"-I../../..
>>>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>>>-I/usr/include/infiniband -I/usr/include/infiniband
>>>-I/usr/include/infiniband -
>>>I/usr/include/infiniband -I/usr/include/infiniband\"" -
>>>DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -
>pthread\"" -
>>>DOMPI_BUILD_CXXCPPFLAGS="\"-I../../..  \""
>>>-DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\""
>>>-DOMPI_BUILD_LDFLAGS="\"-export-dynamic  \"" -
>DOMPI_BUILD_LIBS="\"-
>>>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE="\"\""
>>>-DOMPI_CXX_ABSOLUTE="\"none\"" -O3 -DNDEBUG -finline-functions
>>>-fno-strict-aliasing -pthread  -export-dynamic   -o ompi_info ompi_info.o
>>>param.o components.o version.o ../../../ompi/libmpi.la -lrt -lnsl
>>>-lutil -lm
>>>libtool: link: /home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>>>DOPAL_CONFIGURE_USER=\"\" -
>>>DOPAL_CONFIGURE_HOST=\"k20-0007\"
>>>"-DOPAL_CONFIGURE_DATE=\"Mon Oct  7 13:16:12 MDT 2013\""
>>>-DOMPI_BUILD_USER=\"\" -DOMPI_BUILD_HOST=\"k20-
>0007\"
>>>"-DOMPI_BUILD_DATE=\"Mon Oct  7 13:26:23 MDT 2013\""
>>>"-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -fno-strict-
>>>aliasing -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../..
>>>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>>>-I/usr/include/infiniband -I/usr/include/infiniband
>>>-I/usr/include/infiniband -
>>>I/usr/include/infiniband -I/usr/include/infiniba

Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread Rolf vandeVaart
Let me try this out and see what happens for me.  But yes, please go ahead and 
send me the complete backtrace.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of KESTENER Pierre
Sent: Wednesday, October 30, 2013 11:34 AM
To: us...@open-mpi.org
Cc: KESTENER Pierre
Subject: [OMPI users] OpenMPI-1.7.3 - cuda support

Hello,

I'm having problems running a simple cuda-aware mpi application; the one found 
at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I have modified symbol ENV_LOCAL_RANK into OMPI_COMM_WORLD_LOCAL_RANK
My cluster has 2 K20m GPUs per node, with QLogic IB stack.

The normal CUDA/MPI application works fine,
but the CUDA-aware MPI app is crashing when using 2 MPI processes over the 2 GPUs of
the same node:
the error message is:
Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app when running on 2 GPUs on 2 different nodes give another error:
jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 
SP=7fffc06c21f8.  Backtrace:
/gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]


Can someone give me hints where to look to track this problem ?
Thank you.

Pierre Kestener.





Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread Rolf vandeVaart
The CUDA-aware support is only available when running with the verbs interface 
to Infiniband.  It does not work with the PSM interface which is being used in 
your installation.
To verify this, you need to disable the usage of PSM.  This can be done in a 
variety of ways, but try running like this:

mpirun -mca pml ob1 .

This will force the use of the verbs support layer (openib) with the CUDA-aware 
support.


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of KESTENER Pierre
Sent: Wednesday, October 30, 2013 12:07 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] OpenMPI-1.7.3 - cuda support

Dear Rolf,

thank for looking into this.
Here is the complete backtrace for execution using 2 GPUs on the same node:

(cuda-gdb) bt
#0  0x7711d885 in raise () from /lib64/libc.so.6
#1  0x7711f065 in abort () from /lib64/libc.so.6
#2  0x70387b8d in psmi_errhandler_psm (ep=,
err=PSM_INTERNAL_ERR, error_string=,
token=) at psm_error.c:76
#3  0x70387df1 in psmi_handle_error (ep=0xfffe,
error=PSM_INTERNAL_ERR, buf=) at psm_error.c:154
#4  0x70382f6a in psmi_am_mq_handler_rtsmatch (toki=0x7fffc6a0,
args=0x7fffed0461d0, narg=,
buf=, len=) at ptl.c:200
#5  0x7037a832 in process_packet (ptl=0x737818, pkt=0x7fffed0461c0,
isreq=) at am_reqrep_shmem.c:2164
#6  0x7037d90f in amsh_poll_internal_inner (ptl=0x737818, replyonly=0)
at am_reqrep_shmem.c:1756
#7  amsh_poll (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1810
#8  0x703a0329 in __psmi_poll_internal (ep=0x737538,
poll_amsh=) at psm.c:465
#9  0x7039f0af in psmi_mq_wait_inner (ireq=0x7fffc848)
at psm_mq.c:299
#10 psmi_mq_wait_internal (ireq=0x7fffc848) at psm_mq.c:334
#11 0x7037db21 in amsh_mq_send_inner (ptl=0x737818,
mq=, epaddr=0x6eb418, flags=,
tag=844424930131968, ubuf=0x130835, len=32768)
---Type  to continue, or q  to quit---
at am_reqrep_shmem.c:2339
#12 amsh_mq_send (ptl=0x737818, mq=, epaddr=0x6eb418,
flags=, tag=844424930131968, ubuf=0x130835,
len=32768) at am_reqrep_shmem.c:2387
#13 0x7039ed71 in __psm_mq_send (mq=,
dest=, flags=,
stag=, buf=,
len=) at psm_mq.c:413
#14 0x705c4ea8 in ompi_mtl_psm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_mtl_psm.so
#15 0x71eeddea in mca_pml_cm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_pml_cm.so
#16 0x779253da in PMPI_Sendrecv ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/libmpi.so.1
#17 0x004045ef in ExchangeHalos (cartComm=0x715460,
devSend=0x130835, hostSend=0x7b8710, hostRecv=0x7c0720,
devRecv=0x1308358000, neighbor=1, elemCount=4096) at CUDA_Aware_MPI.c:70
#18 0x004033d8 in TransferAllHalos (cartComm=0x715460,
domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90,
copyStream=0xaa4450, devBlocks=0x7fffcd30,
devSideEdges=0x7fffcd20, devHaloLines=0x7fffcd10,
hostSendLines=0x7fffcd00, hostRecvLines=0x7fffccf0) at Host.c:400
#19 0x0040363c in RunJacobi (cartComm=0x715460, rank=0, size=2,
---Type  to continue, or q  to quit---
domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90,
useFastSwap=0, devBlocks=0x7fffcd30, devSideEdges=0x7fffcd20,
devHaloLines=0x7fffcd10, hostSendLines=0x7fffcd00,
hostRecvLines=0x7fffccf0, devResidue=0x131048,
copyStream=0xaa4450, iterations=0x7fffcd44,
avgTransferTime=0x7fffcd48) at Host.c:466
#20 0x00401ccb in main (argc=4, argv=0x7fffcea8) at Jacobi.c:60
Pierre.




De : KESTENER Pierre
Date d'envoi : mercredi 30 octobre 2013 16:34
À : us...@open-mpi.org
Cc: KESTENER Pierre
Objet : OpenMPI-1.7.3 - cuda support
Hello,

I'm having problems running a simple cuda-aware mpi application; the one found 
at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I have modified symbol ENV_LOCAL_RANK into OMPI_COMM_WORLD_LOCAL_RANK
My cluster has 2 K20m GPUs per node, with QLogic IB stack.

The normal CUDA/MPI application works fine;
 but the cuda-ware mpi app is crashing when using 2 MPI proc over the 2 GPUs of 
the same node:
the error message is:
Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app when running on 2 GPUs on 2 different nodes give another error:
jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 
SP=7fffc06c21f8.  Backtrace:
/gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]


Can someone give me hints where to look to track this problem ?
Thank you.

Pierre Kestener.




Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Rolf vandeVaart
Thanks for the report.  CUDA-aware Open MPI does not currently support doing 
reduction operations on GPU memory.
Is this a feature you would be interested in?

Rolf
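In the meantime, one workaround is to stage the buffer through host memory around the reduction; a sketch with hypothetical names (error checking omitted):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Reduce a device buffer by copying it to the host first, since reductions
 * directly on GPU memory are not supported by this CUDA-aware build. */
void reduce_from_gpu(const double *d_send, double *d_recv, int count,
                     int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *h_send = malloc((size_t)count * sizeof(double));
    double *h_recv = malloc((size_t)count * sizeof(double));

    cudaMemcpy(h_send, d_send, (size_t)count * sizeof(double),
               cudaMemcpyDeviceToHost);
    MPI_Reduce(h_send, h_recv, count, MPI_DOUBLE, MPI_SUM, root, comm);
    if (rank == root)
        cudaMemcpy(d_recv, h_recv, (size_t)count * sizeof(double),
                   cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
}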

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>Sent: Friday, November 29, 2013 11:24 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>
>Hi users list,
>
>I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>
>Attached, you will find an example code file, to reproduce the bug.
>The point is that MPI_Reduce with normal CPU memory fully works but the
>use of GPU memory leads to a segfault. (GPU memory is used when defining
>USE_GPU).
>
>The segfault looks like this:
>
>[peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>Signal: Segmentation fault (11) [peak64g-36:25527] Signal code: Invalid
>permissions (2) [peak64g-36:25527] Failing at address: 0x600100200 [peak64g-
>36:25527] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>[0x7ff2abdb24a0]
>[peak64g-36:25527] [ 1]
>/data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>[0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>/data/zaspel/openmpi-
>1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_
>basic_linear+0x371)
>[0x7ff2a5987531]
>[peak64g-36:25527] [ 3]
>/data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>[0x7ff2ac499d55]
>[peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction() [0x400ca0]
>[peak64g-36:25527] [ 5]
>/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7ff2abd9d76d]
>[peak64g-36:25527] [ 6] /home/zaspel/testMPI/test_reduction() [0x400af9]
>[peak64g-36:25527] *** End of error message ***
>--
>mpirun noticed that process rank 0 with PID 25527 on node peak64g-36 exited
>on signal 11 (Segmentation fault).
>--
>
>Best regards,
>
>Peter


Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Rolf vandeVaart
Hi Peter:
The reason behind not having the reduction support (I believe) was just the 
complexity of adding it to the code.  I will at least submit a ticket so we can 
look at it again.

Here is a link to the FAQ, which lists the APIs that are CUDA-aware:
http://www.open-mpi.org/faq/?category=running#mpi-cuda-support

Regards,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>Sent: Monday, December 02, 2013 8:29 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>
>* PGP Signed by an unknown key
>
>Hi Rolf,
>
>OK, I didn't know that. Sorry.
>
>Yes, it would be a pretty important feature in cases when you are doing
>reduction operations on many, many entries in parallel. Therefore, each
>reduction is not very complex or time-consuming, but potentially hundreds of
>thousands of reductions are done at the same time. This is definitely a point
>where a CUDA-aware implementation can give some performance
>improvements.
>
>I'm curious: Rather complex operations like allgatherv are CUDA-aware, but a
>reduction is not. Is there a reasoning for this? Is there some documentation,
>which MPI calls are CUDA-aware and which not?
>
>Best regards
>
>Peter
>
>
>
>On 12/02/2013 02:18 PM, Rolf vandeVaart wrote:
>> Thanks for the report.  CUDA-aware Open MPI does not currently support
>doing reduction operations on GPU memory.
>> Is this a feature you would be interested in?
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter
>>> Zaspel
>>> Sent: Friday, November 29, 2013 11:24 AM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>>>
>>> Hi users list,
>>>
>>> I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>>> implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>>>
>>> Attached, you will find an example code file, to reproduce the bug.
>>> The point is that MPI_Reduce with normal CPU memory fully works but
>>> the use of GPU memory leads to a segfault. (GPU memory is used when
>>> defining USE_GPU).
>>>
>>> The segfault looks like this:
>>>
>>> [peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>>> Signal: Segmentation fault (11) [peak64g-36:25527] Signal code:
>>> Invalid permissions (2) [peak64g-36:25527] Failing at address:
>>> 0x600100200 [peak64g- 36:25527] [ 0]
>>> /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>>> [0x7ff2abdb24a0]
>>> [peak64g-36:25527] [ 1]
>>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>>> [0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>>> /data/zaspel/openmpi-
>>>
>1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intr
>>> a_
>>> basic_linear+0x371)
>>> [0x7ff2a5987531]
>>> [peak64g-36:25527] [ 3]
>>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>>> [0x7ff2ac499d55]
>>> [peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction()
>>> [0x400ca0] [peak64g-36:25527] [ 5]
>>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
>>> [0x7ff2abd9d76d] [peak64g-36:25527] [ 6]
>>> /home/zaspel/testMPI/test_reduction() [0x400af9] [peak64g-36:25527]
>>> *** End of error message ***
>>> -
>>> - mpirun noticed that process rank 0 with PID 25527 on node
>>> peak64g-36 exited on signal 11 (Segmentation fault).
>>> -
>>> -
>>>
>>> Best regards,
>>>
>>> Peter
>> --
>> - This email message is for the sole use of the intended
>> recipient(s) and may contain confidential information.  Any
>> unauthorized review, use, disclosure or distribution is prohibited.
>> If you are not the intended recipient, please contact the sender by
>> reply email and destroy all copies of the original message.
>> --
>> - ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>--
>Dipl.-Inform. Peter Zaspel
>Institut fuer Numerische Simulation, Universitaet Bonn Wegelerstr.6, 53115
>Bonn, Germany
>tel: +49 228 73-2748   mailto:zas...@ins.uni-bonn.de
>fax: +49 228 73-7527   http://wissrech.ins.uni-bonn.de/people/zaspel.html
>
>* Unknown Key
>* 0x8611E59B(L)
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Cuda Aware MPI Problem

2013-12-13 Thread Rolf vandeVaart
Yes, this was a bug with Open MPI 1.7.3.  I could not reproduce it, but it was 
definitely an issue in certain configurations.
Here was the fix.   https://svn.open-mpi.org/trac/ompi/changeset/29762

We fixed it in Open MPI 1.7.4 and the trunk version, so as you have seen, they 
do not have the problem.

Rolf


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Özgür Pekçagliyan
Sent: Friday, December 13, 2013 8:03 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Cuda Aware MPI Problem

Hello again,

I have compiled openmpi-1.9a1r29873 from the nightly build trunk and so far 
everything looks alright. But I have not tested the CUDA support yet.

On Fri, Dec 13, 2013 at 2:38 PM, Özgür Pekçağlıyan <ozgur.pekcagli...@gmail.com> wrote:
Hello,

I am having difficulties compiling OpenMPI with CUDA support. I have 
followed this (http://www.open-mpi.org/faq/?category=building#build-cuda) FAQ 
entry, as below:

$ cd openmpi-1.7.3/
$ ./configure --with-cuda=/urs/local/cuda-5.5
$ make all install

everything goes perfectly during compilation. But when I try to execute the simplest 
MPI hello world application, I get the following error:

$ mpicc hello.c -o hello
$ mpirun -np 2 hello

hello: symbol lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined 
symbol: progress_one_cuda_htod_event
hello: symbol lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined 
symbol: progress_one_cuda_htod_event
--
mpirun has exited due to process rank 0 with PID 30329 on
node cudalab1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--

$ mpirun -np 1 hello

hello: symbol lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined 
symbol: progress_one_cuda_htod_event
--
mpirun has exited due to process rank 0 with PID 30327 on
node cudalab1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--


Any suggestions?
I have two PCs with Intel I3 CPUs and Geforce GTX 480 GPUs.


And here is the hello.c file;
#include <stdio.h>
#include <mpi.h>


int main (int argc, char **argv)
{
  int rank, size;

  MPI_Init (&argc, &argv); /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}



--
Özgür Pekçağlıyan
B.Sc. in Computer Engineering
M.Sc. in Computer Engineering



--
Özgür Pekçağlıyan
B.Sc. in Computer Engineering
M.Sc. in Computer Engineering


Re: [OMPI users] Configure issue with/without HWLOC when PGI used and CUDA support enabled

2014-02-14 Thread Rolf vandeVaart
I assume your first issue is happening because you configured hwloc with CUDA 
support, which creates a dependency on libcudart.so.  Not sure why that would 
mess up Open MPI.  Can you send me how you configured hwloc?

I am not sure I understand the second issue.  Open MPI puts everything in lib 
even though you may be building for 64 bits.  So all of these are fine.
  -I/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC/lib 
 -Wl,-rpath 
-Wl,/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC/lib 
 -L/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC/lib

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Friday, February 14, 2014 9:44 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Configure issue with/without HWLOC when PGI used
>and CUDA support enabled
>
>Dear Open MPI developers,
>
>I just want to point to a weird behavior of the configure procedure I
>discovered. I am compiling Open MPI 1.7.4 with CUDA support (CUDA 6.0 RC)
>and PGI 14.1
>
>If I explicitly compile against a self-compiled version of HWLOC (1.8.1) using
>this configure line ../configure CC=pgcc CXX=pgCC FC=pgf90 F90=pgf90 --
>prefix=/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC  --
>enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-
>mxm=$MXM_DIR --with-knem=$KNEM_DIR --with-hwloc=/usr/local/Cluster-
>Users/fs395/hwlock-1.8.1/gcc-4.4.7_cuda-6.0RC --with-
>slurm=/usr/local/Cluster-Apps/slurm  --with-cuda=/usr/local/Cluster-
>Users/fs395/cuda/6.0-RC
>
>make fails telling me that it cannot find "-lcudart".
>
>
>If I compile without HWLOC using this configure line:
>../configure CC=pgcc CXX=pgCC FC=pgf90 F90=pgf90 --
>prefix=/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC  --
>enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-
>mxm=$MXM_DIR --with-knem=$KNEM_DIR  --with-slurm=/usr/local/Cluster-
>Apps/slurm  --with-cuda=/usr/local/Cluster-Users/fs395/cuda/6.0-RC
>
>make succeeds and I have Open MPI compiled properly.
>
>$ mpif90 -show
>pgf90 -I/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-
>6.0RC/include -I/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-
>6.0RC/lib -Wl,-rpath -Wl,/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-
>14.1_cuda-6.0RC/lib -L/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-
>14.1_cuda-6.0RC/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -
>lmpi $ ompi_info --all | grep btl_openib_have_cuda_gdr
> MCA btl: informational "btl_openib_have_cuda_gdr" (current 
> value:
>"true", data source: default, level: 5 tuner/detail, type: bool)
>
>I wonder why the configure picks up lib instead of lib64. I will test the build
>using real codes.
>
>Cheers,
>Filippo
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>
> ~ David Hilbert
>
>*
>Disclaimer: "Please note this message and any attachments are
>CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>The contents are not to be disclosed to anyone other than the addressee.
>Unauthorized recipients are requested to preserve this confidentiality and to
>advise the sender immediately of any error in transmission."
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
Can you try running with --mca coll ^ml and see if things work? 

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:14 PM
>To: Open MPI Users
>Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited
>with error."
>
>Dear Open MPI developers,
>
>I hit an unexpected error running the OSU osu_alltoall benchmark using Open MPI
>1.7.5rc1. Here is the error:
>
>$ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall In
>bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
>In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory
>failed
>[tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mc
>a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>
>[tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and mpool
>was not successfully setup!
>[tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mc
>a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>
>[tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and mpool
>was not successfully setup!
># OSU MPI All-to-All Personalized Exchange Latency Test v4.2
># Size   Avg Latency(us)
>--
>mpirun noticed that process rank 3 with PID 4508 on node tesla51 exited on
>signal 11 (Segmentation fault).
>--
>2 total processes killed (some possibly by mpirun during cleanup)
>
>Any idea where this comes from?
>
>I compiled Open MPI using Intel 12.1, latest Mellanox stack and CUDA 6.0RC.
>Attached outputs grabbed from configure, make and run. The configure was
>
>export MXM_DIR=/opt/mellanox/mxm
>export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*" -print0)
>export FCA_DIR=/opt/mellanox/fca export HCOLL_DIR=/opt/mellanox/hcoll
>
>../configure CC=icc CXX=icpc F77=ifort FC=ifort FFLAGS="-xSSE4.2 -axAVX -ip -
>O3 -fno-fnalias" FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" --prefix=<...>
>--enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-
>mxm=$MXM_DIR --with-knem=$KNEM_DIR  --with-
>cuda=$CUDA_INSTALL_PATH --enable-mpi-thread-multiple --with-
>hwloc=internal --with-verbs 2>&1 | tee config.out
>
>
>Thanks in advance,
>Regards
>
>Filippo
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>
> ~ David Hilbert
>
>*
>Disclaimer: "Please note this message and any attachments are
>CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>The contents are not to be disclosed to anyone other than the addressee.
>Unauthorized recipients are requested to preserve this confidentiality and to
>advise the sender immediately of any error in transmission."



Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
There is something going wrong with the ml collective component, so if you 
disable it, things work.
I just reconfigured without any CUDA-aware support and I see the same failure, 
so it has nothing to do with CUDA.

Looks like Jeff Squyres has just filed a bug for it.

https://svn.open-mpi.org/trac/ompi/ticket/4331



>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:32 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy
>exited with error."
>
>Dear Rolf,
>
>your suggestion works!
>
>$ mpirun -np 4 --map-by ppr:1:socket -bind-to core  --mca coll ^ml osu_alltoall
># OSU MPI All-to-All Personalized Exchange Latency Test v4.2
># Size   Avg Latency(us)
>1                  8.02
>2                  2.96
>4                  2.91
>8                  2.91
>16                 2.96
>32                 3.07
>64                 3.25
>128                3.74
>256                3.85
>512                4.11
>1024               4.79
>2048               5.91
>4096              15.84
>8192              24.88
>16384             35.35
>32768             56.20
>65536             66.88
>131072           114.89
>262144           209.36
>524288           396.12
>1048576          765.65
>
>
>Can you clarify exactly where the problem comes from?
>
>Regards,
>Filippo
>
>
>On Mar 4, 2014, at 12:17 AM, Rolf vandeVaart 
>wrote:
>> Can you try running with --mca coll ^ml and see if things work?
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo
>>> Spiga
>>> Sent: Monday, March 03, 2014 7:14 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy
>>> exited with error."
>>>
>>> Dear Open MPI developers,
>>>
>>> I hit an expected error running OSU osu_alltoall benchmark using Open
>>> MPI 1.7.5rc1. Here the error:
>>>
>>> $ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall In
>>> bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory
>failed In
>>> bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory
>>> failed
>>> [tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.
>>> c:2996:mc a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited
>>> with error.
>>>
>>> [tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and
>>> mpool was not successfully setup!
>>> [tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.
>>> c:2996:mc a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited
>>> with error.
>>>
>>> [tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and
>>> mpool was not successfully setup!
>>> # OSU MPI All-to-All Personalized Exchange Latency Test v4.2
>>> # Size   Avg Latency(us)
>>> -
>>> - mpirun noticed that process rank 3 with PID 4508 on node
>>> tesla51 exited on signal 11 (Segmentation fault).
>>> -
>>> -
>>> 2 total processes killed (some possibly by mpirun during cleanup)
>>>
>>> Any idea where this come from?
>>>
>>> I compiled Open MPI using Intel 12.1, latest Mellanox stack and CUDA
>6.0RC.
>>> Attached outputs grabbed from configure, make and run. The configure
>>> was
>>>
>>> export MXM_DIR=/opt/mellanox/mxm
>>> export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*"
>>> -print0) export FCA_DIR=/opt/mellanox/fca export
>>> HCOLL_DIR=/opt/mellanox/hcoll
>>>
>>> ../configure CC=icc CXX=icpc F77=ifort FC=ifort FFLAGS="-xSSE4.2
>>> -axAVX -ip -
>>> O3 -fno-fnalias" FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias"
>>> --prefix=<...> --enable-mpirun-prefix-by-default --with-fca=$FCA_DIR
>>> --with- mxm=$MXM_DIR --with-knem=$KNEM_DIR  --with-
>>> cuda=$CUDA_INSTALL_PATH --enable-mpi-thread-multiple --with-
>>> hwloc=internal --with-verbs 2>&1 | tee config.out
>>>
>>>
>>> Thanks in advance,
>>> Regards
>>>
>>> Filippo
>>>
>>> --
>>> Mr. Filippo SPIGA, M.Sc.
>

Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI

2014-05-27 Thread Rolf vandeVaart
Answers inline...
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Friday, May 23, 2014 4:31 PM
>To: Open MPI Users
>Subject: [OMPI users] Advices for parameter tuning for CUDA-aware MPI
>
>Hi,
>I am currently configuring a GPU cluster. The cluster has 8 K20 GPUs per node
>on two sockets, 4 PCIe bus (2 K20 per bus, 4 K20 per socket), with a single QDR
>InfiniBand card on each node. We have the latest NVidia drivers and Cuda 6.0.
>
>I am wondering if someone could tell me if all the default MCA parameters are
>optimal for cuda. More precisely, I am interrested about GDR and IPC. It
>seems from the parameters (see below) that they are both available
>(although GDR is disabled by default). However, my notes from
>GTC14 mention the btl_openib_have_driver_gdr parameter, which I do not
>see at all.
>
>So, I guess, my questions :
>1) Why is GDR disabled by default when available ?
It was disabled by default because it did not always give optimum performance.  
That may change in the future but for now, as you mentioned, you have to turn 
on the feature explicitly.

>2) Is the absence of btl_openib_have_driver_gdr an indicator of something
>missing ?
Yes, that means that somehow the GPU Direct RDMA is not installed correctly. 
All that check does is make sure that the file 
/sys/kernel/mm/memory_peers/nv_mem/version exists.  Does that exist?

>3) Are the default parameters, especially the rdma limits and such, optimal for
>our configuration ?
That is hard to say.  GPU Direct RDMA does not work well when the GPU and IB 
card are not "close" on the system. Can you run "nvidia-smi topo -m" on your 
system? 

>4) Do I want to enable or disable IPC by default (my notes state that bandwith
>is much better with MPS than IPC).
Yes, you should leave IPC enabled by default.  That should give good 
performance.  There were some issues with earlier CUDA versions, but they were 
all fixed in CUDA 6.
>
>Thanks,
>
>Here is what I get from
>ompi_info --all | grep cuda
>
>[mboisson@login-gpu01 ~]$ ompi_info --all | grep cuda [login-
>gpu01.calculquebec.ca:11486] mca: base: components_register:
>registering filem components
>[login-gpu01.calculquebec.ca:11486] mca: base: components_register:
>found loaded component raw
>[login-gpu01.calculquebec.ca:11486] mca: base: components_register:
>component raw register function successful [login-
>gpu01.calculquebec.ca:11486] mca: base: components_register:
>registering snapc components
>   Prefix: /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37
>  Exec_prefix: /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37
>   Bindir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/bin
>  Sbindir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/sbin
>   Libdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib
>   Incdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/include
>   Mandir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/man
>Pkglibdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/openmpi
>   Libexecdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/libexec
>  Datarootdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share
>  Datadir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share
>   Sysconfdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/etc
>   Sharedstatedir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/com
>Localstatedir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/var
>  Infodir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/info
>   Pkgdatadir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/openmpi
>Pkglibdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/openmpi
>Pkgincludedir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/include/openmpi
>  MCA mca: parameter "mca_param_files" (current value:
>"/home/mboisson/.openmpi/mca-params.conf:/software-
>gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/etc/openmpi-mca-params.conf",
>data source: default, level: 2 user/detail, type: string, deprecated, synonym
>of: mca_base_param_files)
>  MCA mca: parameter "mca_component_path" (current
>value:
>"/software-
>gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/openmpi:/home/mboisson/.o
>penmpi/components",
>data source: default, level: 9 dev/all, type: string, deprecated, synonym of:
>mca_base_component_path)
>  MCA mca: parameter "mca_base_param_files" (current
>value:
>"/home/mboisson/.openmpi/mca-params.conf:/software-
>gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/etc/openmpi-mca-params.conf",
>data source: default, level: 2 user/detail, type: string, synonyms:
>mca_param_files)
>  MCA mca: informational "mca_base_overrid

Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI

2014-05-27 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Tuesday, May 27, 2014 4:07 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI
>
>Answers inline too.
>>> 2) Is the absence of btl_openib_have_driver_gdr an indicator of
>>> something missing ?
>> Yes, that means that somehow the GPU Direct RDMA is not installed
>correctly. All that check does is make sure that the file
>/sys/kernel/mm/memory_peers/nv_mem/version exists.  Does that exist?
>>
>It does not. There is no
>/sys/kernel/mm/memory_peers/
>
>>> 3) Are the default parameters, especially the rdma limits and such,
>>> optimal for our configuration ?
>> That is hard to say.  GPU Direct RDMA does not work well when the GPU
>and IB card are not "close" on the system. Can you run "nvidia-smi topo -m"
>on your system?
>nvidia-smi topo -m
>gives me the error
>[mboisson@login-gpu01 ~]$ nvidia-smi topo -m Invalid combination of input
>arguments. Please run 'nvidia-smi -h' for help.
Sorry, my mistake.  That may be a future feature.

>
>I could not find anything related to topology in the help. However, I can tell
>you the following which I believe to be true
>- GPU0 and GPU1 are on PCIe bus 0, socket 0
>- GPU2 and GPU3 are on PCIe bus 1, socket 0
>- GPU4 and GPU5 are on PCIe bus 2, socket 1
>- GPU6 and GPU7 are on PCIe bus 3, socket 1
>
>There is one IB card which I believe is on socket 0.
>
>
>I know that we do not have the Mellanox Ofed. We use the Linux RDMA from
>CentOS 6.5. However, should that completely disable GDR within a single
>node ? i.e. does GDR _have_ to go through IB ? I would assume that our lack
>of Mellanox OFED would result in no-GDR inter-node, but GDR intra-node.

Without Mellanox OFED, then GPU Direct RDMA is unavailable.  However, the term 
GPU Direct is a somewhat overloaded term and I think that is where I was 
getting confused.  GPU Direct (also known as CUDA IPC) will work between GPUs 
that do not cross a QPI connection.  That means that I believe GPU0,1,2,3 
should be able to use GPU Direct between them and GPU4,5,6,7 can also between 
them.   In this case, this means that GPU memory does not need to get staged 
through host memory for transferring between the GPUs.  With Open MPI, there is 
a mca parameter you can set that will allow you to see whether GPU Direct is 
being used between the GPUs.

--mca btl_smcuda_cuda_ipc_verbose 100
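
For example, a sketch of a full command line (the application name is a placeholder):

mpirun -np 8 --mca btl_smcuda_cuda_ipc_verbose 100 ./my_cuda_app

The verbose output should indicate, for each pair of ranks on a node, whether CUDA IPC 
was enabled between them.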

 Rolf



Re: [OMPI users] deprecated cuptiActivityEnqueueBuffer

2014-06-16 Thread Rolf vandeVaart
Do you need the VampirTrace support in your build?  If not, you could add this to 
configure.
--disable-vt
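
For example, a sketch of a configure line with VampirTrace disabled (the prefix and CUDA 
path are placeholders):

./configure --prefix=/opt/openmpi --with-cuda=/usr/local/cuda --disable-vt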
  
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>jcabe...@computacion.cs.cinvestav.mx
>Sent: Monday, June 16, 2014 1:40 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] deprecated cuptiActivityEnqueueBuffer
>
>Hi all:
>
>I'm having trouble to compile OMPI from trunk svn with the new 6.0 nvidia
>SDK because deprecated cuptiActivityEnqueueBuffer
>
>this is the problem:
>
>  CC   libvt_la-vt_cupti_activity.lo
>  CC   libvt_la-vt_iowrap_helper.lo
>  CC   libvt_la-vt_libwrap.lo
>  CC   libvt_la-vt_mallocwrap.lo
>vt_cupti_activity.c: In function 'vt_cuptiact_queueNewBuffer':
>vt_cupti_activity.c:203:3: error: implicit declaration of function
>'cuptiActivityEnqueueBuffer' [-Werror=implicit-function-declaration]
>   VT_CUPTI_CALL(cuptiActivityEnqueueBuffer(cuCtx, 0,
>ALIGN_BUFFER(buffer, 8),
>
>Does any body known any patch?
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2014/06/24652.php


Re: [OMPI users] Help with multirail configuration

2014-07-21 Thread Rolf vandeVaart
With Open MPI 1.8.1, the library will use the NIC that is "closest" to the CPU. 
There was a bug in earlier versions of Open MPI 1.8 so that did not happen.  
You can see this by running with some verbosity using the "btl_base_verbose" 
flag.  For example, this is what I observed on a two node cluster with two NICs 
on each node.

[rvandevaart@ivy0] $ mpirun --mca btl_base_verbose 1 -host ivy0,ivy1 -np 4 
--mca pml ob1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Alltoall_c
[ivy0.nvidia.com:28896] [rank=0] openib: using device mlx5_0
[ivy0.nvidia.com:28896] [rank=0] openib: skipping device mlx5_1; it is too far 
away
[ivy0.nvidia.com:28897] [rank=1] openib: using device mlx5_1
[ivy0.nvidia.com:28897] [rank=1] openib: skipping device mlx5_0; it is too far 
away
[ivy1.nvidia.com:04652] [rank=2] openib: using device mlx5_0
[ivy1.nvidia.com:04652] [rank=2] openib: skipping device mlx5_1; it is too far 
away
[ivy1.nvidia.com:04653] [rank=3] openib: using device mlx5_1
[ivy1.nvidia.com:04653] [rank=3] openib: skipping device mlx5_0; it is too far 
away

So, maybe the right thing is happening by default?  Or are you looking for more 
fine-grained control?

Rolf

From: users [users-boun...@open-mpi.org] On Behalf Of Tobias Kloeffel 
[tobias.kloef...@fau.de]
Sent: Sunday, July 20, 2014 12:33 PM
To: Open MPI Users
Subject: Re: [OMPI users] Help with multirail configuration

I found no option in 1.6.5 and 1.8.1...


Am 7/20/2014 6:29 PM, schrieb Ralph Castain:
> What version of OMPI are you talking about?
>
> On Jul 20, 2014, at 9:11 AM, Tobias Kloeffel  wrote:
>
>> Hello everyone,
>>
>> I am trying to get the maximum performance out of my two node testing setup. 
>> Each node consists of 4 Sandy Bridge CPUs and each CPU has one directly 
>> attached Mellanox QDR card. Both nodes are connected via a 8-port Mellanox 
>> switch.
>> So far I found no option that allows binding mpi-ranks to a specific card, 
>> as it is available in MVAPICH2. Is there a way to change the round robin 
>> behavior of openMPI?
>> Maybe something like "btl_tcp_if_seq" that I have missed?
>>
>>
>> Kind regards,
>> Tobias
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/07/24822.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24825.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/07/24827.php


Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Rolf vandeVaart
Just to help reduce the scope of the problem, can you retest with a non 
CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the 
configure line to help with the stack trace?
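
A sketch of such a configure line (the install prefix is a placeholder):

./configure --prefix=$HOME/openmpi-1.8.1-debug-nocuda --enable-debug
make -j8 install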


>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Monday, August 18, 2014 4:23 PM
>To: Open MPI Users
>Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
>derailed into two problems, one of which has been addressed, I figured I
>would start a new, more precise and simple one.
>
>I reduced the code to the minimal that would reproduce the bug. I have
>pasted it here :
>http://pastebin.com/1uAK4Z8R
>Basically, it is a program that initializes MPI and cudaMalloc memory, and then
>free memory and finalize MPI. Nothing else.
>
>When I compile and run this on a single node, everything works fine.
>
>When I compile and run this on more than one node, I get the following stack
>trace :
>[gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal:
>Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped
>(1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0]
>/lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>[gpu-k20-07:40041] [ 1]
>/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>[gpu-k20-07:40041] [ 2]
>/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>[gpu-k20-07:40041] [ 3]
>/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>[gpu-k20-07:40041] [ 4]
>/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>[gpu-k20-07:40041] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>[gpu-k20-07:40041] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>[gpu-k20-07:40041] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>[gpu-k20-07:40041] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10]
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>[gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] ***
>End of error message ***
>
>
>The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or
>OpenMPI 1.8.1 (cuda aware).
>
>I know this is more than likely a problem with Cuda than with OpenMPI (since
>it does the same for two different versions), but I figured I would ask here if
>somebody has a clue of what might be going on. I have yet to be able to fill a
>bug report on NVidia's website for Cuda.
>
>
>Thanks,
>
>
>--
>-
>Maxime Boissonneault
>Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2014/08/25064.php


Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Rolf vandeVaart
Hi:
This problem does not appear to have anything to do with MPI. We are getting a 
SEGV during the initial call into the CUDA driver.  Can you log on to 
gpu-k20-08, compile your simple program without MPI, and run it there?  Also, 
maybe run dmesg on gpu-k20-08 and see if there is anything in the log?  

Also, does your program run if you just run it on gpu-k20-07?  

Can you include the output from nvidia-smi on each node?
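
A sketch of those checks on gpu-k20-08 (the source file name is a placeholder for your 
pasted test program with the MPI calls removed):

nvcc -o cuda_only cuda_only.cu
./cuda_only
dmesg | tail -n 50
nvidia-smi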

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Tuesday, August 19, 2014 8:55 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me
>much more information.
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>   Prefix:
>/software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>   Internal debug support: yes
>Memory debugging support: no
>
>
>Is there something I need to do at run time to get more information out
>of it ?
>
>
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by
>ppr:1:node
>cudampi_simple
>[gpu-k20-08:46045] *** Process received signal ***
>[gpu-k20-08:46045] Signal: Segmentation fault (11)
>[gpu-k20-08:46045] Signal code: Address not mapped (1)
>[gpu-k20-08:46045] Failing at address: 0x8
>[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>[gpu-k20-08:46045] [ 1]
>/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>[gpu-k20-08:46045] [ 2]
>/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>[gpu-k20-08:46045] [ 3]
>/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>[gpu-k20-08:46045] [ 4]
>/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>[gpu-k20-08:46045] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
>[gpu-k20-08:46045] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
>[gpu-k20-08:46045] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
>[gpu-k20-08:46045] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
>[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
>[gpu-k20-07:61816] Signal: Segmentation fault (11)
>[gpu-k20-07:61816] Signal code: Address not mapped (1)
>[gpu-k20-07:61816] Failing at address: 0x8
>[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>[gpu-k20-07:61816] [ 1]
>/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>[gpu-k20-07:61816] [ 2]
>/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>[gpu-k20-07:61816] [ 3]
>/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>[gpu-k20-07:61816] [ 4]
>/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>[gpu-k20-07:61816] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
>[gpu-k20-07:61816] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
>[gpu-k20-07:61816] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
>[gpu-k20-07:61816] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647
>]
>[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-07:61816] [10]
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
>[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
>[gpu-k20-07:61816] *** End of error message ***
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
>[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
>[gpu-k20-08:46045] *** End of error message ***
>--
>mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08
>exited on signal 11 (Segmentation fault).
>--
>
>
>Thanks,
>
>Maxime
>
>
>Le 2014-08-18 16:45, Rolf vandeVaart a écrit :
>> Just to help reduce the scope of the problem, can you retest with a non
>CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the
>configure line to help with the stack trace?
>>
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>>> Boissonneault
>>> Sent: Monday, August 18, 2014 4:23 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
>>> derailed into two problems, one o

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Rolf vandeVaart
Glad it was solved.  I will submit a bug at NVIDIA as that does not seem like a 
very friendly way to handle that error.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Tuesday, August 19, 2014 10:39 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>I believe I found what the problem was. My script set the
>CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the
>GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had
>CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>
>instead of
>CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
>
>Sorry for the false bug and thanks for directing me toward the solution.
>
>Maxime
>
>
>Le 2014-08-19 09:15, Rolf vandeVaart a écrit :
>> Hi:
>> This problem does not appear to have anything to do with MPI. We are
>getting a SEGV during the initial call into the CUDA driver.  Can you log on to
>gpu-k20-08, compile your simple program without MPI, and run it there?  Also,
>maybe run dmesg on gpu-k20-08 and see if there is anything in the log?
>>
>> Also, does your program run if you just run it on gpu-k20-07?
>>
>> Can you include the output from nvidia-smi on each node?
>>
>> Thanks,
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>>> Boissonneault
>>> Sent: Tuesday, August 19, 2014 8:55 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not
>>> give me much more information.
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>>>Prefix:
>>> /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>>>Internal debug support: yes
>>> Memory debugging support: no
>>>
>>>
>>> Is there something I need to do at run time to get more information
>>> out of it ?
>>>
>>>
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by
>>> ppr:1:node cudampi_simple [gpu-k20-08:46045] *** Process received
>>> signal *** [gpu-k20-08:46045] Signal: Segmentation fault (11)
>>> [gpu-k20-08:46045] Signal code: Address not mapped (1)
>>> [gpu-k20-08:46045] Failing at address: 0x8 [gpu-k20-08:46045] [ 0]
>>> /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>>> [gpu-k20-08:46045] [ 1]
>>> /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>>> [gpu-k20-08:46045] [ 2]
>>> /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>>> [gpu-k20-08:46045] [ 3]
>>> /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>>> [gpu-k20-08:46045] [ 4]
>>> /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>>> [gpu-k20-08:46045] [ 5]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df
>>> 4965]
>>> [gpu-k20-08:46045] [ 6]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df
>>> 4a0a]
>>> [gpu-k20-08:46045] [ 7]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df
>>> 4a3b]
>>> [gpu-k20-08:46045] [ 8]
>>> /software-
>>> gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0
>>> f647] [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>>> [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received
>>> signal *** [gpu-k20-07:61816] Signal: Segmentation fault (11)
>>> [gpu-k20-07:61816] Signal code: Address not mapped (1)
>>> [gpu-k20-07:61816] Failing at address: 0x8 [gpu-k20-07:61816] [ 0]
>>> /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>>> [gpu-k20-07:61816] [ 1]
>>> /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>>> [gpu-k20-07:61816] [ 2]
>>> /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>>> [gpu-k20-07:61816] [ 3]
>>> /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>>> [gpu-k20-07:61816] [ 4]
>>> /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>>> [gpu-k20-07:61816] [ 5]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b
>>> 6965]
>>> [gpu-k20-07:61816] [ 6]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b
>>> 6a0a]
>>> [gpu-k20-07:61816] [ 7]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2

Re: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

2014-08-26 Thread Rolf vandeVaart
Hi Christoph:
I will try and reproduce this issue and will let you know what I find.  There 
may be an issue with CUDA IPC support with certain traffic patterns.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

Hey all,

to test the performance of my application I duplicated the call to the function 
that will issue the computation on two GPUs 5 times. During the 4th and 5th run 
of the algorithm, however, the algorithm yields different results (9 instead of 
20):

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 820 9
121.* 1000 820 9

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both compiled with 
cuda-awareness. The CUDA Toolkit version is 6.0.
Both GPUs are under the control of one single CPU, so that CUDA IPC can be used 
(because no QPI link has to be traversed).
Running the application with "mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose 
100", shows that IPC is used.

I tracked my problem down to an MPI_Allgather, which seems not to work since 
the first GPU  identifies 9 clusters, the second GPU identifies 11 clusters 
(20 clusters in total). Debugging the application shows that all clusters 
are identified correctly; however, the exchange of the identified clusters 
seems not to work: each MPI process stores its identified clusters in a 
buffer that both processes exchange using MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
d_dec, columns, MPI_DOUBLE, communicator);

I later discovered that if I introduce a temporary host buffer that will 
receive the results of both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
MPI_Allgather( d_dec+columns*comm.rank(), columns, MPI_DOUBLE,
h_dec, columns, MPI_DOUBLE, communicator);
dec = h_dec; //copy results back from host to device

This led me to the conclusion that something with OMPI's CUDA IPC seems to 
cause the problems (synchronisation and/or fail-silent error), and indeed, 
disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 
-np 2 ./double_test ../data/similarities2.double.-300 
ex.2.double.2.gpus 1000 1000 0.9

will calculate correct results:

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20

Surprisingly, the wrong results _always_ occur during the 4th and 5th run. Is 
there a way to force synchronisation (I tried MPI_Barrier() without success)? 
Has anybody discovered similar problems?

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph



Re: [OMPI users] enable-cuda with disable-dlopen

2014-09-05 Thread Rolf vandeVaart
Yes, I have reproduced.  And I agree with your thoughts on configuring vs 
runtime error.  I will look into this.
Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Brock Palen
>Sent: Friday, September 05, 2014 5:22 PM
>To: Open MPI Users
>Subject: [OMPI users] enable-cuda with disable-dlopen
>
>* PGP Signed by an unknown key
>
>We found with 1.8.[1,2]  that if you compile with
>
>--with-mxm
>--with-cuda
>--disable-dlopen
>
>OMPI will compile, install and run, but if you run disabling mxm (say to debug
>something)
>
>mpirun --mca mtl ^mxm ./a.out
>
>You will get a notice saying that you cannot have enable cuda with disable
>dlopen, and the job dies.
>
>Is there any reason this shouldn't be a configure time error?  Rather than a
>runtime error?
>
>Note we have moved to --disable-mca-dso to get the result we want (less
>iops to the server hosting the install)
>
>Our full configure was:
>
>PREFIX=/home/software/rhel6/openmpi-1.8.2/gcc-4.8.0
>MXM=/opt/mellanox/mxm
>FCA=/opt/mellanox/fca
>KNEM=/opt/knem-1.1.1.90mlnx
>CUDA=/home/software/rhel6/cuda/5.5/cuda
>COMPILERS='CC=gcc CXX=g++ FC=gfortran'
>./configure \
>--prefix=$PREFIX \
>--mandir=${PREFIX}/man \
>--disable-dlopen \
>--enable-shared \
>--enable-java   \
>--enable-mpi-java \
>--with-cuda=$CUDA \
>--with-hwloc=internal \
>--with-verbs \
>--with-psm   \
>--with-tm=/usr/local/torque \
>--with-fca=$FCA \
>--with-mxm=$MXM \
>--with-knem=$KNEM \
>--with-io-romio-flags='--with-file-system=testfs+ufs+nfs+lustre' \
>   $COMPILERS
>
>Brock Palen
>www.umich.edu/~brockp
>CAEN Advanced Computing
>XSEDE Campus Champion
>bro...@umich.edu
>(734)936-1985
>
>
>
>
>* Unknown Key
>* 0x806FCF94


[OMPI users] CUDA-aware Users

2014-10-09 Thread Rolf vandeVaart
If you are utilizing the CUDA-aware support in Open MPI, can you send me an 
email with some information about the application and the cluster you are on.  
I will consolidate information.

Thanks,

Rolf (rvandeva...@nvidia.com)



Re: [OMPI users] CuEventCreate Failed...

2014-10-19 Thread Rolf vandeVaart
The error 304 corresponds to CUDA_ERROR_OPERATING_SYSTEM, which means an OS call 
failed.  However, I am not sure how that relates to the call that is getting 
the error.
Also, the last error you report is from MVAPICH2-GDR, not from Open MPI.  I 
guess then I have a few questions.


1.   Can you supply your configure line for Open MPI?

2.   Are you making use of CUDA-aware support?

3.   Are you set up so that users can use both Open MPI and MVAPICH2?

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Friday, October 17, 2014 6:48 PM
To: us...@open-mpi.org
Subject: [OMPI users] CuEventCreate Failed...

Hi All,

We have run into issues that don't really seem to materialize into incorrect 
results; nonetheless, we hope to figure out why we are getting them.

We have several test environments, ranging from one machine with, say, 1-16 
processes per node, to several machines with 1-16 processes each. All systems are 
certified by Nvidia and use Nvidia Tesla K40 GPUs.

We notice frequent situations of the following,

--

The call to cuEventCreate failed. This is a unrecoverable error and will

cause the program to abort.

  Hostname: aHost

  cuEventCreate return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetEventHandle failed. This is a unrecoverable error and will

cause the program to abort.

  cuIpcGetEventHandle return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol

cannot be used.

  cuIpcGetMemHandle return value:   304

  address: 0x700fd0400

Check the cuda.h file for what the return value means. Perhaps a reboot

of the node will clear the problem.

--

Now, our test suite still verifies results but this does cause the following 
when it happens,

The call to cuEventDestory failed. This is a unrecoverable error and will

cause the program to abort.

  cuEventDestory return value:   400

Check the cuda.h file for what the return value means.

--

---

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

---

--

mpiexec detected that one or more processes exited with non-zero status, thus 
causing

the job to be terminated. The first process to do so was:



  Process name: [[37290,1],2]

  Exit code:1


We have traced the code back to the following files:
-ompi/mca/common/cuda/common_cuda.c :: 
mca_common_cuda_construct_event_and_handle()

We also know the following:
-it happens on every machine on the very first entry to the function previously 
mentioned,
-it does not happen if the buffer size is under 128 bytes... likely a different 
mechanism is used for the IPC,

Last, here is an intermittent one that produces a lot of failed tests in our 
suite... when in fact the results are solid, apart from this error. It causes 
notifications and annoyances, and it would be nice to clean it up.

mpi_rank_3][cudaipc_allocate_ipc_region] 
[src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_ipc.c:487] cuda failed with 
mapping of buffer object failed


We have not been able to duplicate these errors in other MPI libs,

Thank you for your time & looking forward to your response,


Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.




Re: [OMPI users] CuEventCreate Failed...

2014-10-20 Thread Rolf vandeVaart
Hi:
I just tried running a program similar to yours with CUDA 6.5 and Open MPI and 
I could not reproduce.  Just to make sure I am doing things correctly, your 
example below is running with np=5 and on a single node? Which version of CUDA 
are you using?  Can you also send the output from nvidia-smi?  Also, based on 
the usage of -allow-run-as-root I assume you are running the program as root?


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Monday, October 20, 2014 1:59 PM
To: Open MPI Users
Subject: Re: [OMPI users] CuEventCreate Failed...

Thanks for your quick response,

1)mpiexec --allow-run-as-root --mca btl_openib_want_cuda_gdr 1 --mca 
btl_openib_cuda_rdma_limit 6 --mca mpi_common_cuda_event_max 1000 -n 5 
test/RunTests
2)Yes, cuda aware support using Mellanox IB,
3)Yes, we have the ability to use several version of OpenMPI, Mvapich2, etc.

Also, our defaults for openmpi-mca-params.conf are:

mtl=^mxm

btl=^usnic,tcp

btl_openib_flags=1


service nv_peer_mem status

nv_peer_mem module is loaded.

Kindest Regards,
-
Steven Eliuk,


From: Rolf vandeVaart <rvandeva...@nvidia.com>
Reply-To: Open MPI Users <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Sunday, October 19, 2014 at 7:33 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] CuEventCreate Failed...

The error 304 corresponds to CUDA_ERRROR_OPERATNG_SYSTEM which means an OS call 
failed.  However, I am not sure how that relates to the call that is getting 
the error.
Also, the last error you report is from MVAPICH2-GDR, not from Open MPI.  I 
guess then I have a few questions.


1.  Can you supply your configure line for Open MPI?

2.  Are you making use of CUDA-aware support?

3.  Are you set up so that users can use both Open MPI and MVAPICH2?

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Friday, October 17, 2014 6:48 PM
To: us...@open-mpi.org
Subject: [OMPI users] CuEventCreate Failed...

Hi All,

We have run into issues, that don't really seem to materialize into incorrect 
results, nonetheless, we hope to figure out why we are getting them.

We have several environments with test from one machine, with say 1-16 
processes per node, to several machines with 1-16 processes. All systems are 
certified from Nvidia and use Nvidia Tesla k40 GPUs.

We notice frequent situations of the following,

--

The call to cuEventCreate failed. This is a unrecoverable error and will

cause the program to abort.

  Hostname: aHost

  cuEventCreate return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetEventHandle failed. This is a unrecoverable error and will

cause the program to abort.

  cuIpcGetEventHandle return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol

cannot be used.

  cuIpcGetMemHandle return value:   304

  address: 0x700fd0400

Check the cuda.h file for what the return value means. Perhaps a reboot

of the node will clear the problem.

--

Now, our test suite still verifies results but this does cause the following 
when it happens,

The call to cuEventDestory failed. This is a unrecoverable error and will

cause the program to abort.

  cuEventDestory return value:   400

Check the cuda.h file for what the return value means.

--

---

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

---

--

mpiexec detected that one or more processes exited with non-zero status, thus 
causing

the job to be terminated. The first process to do so was:



  Process name: [[37290,1],2]

  Exit code:1


We have traced the code back to the following files:
-ompi/mca/common/cuda/common_cuda.c :: 
mca_common_cuda_construct_event_and_handle()

We also know the the following:
-it happens on every machine on the very first entry to the function previously 
mentioned,
-does not happen if the buffer size is under 128 bytes... likely a different 
mech. Used for the IPC,

Last, here is an intermittent one and it produces a lot failed tests in ou

Re: [OMPI users] Randomly long (100ms vs 7000+ms) fulfillment of MPI_Ibcast

2014-11-06 Thread Rolf vandeVaart
The CUDA person is now responding.  I will try and reproduce.  I looked through 
the zip file but did not see the mpirun command.   Can this be reproduced with 
-np 4 running across four nodes?
Also, in your original message you wrote "Likewise, it doesn't matter if I 
enable CUDA support or not. "  Can you provide more detail about what that 
means?
Thanks

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, November 06, 2014 1:05 PM
To: Open MPI Users
Subject: Re: [OMPI users] Randomly long (100ms vs 7000+ms) fulfillment of 
MPI_Ibcast

I was hoping our CUDA person would respond, but in the interim - I would 
suggest trying the nightly 1.8.4 tarball as we are getting ready to release it, 
and I know there were some CUDA-related patches since 1.8.1

http://www.open-mpi.org/nightly/v1.8/


On Nov 5, 2014, at 4:45 PM, Steven Eliuk <s.el...@samsung.com> wrote:

OpenMPI: 1.8.1 with CUDA RDMA...

Thanks sir and sorry for the late response,

Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.


From: Ralph Castain <rhc.open...@gmail.com>
Reply-To: Open MPI Users <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Monday, November 3, 2014 at 10:02 AM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Randomly long (100ms vs 7000+ms) fulfillment of 
MPI_Ibcast

Which version of OMPI were you testing?

On Nov 3, 2014, at 9:14 AM, Steven Eliuk <s.el...@samsung.com> wrote:

Hello,

We were using OpenMPI for some testing. Everything works fine, but randomly 
MPI_Ibcast()
takes a long time to finish. We have a standalone program just to test it.  The 
following
is the profiling results of the simple test program on our cluster:

Ibcast 604 mb takes 103 ms
Ibcast 608 mb takes 106 ms
Ibcast 612 mb takes 105 ms
Ibcast 616 mb takes 105 ms
Ibcast 620 mb takes 107 ms
Ibcast 624 mb takes 107 ms
Ibcast 628 mb takes 108 ms
Ibcast 632 mb takes 110 ms
Ibcast 636 mb takes 110 ms
Ibcast 640 mb takes 7437 ms
Ibcast 644 mb takes 115 ms
Ibcast 648 mb takes 111 ms
Ibcast 652 mb takes 112 ms
Ibcast 656 mb takes 112 ms
Ibcast 660 mb takes 114 ms
Ibcast 664 mb takes 114 ms
Ibcast 668 mb takes 115 ms
Ibcast 672 mb takes 116 ms
Ibcast 676 mb takes 116 ms
Ibcast 680 mb takes 116 ms
Ibcast 684 mb takes 122 ms
Ibcast 688 mb takes 7385 ms
Ibcast 692 mb takes 8729 ms
Ibcast 696 mb takes 120 ms
Ibcast 700 mb takes 124 ms
Ibcast 704 mb takes 121 ms
Ibcast 708 mb takes 8240 ms
Ibcast 712 mb takes 122 ms
Ibcast 716 mb takes 123 ms
Ibcast 720 mb takes 123 ms
Ibcast 724 mb takes 124 ms
Ibcast 728 mb takes 125 ms
Ibcast 732 mb takes 125 ms
Ibcast 736 mb takes 126 ms

As you can see, Ibcast sometimes takes a long time to finish, and it's totally random.
The same program was compiled and tested with MVAPICH2-gdr but it went smoothly.
Both tests were running exclusively on our four nodes cluster without 
contention. Likewise, it doesn't matter
if I enable CUDA support or not.  The followings are the configuration of our 
server:

We have four nodes in this test, each with one K40 GPU and connected with 
mellanox IB.

Please find attached config details and some sample code...

Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/11/25662.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/11/25695.php




Re: [OMPI users] Segmentation fault when using CUDA Aware feature

2015-01-12 Thread Rolf vandeVaart
That is strange, not sure why that is happening.  I will try to reproduce with 
your program on my system.  Also, perhaps you could rerun with –mca 
mpi_common_cuda_verbose 100 and send me that output.
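
For example (using the binary name from the commands quoted below):

mpirun -np 2 --mca mpi_common_cuda_verbose 100 ./s1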

Thanks

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Xun Gong
Sent: Sunday, January 11, 2015 11:41 PM
To: us...@open-mpi.org
Subject: [OMPI users] Segmentation fault when using CUDA Aware feature

Hi,

The OpenMpi I used is 1.8.4. I just tried to run a test program to see if the 
CUDA aware feature works. But I got the following errors.

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 s1
[ss-Inspiron-5439:32514] *** Process received signal ***
[ss-Inspiron-5439:32514] Signal: Segmentation fault (11)
[ss-Inspiron-5439:32514] Signal code: Address not mapped (1)
[ss-Inspiron-5439:32514] Failing at address: 0x3
[ss-Inspiron-5439:32514] [ 0] 
/lib/x86_64-linux-gnu/libc.so.6(+0x36c30)[0x7f74d7048c30]
[ss-Inspiron-5439:32514] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(+0x98a70)[0x7f74d70aaa70]
[ss-Inspiron-5439:32514] [ 2] 
/usr/local/openmpi-1.8.4/lib/libopen-pal.so.6(opal_convertor_pack+0x187)[0x7f74d673f097]
[ss-Inspiron-5439:32514] [ 3] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_btl_self.so(mca_btl_self_prepare_src+0xb8)[0x7f74ce196888]
[ss-Inspiron-5439:32514] [ 4] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x4c)[0x7f74cd2c183c]
[ss-Inspiron-5439:32514] [ 5] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5ba)[0x7f74cd2b78aa]
[ss-Inspiron-5439:32514] [ 6] 
/usr/local/openmpi-1.8.4/lib/libmpi.so.1(PMPI_Send+0xf2)[0x7f74d79602a2]
[ss-Inspiron-5439:32514] [ 7] s1[0x408b1e]
[ss-Inspiron-5439:32514] [ 8] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f74d7033ec5]
[ss-Inspiron-5439:32514] [ 9] s1[0x4088e9]
[ss-Inspiron-5439:32514] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 32514 on node ss-Inspiron-5439 
exited on signal 11 (Segmentation fault).

Looks like MPI_Send cannot send a CUDA buffer. But I already ran the command
  ./configure --with-cuda for OpenMPI.


The commands I used are:

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ nvcc -c k1.cu
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -c main.cc
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -o s1 main.o k1.o 
-L/usr/local/cuda/lib64/ -lcudart
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 ./s1



The code I'm running is

main.cc file
#include<iostream>
using namespace std;
#include<mpi.h>
#include"k1.h"
#define vect_len 16
const int blocksize = 16;

int main(int argv, char *argc[])
{
  int numprocs, myid;
  MPI_Status status;
  const int vect_size = vect_len*sizeof(int);

  int *vect1 = new int[vect_size];
  int *vect2 = new int[vect_size];
  int *result = new int[vect_size];
  bool flag;

  int *ad;
  int *bd;

  MPI_Init(&argv, &argc);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  if(myid == 0)
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = i;
  vect2[i] = 2 * i;
  }
  }
  else
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = 2 * i;
  vect2[i] = i;
  }
  }

  initializeGPU(vect1, vect2, ad, bd, vect_size);

  if(myid == 0)
  {
  for(int i = 0; i < numprocs; i++)
  {
  MPI_Send(ad,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  MPI_Send(bd,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  }
  }
  else
  {
  MPI_Recv(ad,vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
&status );
  MPI_Recv(bd,vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
&status );
  }



  computeGPU(blocksize, vect_len, ad, bd, result, vect_size);

  //Verify
  flag = true;

  for(int i = 0; i < vect_len; i++)
  {
  if (i < 8)
  vect1[i] += vect2[i];
  else
  vect1[i] -= vect2[i];

  }

  for(int i = 0; i < vect_len; i++)
  {
  if( result[i] != vect1[i] )
  {
  cout<<"the result [" file

#include"k1.h"

__global__ void vect_add(int *a, int *b, int n)
{

  int id = threadIdx.x;

  if (id < n)
  a[id] = a[id] + b[id];
  e

Re: [OMPI users] Segmentation fault when using CUDA Aware feature

2015-01-12 Thread Rolf vandeVaart
I think I found a bug in your program with how you were allocating the GPU 
buffers.  I will send you a version offlist with the fix.
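A common pitfall in code shaped like the quoted program is passing the device pointers 
to the allocation routine by value, so the addresses returned by cudaMalloc never reach 
the caller and MPI_Send later operates on an uninitialized pointer.  A hedged sketch of 
one correct pattern, reusing the names from the quoted program (this is an assumption 
for illustration, not necessarily the fix referred to above):

#include <cuda_runtime.h>

/* Sketch only: ad and bd are taken by reference so that the device addresses
   allocated here remain visible to the caller (main.cc) after the call returns. */
void initializeGPU(int *vect1, int *vect2, int *&ad, int *&bd, int vect_size)
{
    cudaMalloc((void **)&ad, vect_size);   /* vect_size is already in bytes */
    cudaMalloc((void **)&bd, vect_size);
    cudaMemcpy(ad, vect1, vect_size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, vect2, vect_size, cudaMemcpyHostToDevice);
}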
Also, there is no need to rerun with the flags I had mentioned below.
Rolf


From: Rolf vandeVaart
Sent: Monday, January 12, 2015 9:38 AM
To: us...@open-mpi.org
Subject: RE: [OMPI users] Segmentation fault when using CUDA Aware feature

That is strange, not sure why that is happening.  I will try to reproduce with 
your program on my system.  Also, perhaps you could rerun with –mca 
mpi_common_cuda_verbose 100 and send me that output.

Thanks

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Xun Gong
Sent: Sunday, January 11, 2015 11:41 PM
To: us...@open-mpi.org
Subject: [OMPI users] Segmentation fault when using CUDA Aware feature

Hi,

The OpenMpi I used is 1.8.4. I just tried to run a test program to see if the 
CUDA aware feature works. But I got the following errors.

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 s1
[ss-Inspiron-5439:32514] *** Process received signal ***
[ss-Inspiron-5439:32514] Signal: Segmentation fault (11)
[ss-Inspiron-5439:32514] Signal code: Address not mapped (1)
[ss-Inspiron-5439:32514] Failing at address: 0x3
[ss-Inspiron-5439:32514] [ 0] 
/lib/x86_64-linux-gnu/libc.so.6(+0x36c30)[0x7f74d7048c30]
[ss-Inspiron-5439:32514] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(+0x98a70)[0x7f74d70aaa70]
[ss-Inspiron-5439:32514] [ 2] 
/usr/local/openmpi-1.8.4/lib/libopen-pal.so.6(opal_convertor_pack+0x187)[0x7f74d673f097]
[ss-Inspiron-5439:32514] [ 3] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_btl_self.so(mca_btl_self_prepare_src+0xb8)[0x7f74ce196888]
[ss-Inspiron-5439:32514] [ 4] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x4c)[0x7f74cd2c183c]
[ss-Inspiron-5439:32514] [ 5] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5ba)[0x7f74cd2b78aa]
[ss-Inspiron-5439:32514] [ 6] 
/usr/local/openmpi-1.8.4/lib/libmpi.so.1(PMPI_Send+0xf2)[0x7f74d79602a2]
[ss-Inspiron-5439:32514] [ 7] s1[0x408b1e]
[ss-Inspiron-5439:32514] [ 8] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f74d7033ec5]
[ss-Inspiron-5439:32514] [ 9] s1[0x4088e9]
[ss-Inspiron-5439:32514] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 32514 on node ss-Inspiron-5439 
exited on signal 11 (Segmentation fault).

Looks like MPI_Send can not send CUDA buffer. But I already did  the command
  ./configure --with-cuda for OpenMPI.


The command I uesd is.

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ nvcc -c k1.cu
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -c main.cc
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -o s1 main.o k1.o 
-L/usr/local/cuda/lib64/ -lcudart
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 ./s1



The code I'm running is

main.cc file
#include
using namespace std;
#include
#include"k1.h"
#define vect_len 16
const int blocksize = 16;

int main(int argv, char *argc[])
{
  int numprocs, myid;
  MPI_Status status;
  const int vect_size = vect_len*sizeof(int);

  int *vect1 = new int[vect_size];
  int *vect2 = new int[vect_size];
  int *result = new int[vect_size];
  bool flag;

  int *ad;
  int *bd;

  MPI_Init(&argv, &argc);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  if(myid == 0)
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = i;
  vect2[i] = 2 * i;
  }
  }
  else
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = 2 * i;
  vect2[i] = i;
  }
  }

  initializeGPU(vect1, vect2, ad, bd, vect_size);

  if(myid == 0)
  {
  for(int i = 0; i < numprocs; i++)
  {
  MPI_Send(ad,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  MPI_Send(bd,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  }
  }
  else
  {
  MPI_Recv(ad,vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
&status );
  MPI_Recv(bd,vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
&status );
  }



  computeGPU(blocksize, vect_len, ad, bd, result, vect_size);

  //Verify
  flag = true;

  for(int i = 0; i < vect_len; i++)
  {
  if (i < 8)
  vect1[i] += vect2[i];
  

Re: [OMPI users] GPUDirect with OpenMPI

2015-02-11 Thread Rolf vandeVaart
Let me try to reproduce this.  This should not have anything to do with GPU 
Direct RDMA.  However, to eliminate it, you could run with:
--mca btl_openib_want_cuda_gdr 0.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Aulwes, Rob
Sent: Wednesday, February 11, 2015 2:17 PM
To: us...@open-mpi.org
Subject: [OMPI users] GPUDirect with OpenMPI

Hi,

I built OpenMPI 1.8.3 using PGI 14.7 and enabled CUDA support for CUDA 6.0.  I 
have a Fortran test code that tests GPUDirect and have included it here.  When 
I run it across 2 nodes using 4 MPI procs, sometimes it fails with incorrect 
results.  Specifically, sometimes rank 1 does not receive the correct value 
from one of the neighbors.

The code was compiled using PGI 14.7:
mpif90 -o direct.x -acc acc_direct.f90

and executed with:
mpirun -np 4 -npernode 2 -mca btl_openib_want_cudagdr 1 ./direct.x

Does anyone know if I'm missing something when using GPUDirect?

Thanks,Rob Aulwes


program acc_direct



 include 'mpif.h'





 integer :: ierr, rank, nranks

integer, dimension(:), allocatable :: i_ra



   call mpi_init(ierr)



   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

   rank = rank + 1

   write(*,*) 'hello from rank ',rank



   call MPI_COMM_SIZE(MPI_COMM_WORLD, nranks, ierr)



   allocate( i_ra(nranks) )



   call nb_exchange



   call mpi_finalize(ierr)





 contains



 subroutine nb_exchange



   integer :: i, j

   integer, dimension(nranks - 1) :: sendreq, recvreq

   logical :: done

   integer :: stat(MPI_STATUS_SIZE)



   i_ra = -1

   i_ra(rank) = rank



   !$acc data copy(i_ra(1:nranks))



   !$acc host_data use_device(i_ra)



   cnt = 0

   do i = 1,nranks

  if ( i .ne. rank ) then

 cnt = cnt + 1



 call MPI_ISEND(i_ra(rank), 1, MPI_INTEGER, i - 1, rank, &

MPI_COMM_WORLD, sendreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'isend call failed.'



 call MPI_IRECV(i_ra(i), 1, MPI_INTEGER, i - 1, i, &

MPI_COMM_WORLD, recvreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'irecv call failed.'



  endif



   enddo



   !$acc end host_data



   i = 0

   do while ( i .lt. 2*cnt )

 do j = 1, cnt

if ( recvreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(recvreq(j), done, stat, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif



if ( sendreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(sendreq(j), done, MPI_STATUS_IGNORE, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif

 enddo

   enddo



   write(*,*) rank,': nb_exchange: Updating host...'

   !$acc update host(i_ra(1:nranks))





   do j = 1, nranks

 if ( i_ra(j) .ne. j ) then

   write(*,*) 'isend/irecv failed.'

   write(*,*) 'rank', rank,': i_ra(',j,') = ',i_ra(j)

 endif

   enddo



   !$acc end data



 end subroutine





end program




Re: [OMPI users] GPUDirect with OpenMPI

2015-03-03 Thread Rolf vandeVaart
Hi Rob:
Sorry for the slow reply but it took me a while to figure this out.  It turns 
out that this issue had to do with how some of the memory within the smcuda BTL 
was being registered with CUDA.  This was fixed a few weeks ago and will be 
available in the 1.8.5 release.  Perhaps you could retry with a pre-release 
version of Open MPI 1.8.5 that is available here and confirm it fixes your 
issue.  Any of the ones listed on that page should be fine.

http://www.open-mpi.org/nightly/v1.8/

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Wednesday, February 11, 2015 3:50 PM
To: Open MPI Users
Subject: Re: [OMPI users] GPUDirect with OpenMPI

Let me try to reproduce this.  This should not have anything to do with GPU 
Direct RDMA.  However, to eliminate it, you could run with:
--mca btl_openib_want_cuda_gdr 0.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Aulwes, Rob
Sent: Wednesday, February 11, 2015 2:17 PM
To: us...@open-mpi.org
Subject: [OMPI users] GPUDirect with OpenMPI

Hi,

I built OpenMPI 1.8.3 using PGI 14.7 and enabled CUDA support for CUDA 6.0.  I 
have a Fortran test code that tests GPUDirect and have included it here.  When 
I run it across 2 nodes using 4 MPI procs, sometimes it fails with incorrect 
results.  Specifically, sometimes rank 1 does not receive the correct value 
from one of the neighbors.

The code was compiled using PGI 14.7:
mpif90 -o direct.x -acc acc_direct.f90

and executed with:
mpirun -np 4 -npernode 2 -mca btl_openib_want_cudagdr 1 ./direct.x

Does anyone know if I'm missing something when using GPUDirect?

Thanks,Rob Aulwes


program acc_direct



 include 'mpif.h'





 integer :: ierr, rank, nranks

integer, dimension(:), allocatable :: i_ra



   call mpi_init(ierr)



   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

   rank = rank + 1

   write(*,*) 'hello from rank ',rank



   call MPI_COMM_SIZE(MPI_COMM_WORLD, nranks, ierr)



   allocate( i_ra(nranks) )



   call nb_exchange



   call mpi_finalize(ierr)





 contains



 subroutine nb_exchange



   integer :: i, j

   integer, dimension(nranks - 1) :: sendreq, recvreq

   logical :: done

   integer :: stat(MPI_STATUS_SIZE)



   i_ra = -1

   i_ra(rank) = rank



   !$acc data copy(i_ra(1:nranks))



   !$acc host_data use_device(i_ra)



   cnt = 0

   do i = 1,nranks

  if ( i .ne. rank ) then

 cnt = cnt + 1



 call MPI_ISEND(i_ra(rank), 1, MPI_INTEGER, i - 1, rank, &

MPI_COMM_WORLD, sendreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'isend call failed.'



 call MPI_IRECV(i_ra(i), 1, MPI_INTEGER, i - 1, i, &

MPI_COMM_WORLD, recvreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'irecv call failed.'



  endif



   enddo



   !$acc end host_data



   i = 0

   do while ( i .lt. 2*cnt )

 do j = 1, cnt

if ( recvreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(recvreq(j), done, stat, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif



if ( sendreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(sendreq(j), done, MPI_STATUS_IGNORE, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif

 enddo

   enddo



   write(*,*) rank,': nb_exchange: Updating host...'

   !$acc update host(i_ra(1:nranks))





   do j = 1, nranks

 if ( i_ra(j) .ne. j ) then

   write(*,*) 'isend/irecv failed.'

   write(*,*) 'rank', rank,': i_ra(',j,') = ',i_ra(j)

 endif

   enddo



   !$acc end data



 end subroutine





end program





Re: [OMPI users] issue with openmpi + CUDA

2015-03-26 Thread Rolf vandeVaart
Hi Jason:
The issue is that Open MPI is (presumably) a 64-bit application and it is 
trying to load a 64-bit libcuda.so.1 but not finding one.  Making the link 
as you did will not fix the problem (as you saw).  In all my installations, I 
also have a 64-bit driver installed in /usr/lib64/libcuda.so.1 and everything 
works fine.

Let me investigate some more and get back to you.
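
For anyone hitting the same thing, a small stand-alone check can confirm whether
a usable 64-bit libcuda.so.1 is resolvable the same way Open MPI resolves it (via
dlopen).  This is an illustrative sketch added for this archive, not part of the
original exchange:

/* Minimal sketch: check whether dlopen() can resolve a usable libcuda.so.1,
 * which is roughly what a CUDA-aware Open MPI build does at start-up.
 * Build with: cc check_libcuda.c -ldl -o check_libcuda */
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *h = dlopen("libcuda.so.1", RTLD_NOW);
    if (h == NULL) {
        /* A typical failure here is "wrong ELF class: ELFCLASS32" when only
         * a 32-bit driver library is found on the search path. */
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    void *sym = dlsym(h, "cuInit");
    printf("libcuda.so.1 loaded, cuInit %s\n", sym ? "found" : "NOT found");
    dlclose(h);
    return 0;
}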

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zhisong Fu
>Sent: Wednesday, March 25, 2015 10:31 PM
>To: Open MPI Users
>Subject: [OMPI users] issue with openmpi + CUDA
>
>Hi,
>
>I just started to use openmpi and am trying to run a MPI/GPU code. My code
>compiles but when I run, I get this error:
>The library attempted to open the following supporting CUDA libraries, but
>each of them failed.  CUDA-aware support is disabled.
>/usr/lib/libcuda.so.1: wrong ELF class: ELFCLASS32
>/usr/lib/libcuda.so.1: wrong ELF class: ELFCLASS32 If you are not interested in
>CUDA-aware support, then run with --mca mpi_cuda_support 0 to suppress
>this message.  If you are interested in CUDA-aware support, then try setting
>LD_LIBRARY_PATH to the location of libcuda.so.1 to get passed this issue.
>
>I could not find a libcuda.so.1 in my system but I do find libcuda.so in
>/usr/local/cuda/lib64/stubs. Why is openmpi looking for libcuda.so.1 instead
>of libcuda.so?
>I created a symbolic link to libcuda.so, now I get CUDA error 35: CUDA driver
>version is insufficient for CUDA runtime version.
>I am not sure if this is related to libcuda.so or the driver since I could run 
>this
>code using mvapich.
>
>Any input on the issue is really appreciated.
>My openmpi version is 1.8.4, my cuda version is 6.5, driver version is 340.65.
>
>Thanks.
>Jason
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/03/26537.php


Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-27 Thread Rolf vandeVaart
Hi Lev:
I am not sure what is happening here, but there are a few things we can do to
try to narrow things down.
1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error 
will go away?
2. Do you know if when you see this error it happens on the first pass through 
your communications?  That is, you mention how there are multiple iterations 
through the loop and I am wondering when you see failures if it is the first 
pass through the loop.

Rolf
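
For reference, here is a stripped-down C sketch (an illustrative reconstruction,
not Lev's actual mpi4py code) of the communication pattern being described:
device buffers allocated once up front, a batch of nonblocking sends and receives
posted each iteration, and a single wait for the whole batch.  With a CUDA-aware
build the device pointers are passed directly to MPI.

/* Sketch of the pattern under discussion, assuming a CUDA-aware Open MPI.
 * Device buffers are preallocated; each loop iteration posts nonblocking
 * sends/receives on them and then waits for the whole batch. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    int peer = (rank + 1) % size;            /* simple ring, for illustration */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, n * sizeof(float));
    cudaMalloc((void **)&d_recv, n * sizeof(float));

    for (int iter = 0; iter < 10; iter++) {
        MPI_Request req[2];
        /* Device pointers go straight into MPI; depending on the transport the
         * library stages through host memory or uses CUDA IPC / GPUDirect. */
        MPI_Irecv(d_recv, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(d_send, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}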

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Friday, March 27, 2015 3:47 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today)
>built against OpenMPI 1.8.4 with CUDA support activated to asynchronously
>send GPU arrays between multiple Tesla GPUs (Fermi generation). Each MPI
>process is associated with a single GPU; the process has a run loop that starts
>several Isends to transmit the contents of GPU arrays to destination
>processes and several Irecvs to receive data from source processes into GPU
>arrays on the process' GPU. Some of the sends/recvs use one tag, while the
>remainder use a second tag. A single Waitall invocation is used to wait for 
>all of
>these sends and receives to complete before the next iteration of the loop
>can commence. All GPU arrays are preallocated before the run loop starts.
>While this pattern works most of the time, it sometimes fails with a segfault
>that appears to occur during an Isend:
>
>[myhost:05471] *** Process received signal *** [myhost:05471] Signal:
>Segmentation fault (11) [myhost:05471] Signal code:  (128) [myhost:05471]
>Failing at address: (nil) [myhost:05471] [ 0] /lib/x86_64-linux-
>gnu/libpthread.so.0(+0x10340)[0x2ac2bb176340]
>[myhost:05471] [ 1]
>/usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x1f6b18)[0x2ac2c48bfb18]
>[myhost:05471] [ 2]
>/usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x16dcc3)[0x2ac2c4836cc3]
>[myhost:05471] [ 3]
>/usr/lib/x86_64-linux-
>gnu/libcuda.so.1(cuIpcGetEventHandle+0x5d)[0x2ac2c480bccd]
>[myhost:05471] [ 4]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_common_cuda_construct_event_and_handle+0x27
>)[0x2ac2c27d3087]
>[myhost:05471] [ 5]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(ompi_free_list_grow+0x199)[0x2ac2c277b8e9]
>[myhost:05471] [ 6]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_mpool_gpusm_register+0xf4)[0x2ac2c28c9fd4]
>[myhost:05471] [ 7]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_pml_ob1_rdma_cuda_btls+0xcd)[0x2ac2c28f8afd]
>[myhost:05471] [ 8]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_pml_ob1_send_request_start_cuda+0xbf)[0x2ac2c
>28f8d5f]
>[myhost:05471] [ 9]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_pml_ob1_isend+0x60e)[0x2ac2c28eb6fe]
>[myhost:05471] [10]
>/opt/openmpi-1.8.4/lib/libmpi.so.1(MPI_Isend+0x137)[0x2ac2c27b7cc7]
>[myhost:05471] [11]
>/home/lev/Work/miniconda/envs/MYENV/lib/python2.7/site-
>packages/mpi4py/MPI.so(+0xd3bb2)[0x2ac2c24b3bb2]
>(Python-related debug lines omitted.)
>
>Any ideas as to what could be causing this problem?
>
>I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/03/26553.php


Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-30 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Sunday, March 29, 2015 10:11 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
>> >-Original Message-
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Friday, March 27, 2015 3:47 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
>> >arrays between multiple GPUs
>> >
>> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded
>> >today) built against OpenMPI 1.8.4 with CUDA support activated to
>> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
>> >generation). Each MPI process is associated with a single GPU; the
>> >process has a run loop that starts several Isends to transmit the
>> >contents of GPU arrays to destination processes and several Irecvs to
>> >receive data from source processes into GPU arrays on the process'
>> >GPU. Some of the sends/recvs use one tag, while the remainder use a
>> >second tag. A single Waitall invocation is used to wait for all of
>> >these sends and receives to complete before the next iteration of the loop
>can commence. All GPU arrays are preallocated before the run loop starts.
>> >While this pattern works most of the time, it sometimes fails with a
>> >segfault that appears to occur during an Isend:
>
>(snip)
>
>> >Any ideas as to what could be causing this problem?
>> >
>> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>>
>> Hi Lev:
>>
>> I am not sure what is happening here but there are a few things we can
>> do to try and narrow things done.
>> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error
>>will go away?
>
>Yes - that appears to be the case.
>
>> 2. Do you know if when you see this error it happens on the first pass
>through
>>your communications?  That is, you mention how there are multiple
>>iterations through the loop and I am wondering when you see failures if it
>>is the first pass through the loop.
>
>When the segfault occurs, it appears to always happen during the second
>iteration of the loop, i.e., at least one slew of Isends (and presumably 
>Irecvs)
>is successfully performed.
>
>Some more details regarding the Isends: each process starts two Isends for
>each destination process to which it transmits data. The Isends use two
>different tags, respectively; one is passed None (by design), while the other 
>is
>passed the pointer to a GPU array with nonzero length. The segfault appears
>to occur during the latter Isend.
>--

Lev, can you send me the test program off list?  I may try to create a C
version of the test and see if I can reproduce the problem.
Not sure at this point what is happening.

Thanks,
Rolf




Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-30 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rolf
>vandeVaart
>Sent: Monday, March 30, 2015 9:37 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>>Sent: Sunday, March 29, 2015 10:11 PM
>>To: Open MPI Users
>>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting
>>GPU arrays between multiple GPUs
>>
>>Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
>>> >-Original Message-
>>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>>> >Givon
>>> >Sent: Friday, March 27, 2015 3:47 PM
>>> >To: us...@open-mpi.org
>>> >Subject: [OMPI users] segfault during MPI_Isend when transmitting
>>> >GPU arrays between multiple GPUs
>>> >
>>> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded
>>> >today) built against OpenMPI 1.8.4 with CUDA support activated to
>>> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
>>> >generation). Each MPI process is associated with a single GPU; the
>>> >process has a run loop that starts several Isends to transmit the
>>> >contents of GPU arrays to destination processes and several Irecvs
>>> >to receive data from source processes into GPU arrays on the process'
>>> >GPU. Some of the sends/recvs use one tag, while the remainder use a
>>> >second tag. A single Waitall invocation is used to wait for all of
>>> >these sends and receives to complete before the next iteration of
>>> >the loop
>>can commence. All GPU arrays are preallocated before the run loop starts.
>>> >While this pattern works most of the time, it sometimes fails with a
>>> >segfault that appears to occur during an Isend:
>>
>>(snip)
>>
>>> >Any ideas as to what could be causing this problem?
>>> >
>>> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>>>
>>> Hi Lev:
>>>
>>> I am not sure what is happening here but there are a few things we
>>> can do to try and narrow things done.
>>> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this
>error
>>>will go away?
>>
>>Yes - that appears to be the case.
>>
>>> 2. Do you know if when you see this error it happens on the first
>>> pass
>>through
>>>your communications?  That is, you mention how there are multiple
>>>iterations through the loop and I am wondering when you see failures if 
>>> it
>>>is the first pass through the loop.
>>
>>When the segfault occurs, it appears to always happen during the second
>>iteration of the loop, i.e., at least one slew of Isends (and
>>presumably Irecvs) is successfully performed.
>>
>>Some more details regarding the Isends: each process starts two Isends
>>for each destination process to which it transmits data. The Isends use
>>two different tags, respectively; one is passed None (by design), while
>>the other is passed the pointer to a GPU array with nonzero length. The
>>segfault appears to occur during the latter Isend.
>>--
>
>Lev, can you send me the test program off list.  I may try to create a C 
>version
>of the test and see if I can reproduce the problem.
>Not sure at this point what is happening.
>
>Thanks,
>Rolf
>
We figured out what was going on and I figured I would post here in case others 
see it.

After running for a while, some files related to CUDA IPC may get left behind in
the /dev/shm directory.  These stale files can sometimes cause problems with later
runs, causing errors (or SEGVs) when calling some CUDA APIs.  The solution is to
clear out that directory periodically.

This issue is fixed in CUDA 7.0.
Rolf
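
For illustration only (this is not from the original thread), a periodic cleanup
of those stale IPC files could look like the sketch below.  The "cuda.shm." prefix
is taken from the error reports above, and the sketch assumes no MPI/CUDA jobs are
running on the node while it executes.

/* Sketch: remove leftover cuda.shm.* files from /dev/shm.
 * Only run this when no MPI/CUDA jobs are active on the node. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/dev/shm";
    DIR *d = opendir(dir);
    if (d == NULL) {
        perror("opendir");
        return 1;
    }
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, "cuda.shm.", 9) == 0) {
            char path[4096];
            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
            if (unlink(path) == 0)
                printf("removed %s\n", path);
            else
                perror(path);
        }
    }
    closedir(d);
    return 0;
}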


Re: [OMPI users] Different HCA from different OpenMP threads (same rank using MPI_THREAD_MULTIPLE)

2015-04-06 Thread Rolf vandeVaart
It is my belief that you cannot do this, at least with the openib BTL.  The IB
card to be used for communication is selected during the MPI_Init() phase
based on where the CPU process is bound.  You can see some of this selection
by using the --mca btl_base_verbose 1 flag.  There is a bunch of output (which
I have deleted), but you will see a few lines like this:

[ivy5] [rank=1] openib: using port mlx5_0:1
[ivy5] [rank=1] openib: using port mlx5_0:2
[ivy4] [rank=0] openib: using port mlx5_0:1
[ivy4] [rank=0] openib: using port mlx5_0:2

And if you have multiple NICs, you may also see some messages like this:
 "[rank=%d] openib: skipping device %s; it is too far away"
(This was lifted from the code. I do not have a configuration right now where
I can generate the second message.)

I cannot see how we can make this specific to a thread.  Maybe others have a 
different opinion.
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, April 06, 2015 5:46 AM
>To: Open MPI Users
>Cc: Mohammed Sourouri
>Subject: [OMPI users] Different HCA from different OpenMP threads (same
>rank using MPI_THREAD_MULTIPLE)
>
>Dear Open MPI developers,
>
>I wonder if there is a way to address this particular scenario using MPI_T or
>other strategies in Open MPI. I saw a similar discussion few days ago, I assume
>the same challenges are applied in this case but I just want to check. Here is
>the scenario:
>
>We have a system composed by dual rail Mellanox IB, two distinct Connect-IB
>cards per node each one sitting on a different PCI-E lane out of two distinct
>sockets. We are seeking a way to control MPI traffic thought each one of
>them directly into the application. In specific we have a single MPI rank per
>node that goes multi-threading using OpenMP. MPI_THREAD_MULTIPLE is
>used, each OpenMP thread may initiate MPI communication. We would like to
>assign IB-0 to thread 0 and IB-1 to thread 1.
>
>Via mpirun or env variables we can control which IB interface to use by binding
>it to a specific MPI rank (or by apply a policy that relate IB to MPi ranks). 
>But if
>there is only one MPI rank active, how we can differentiate the traffic across
>multiple IB cards?
>
>Thanks in advance for any suggestion about this matter.
>
>Regards,
>Filippo
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://filippospiga.info ~ skype: filippo.spiga
>
>«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
>*
>Disclaimer: "Please note this message and any attachments are
>CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>The contents are not to be disclosed to anyone other than the addressee.
>Unauthorized recipients are requested to preserve this confidentiality and to
>advise the sender immediately of any error in transmission."
>
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/04/26614.php



Re: [OMPI users] Different HCA from different OpenMP threads (same rank using MPI_THREAD_MULTIPLE)

2015-04-07 Thread Rolf vandeVaart
I still do not believe there is a way for you to steer your traffic based on 
the thread that is calling into Open MPI. While you can spawn your own threads, 
Open MPI is going to figure out what interfaces to use based on the 
characteristics of the process during MPI_Init.  Even if Open MPI decides to 
use two interfaces, the use of these will be done based on the process.  It 
will alternate between them independent of which thread happens to be doing the 
sends or receives.  There is no way of doing this with something like 
MPI_T_cvar_write which I think is what you were looking for.

Rolf  
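
For completeness, this is roughly what control-variable access through the MPI_T
tools interface looks like from C.  It is an illustrative sketch only: as stated
above it still acts on the whole process, not on an individual thread, and whether
a given variable exists and is writable after MPI_Init depends on the Open MPI
version and the variable's scope.

/* Sketch of reading (and attempting to write) an MCA parameter via MPI_T.
 * The variable name below is just an example taken from this thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, idx, count;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    if (MPI_T_cvar_get_index("btl_openib_want_cuda_gdr", &idx) == MPI_SUCCESS) {
        MPI_T_cvar_handle handle;
        int value = 0;
        /* NULL object handle: the variable is bound to the whole MPI process. */
        MPI_T_cvar_handle_alloc(idx, NULL, &handle, &count);
        MPI_T_cvar_read(handle, &value);
        printf("btl_openib_want_cuda_gdr = %d\n", value);
        /* A write may be rejected if the variable can only be set before
         * MPI_Init; always check the return code. */
        MPI_T_cvar_write(handle, &value);
        MPI_T_cvar_handle_free(&handle);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}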

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Tuesday, April 07, 2015 5:46 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Different HCA from different OpenMP threads
>(same rank using MPI_THREAD_MULTIPLE)
>
>Thanks Rolf and Ralph for the replies!
>
>On Apr 6, 2015, at 10:37 PM, Ralph Castain  wrote:
>> That said, you can certainly have your application specify a thread-level
>binding. You’d have to do the heavy lifting yourself in the app, I’m afraid,
>instead of relying on us to do it for you
>
>Ok, my application must do it and I am fine with it. But how? I mean, does
>Open MPi expose some API that allows such fine grain control?
>
>F
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://filippospiga.info ~ skype: filippo.spiga




Re: [OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for absolute path to libcuda.so.1

2015-04-29 Thread Rolf vandeVaart
Hi Lev:
Any chance you can try Open MPI 1.8.5rc3 and see if you see the same behavior?  
That code has changed a bit from the 1.8.4 series and I am curious if you will 
still see the same issue.  

http://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.5rc3.tar.gz

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Wednesday, April 29, 2015 10:54 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for absolute
>path to libcuda.so.1
>
>I'm trying to build/package OpenMPI 1.8.4 with CUDA support enabled on
>Linux
>x86_64 so that the compiled software can be downloaded/installed as one of
>the dependencies of a project I'm working on with no further user
>configuration.  I noticed that MPI programs built with the above will try to
>access
>/usr/lib/i386-linux-gnu/libcuda.so.1 (and obviously complain about it being the
>wrong ELF class) if /usr/lib/i386-linux-gnu precedes /usr/lib/x86_64-linux-gnu
>in one's ld.so cache. While one can get around this by modifying one's ld.so
>configuration (or tweaking LD_LIBRARY_PATH), is there some way to compile
>OpenMPI such that programs built with it (on x86_64) look for the full soname
>of
>libcuda.so.1 - i.e., /usr/lib/x86_64-linux-gnu/libcuda.so.1 - rather than fall 
>back
>on ld.so? I tried setting the rpath of MPI programs built with the above (by
>modifying the OpenMPI compiler wrappers to include -Wl,-rpath -
>Wl,/usr/lib/x86_64-linux-gnu), but that doesn't seem to help.
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/04/26809.php


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Rolf vandeVaart
I am not sure why you are seeing this.  One thing that is clear is that you 
have found a bug in the error reporting.  The error message is a little garbled 
and I see a bug in what we are reporting. I will fix that.

If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My 
expectation is that you will not see any errors, but may lose some performance.

What does your hardware configuration look like?  Can you send me output from 
"nvidia-smi topo -m"

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 6:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>1.8.5 with CUDA 7.0 and Multi-Process Service
>
>I'm encountering intermittent errors while trying to use the Multi-Process
>Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
>by multiple MPI processes that perform GPU-to-GPU communication with
>each other (i.e., GPU pointers are passed to the MPI transmission primitives).
>I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
>which is in turn built against CUDA 7.0. In my current configuration, I have 4
>MPS server daemons running, each of which controls access to one of 4 GPUs;
>the MPI processes spawned by my program are partitioned into 4 groups
>(which might contain different numbers of processes) that each talk to a
>separate daemon. For certain transmission patterns between these
>processes, the program runs without any problems. For others (e.g., 16
>processes partitioned into 4 groups), however, it dies with the following 
>error:
>
>[node05:20562] Failed to register remote memory, rc=-1
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   21199360
>  address: 0x1
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file
>pml_ob1_recvreq.c at line 477
>---
>Child job 2 terminated normally, but 1 process returned a non-zero exit code..
>Per user-direction, the job has been aborted.
>---
>[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
>mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
>[node05:20564] Failed to register remote memory, rc=-1 [node05:20564]
>[[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20566] Failed to register remote memory, rc=-1 [node05:20566]
>[[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20567] Failed to register remote memory, rc=-1 [node05:20567]
>[[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
>mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>[node05:20569] Failed to register remote memory, rc=-1 [node05:20569]
>[[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20571] Failed to register remote memory, rc=-1 [node05:20571]
>[[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20572] Failed to register remote memory, rc=-1 [node05:20572]
>[[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>
>After the above error occurs, I notice that /dev/shm/ is littered with
>cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
>but that doesn't seem to have any effect upon the problem. Rebooting the
>machine also doesn't have any effect. I should also add that my program runs
>without any error if the groups of MPI processes talk directly to the GPUs
>instead of via MPS.
>
>Does anyone have any ideas as to what could be going on?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/05/26881.php


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-20 Thread Rolf vandeVaart
-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 10:25 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
>> >-Original Message-
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Tuesday, May 19, 2015 6:30 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>> >1.8.5 with CUDA 7.0 and Multi-Process Service
>> >
>> >I'm encountering intermittent errors while trying to use the
>> >Multi-Process Service with CUDA 7.0 for improving concurrent access
>> >to a Kepler K20Xm GPU by multiple MPI processes that perform
>> >GPU-to-GPU communication with each other (i.e., GPU pointers are
>passed to the MPI transmission primitives).
>> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI
>> >1.8.5, which is in turn built against CUDA 7.0. In my current
>> >configuration, I have 4 MPS server daemons running, each of which
>> >controls access to one of 4 GPUs; the MPI processes spawned by my
>> >program are partitioned into 4 groups (which might contain different
>> >numbers of processes) that each talk to a separate daemon. For
>> >certain transmission patterns between these processes, the program
>> >runs without any problems. For others (e.g., 16 processes partitioned into
>4 groups), however, it dies with the following error:
>> >
>> >[node05:20562] Failed to register remote memory, rc=-1
>> >-
>> >- The call to cuIpcOpenMemHandle failed. This is an unrecoverable
>> >error and will cause the program to abort.
>> >  cuIpcOpenMemHandle return value:   21199360
>> >  address: 0x1
>> >Check the cuda.h file for what the return value means. Perhaps a
>> >reboot of the node will clear the problem.
>
>(snip)
>
>> >After the above error occurs, I notice that /dev/shm/ is littered
>> >with
>> >cuda.shm.* files. I tried cleaning up /dev/shm before running my
>> >program, but that doesn't seem to have any effect upon the problem.
>> >Rebooting the machine also doesn't have any effect. I should also add
>> >that my program runs without any error if the groups of MPI processes
>> >talk directly to the GPUs instead of via MPS.
>> >
>> >Does anyone have any ideas as to what could be going on?
>>
>> I am not sure why you are seeing this.  One thing that is clear is
>> that you have found a bug in the error reporting.  The error message
>> is a little garbled and I see a bug in what we are reporting. I will fix 
>> that.
>>
>> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc
>> 0.  My expectation is that you will not see any errors, but may lose
>> some performance.
>
>The error does indeed go away when IPC is disabled, although I do want to
>avoid degrading the performance of data transfers between GPU memory
>locations.
>
>> What does your hardware configuration look like?  Can you send me
>> output from "nvidia-smi topo -m"
>--

I see that you mentioned you are starting 4 MPS daemons.  Are you following the 
instructions here?

http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
 

This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA 
IPC. Since you are using CUDA 7 there is no more need to start multiple 
daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single MPS 
control daemon which will handle all GPUs.  Can you try that?  Because of this 
question, we realized we need to update our documentation as well.

Thanks,
Rolf
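
To illustrate that last point (this sketch is not from the original message): with
one MPS control daemon per node and CUDA_VISIBLE_DEVICES left untouched, each rank
simply selects its GPU explicitly, for example by node-local rank.

/* Sketch: single MPS control daemon, CUDA_VISIBLE_DEVICES unset.
 * Each rank picks a device based on its node-local rank. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    int local_rank, ndev = 0;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(local_rank % ndev);        /* all contexts funnel through MPS */
    printf("local rank %d using device %d of %d\n",
           local_rank, local_rank % ndev, ndev);

    /* ... normal CUDA-aware MPI communication follows ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}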




Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Rolf vandeVaart
Answers below...
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Thursday, May 21, 2015 2:19 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:
>> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
>>
>> (snip)
>>
>> > I see that you mentioned you are starting 4 MPS daemons.  Are you
>> > following the instructions here?
>> >
>> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-se
>> > rvice-mps.html
>>
>> Yes - also
>>
>https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overvie
>w
>> .pdf
>>
>> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems
>> > for CUDA IPC. Since you are using CUDA 7 there is no more need to
>> > start multiple daemons. You simply leave CUDA_VISIBLE_DEVICES
>> > untouched and start a single MPS control daemon which will handle all
>GPUs.  Can you try that?
>>
>> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value
>> should be passed to all MPI processes.
There is no need to do anything with CUDA_MPS_PIPE_DIRECTORY with CUDA 7.  

>>
>> Several questions related to your comment above:
>>
>> - Should the MPI processes select and initialize the GPUs they respectively 
>> need
>>   to access as they normally would when MPS is not in use?
Yes.  

>> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS 
>> (and
>>   hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES 
>> to
>>   control GPU resource allocation, and I would like to run my program (and 
>> the
>>   MPS control daemon) on a cluster via SLURM.
Yes, I believe that is true.  

>> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
>>   MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU 
>> setting
>>   with CUDA 6.5 even when one starts multiple MPS control daemons as  
>> described
>>   in the aforementioned blog post?
>
>Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to
>solve the problem when IPC is enabled.
>--
Glad to see this worked.  And you are correct that CUDA IPC will not work 
between devices if they are segregated by the use of CUDA_VISIBLE_DEVICES as we 
do with MPS in 6.5.

Rolf
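
A quick way to sanity-check whether two GPUs on a node can reach each other
directly is to query peer-access capability; if a pair reports no access, the
CUDA IPC path will generally not be used between them and traffic falls back to
staging through host memory.  This is an illustrative sketch, only an
approximation of the checks Open MPI performs internally.

/* Sketch: print which device pairs report peer-access capability. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; i++) {
        for (int j = 0; j < ndev; j++) {
            if (i == j)
                continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("device %d -> device %d : peer access %s\n",
                   i, j, can ? "yes" : "no");
        }
    }
    return 0;
}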


Re: [OMPI users] Problems running linpack benchmark on old Sunfire opteron nodes

2015-05-26 Thread Rolf vandeVaart
I think we bumped up a default value in Open MPI 1.8.5.  To go back to the old 
64Mbyte value try running with:

--mca mpool_sm_min_size 67108864

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Aurélien Bouteiller
Sent: Tuesday, May 26, 2015 10:10 AM
To: Open MPI Users
Subject: Re: [OMPI users] Problems running linpack benchmark on old Sunfire 
opteron nodes

You can also change the location of tmp files with the following mca option:
-mca orte_tmpdir_base /some/place

ompi_info --param all all -l 9 | grep tmp
MCA orte: parameter "orte_tmpdir_base" (current value: "", data 
source: default, level: 9 dev/all, type: string)
MCA orte: parameter "orte_local_tmpdir_base" (current value: 
"", data source: default, level: 9 dev/all, type: string)
MCA orte: parameter "orte_remote_tmpdir_base" (current value: 
"", data source: default, level: 9 dev/all, type: string)

--
Aurélien Bouteiller ~~ https://icl.cs.utk.edu/~bouteill/

Le 23 mai 2015 à 03:55, Gilles Gouaillardet 
mailto:gilles.gouaillar...@gmail.com>> a écrit :

Bill,

the root cause is likely there is not enough free space in /tmp.

the simplest, but slowest, option is to run mpirun --mca btl tcp ...
if you cannot make enough space under /tmp (maybe you run diskless)
there are some options to create these kind of files under /dev/shm

Cheers,

Gilles


On Saturday, May 23, 2015, Lane, William 
mailto:william.l...@cshs.org>> wrote:
I've compiled the linpack benchmark using openMPI 1.8.5 libraries
and include files on CentOS 6.4.

I've tested the binary on the one Intel node (some
sort of 4-core Xeon) and it runs, but when I try to run it on any of
the old Sunfire opteron compute nodes it appears to hang (although
top indicates CPU and memory usage) and eventually terminates
by itself. I'm also getting the following openMPI error messages/warnings:

mpirun -np 16 --report-bindings --hostfile hostfile --prefix 
/hpc/apps/mpi/openmpi/1.8.5-dev --mca btl_tcp_if_include eth0 xhpl

[cscld1-0-6:24370] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-3:24734] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-7:25152] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-4:18079] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-8:21443] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-2:19704] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-5:13481] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-0:21884] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1:24240] 7 more processes have sent help message help-opal-shmem-mmap.txt 
/ target full

Note these errors also occur when I try to run the linpack benchmark on a single
node as well.

Does anyone know what's going on here? Google came up w/nothing and I have no
idea what a BTL coordinating structure is.

-Bill L.
IMPORTANT WARNING: This message is intended for the use of the person or entity 
to which it is addressed and may contain information that is privileged and 
confidential, the disclosure of which is governed by applicable law. If the 
reader of this message is not the intended recipient, or the employee or agent 
responsible for delivering it to the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this information is 
strictly prohibited. Thank you for your cooperation.
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/26907.php





Re: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

2015-06-17 Thread Rolf vandeVaart
Hi Fei:

The reduction support for CUDA-aware in Open MPI is rather simple.  The GPU 
buffers are copied into temporary host buffers and then the reduction is done 
with the host buffers.  At the completion of the host reduction, the data is 
copied back into the GPU buffers.  So, there is no use of CUDA IPC or GPU 
Direct RDMA in the reduction.

Rolf
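
In other words, the current behavior amounts to doing the staging by hand.  A
sketch of that equivalent pattern (an illustration, not the actual Open MPI
source) looks like this:

/* Sketch: what a host-staged reduction of GPU buffers amounts to.
 * A CUDA-aware MPI_Reduce on device pointers behaves roughly like this
 * internally; doing it by hand produces the same data movement. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void reduce_gpu_buffers(const float *d_send, float *d_recv, int n,
                        int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    float *h_send = (float *)malloc(n * sizeof(float));
    float *h_recv = (float *)malloc(n * sizeof(float));

    /* 1. Copy each rank's contribution from device to host. */
    cudaMemcpy(h_send, d_send, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* 2. Perform the reduction on host buffers. */
    MPI_Reduce(h_send, h_recv, n, MPI_FLOAT, MPI_SUM, root, comm);

    /* 3. Copy the result back to the device on the root. */
    if (rank == root)
        cudaMemcpy(d_recv, h_recv, n * sizeof(float), cudaMemcpyHostToDevice);

    free(h_send);
    free(h_recv);
}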

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Fei Mao
Sent: Wednesday, June 17, 2015 1:08 PM
To: us...@open-mpi.org
Subject: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

Hi there,

I am doing benchmarks on a GPU cluster with two CPU sockets and 4 K80 GPUs each 
node. Two K80 are connected with CPU socket 0, another two with socket 1. An IB 
ConnectX-3 (FDR) is also under socket 1. We are using Linux's OFED, so I know 
there is no way to do GPU RDMA inter-node communication. I can do intra-node 
IPC for MPI_Send and MPI_Receive with two K80 (4 GPUs in total) which are 
connected under same socket (PCI-e switch). So I thought I could do intra-node 
MPI_Reduce with IPC support in openmpi 1.8.5.

The benchmark I was using is osu-micro-benchmarks-4.4.1, and I got the same 
results when I use two GPU under the same socket or different socket. The 
result was the same even I used two GPUs in different nodes.

Does MPI_Reduce use IPC for intra-node? Should I have to install Mellanox OFED 
stack to support GPU RDMA reduction on GPUs even they are under with the same 
PCI-e switch?

Thanks,

Fei Mao
High Performance Computing Technical Consultant
SHARCNET | http://www.sharcnet.ca
Compute/Calcul Canada | http://www.computecanada.ca



Re: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

2015-06-17 Thread Rolf vandeVaart
There is no short-term plan but we are always looking at ways to improve things 
so this could be looked at some time in the future.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Fei Mao
Sent: Wednesday, June 17, 2015 1:48 PM
To: Open MPI Users
Subject: Re: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

Hi Rolf,

Thank you very much for clarifying the problem. Is there any plan to support 
GPU RDMA for reduction in the future?

On Jun 17, 2015, at 1:38 PM, Rolf vandeVaart 
mailto:rvandeva...@nvidia.com>> wrote:


Hi Fei:

The reduction support for CUDA-aware in Open MPI is rather simple.  The GPU 
buffers are copied into temporary host buffers and then the reduction is done 
with the host buffers.  At the completion of the host reduction, the data is 
copied back into the GPU buffers.  So, there is no use of CUDA IPC or GPU 
Direct RDMA in the reduction.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Fei Mao
Sent: Wednesday, June 17, 2015 1:08 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Subject: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

Hi there,

I am doing benchmarks on a GPU cluster with two CPU sockets and 4 K80 GPUs each 
node. Two K80 are connected with CPU socket 0, another two with socket 1. An IB 
ConnectX-3 (FDR) is also under socket 1. We are using Linux's OFED, so I know 
there is no way to do GPU RDMA inter-node communication. I can do intra-node 
IPC for MPI_Send and MPI_Receive with two K80 (4 GPUs in total) which are 
connected under same socket (PCI-e switch). So I thought I could do intra-node 
MPI_Reduce with IPC support in openmpi 1.8.5.

The benchmark I was using is osu-micro-benchmarks-4.4.1, and I got the same 
results when I use two GPU under the same socket or different socket. The 
result was the same even I used two GPUs in different nodes.

Does MPI_Reduce use IPC for intra-node? Should I have to install Mellanox OFED 
stack to support GPU RDMA reduction on GPUs even they are under with the same 
PCI-e switch?

Thanks,

Fei Mao
High Performance Computing Technical Consultant
SHARCNET | http://www.sharcnet.ca<http://www.sharcnet.ca/>
Compute/Calcul Canada | 
http://www.computecanada.ca<http://www.computecanada.ca/>


___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/06/27147.php



Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-06-30 Thread Rolf vandeVaart
Hi Steven,
Thanks for the report.  Very little has changed between 1.8.5 and 1.8.6 within 
the CUDA-aware specific code so I am perplexed.  Also interesting that you do 
not see the issue with 1.8.5 and CUDA 7.0.
You mentioned that it is hard to share the code on this but maybe you could 
share how you observed the behavior.  Does the code need to run for a while to 
see this?
Any suggestions on how I could reproduce this?

Thanks,
Rolf


From: Steven Eliuk [mailto:s.el...@samsung.com]
Sent: Tuesday, June 30, 2015 6:05 PM
To: Rolf vandeVaart
Cc: Open MPI Users
Subject: 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi All,

Looks like we have found a large memory leak,

Very difficult to share code on this but here are some details,

1.8.5 w/ Cuda 7.0 - no memory leak
1.8.5 w/ cuda 6.5 - no memory leak
1.8.6 w/ cuda 7.0 - large memory leak
1.8.5 w/ cuda 6.5 - no memory leak
mvapich2 2.1 GDR - no issue on either flavor of CUDA.

We have a relatively basic program that reproduces the error and have even 
narrowed it back to a single machine w/ multiple gpus and only two slaves. 
Looks like something in the IPC within a single node,

We don't have many free cycles at the moment but less us know if we can help w/ 
something basic,

Here's our config flags for 1.8.5:

./configure FC=gfortran --without-mx --with-openib=/usr 
--with-openib-libdir=/usr/lib64/ --enable-openib-rdmacm --without-psm 
--with-cuda=/cm/shared/apps/cuda70/toolkit/current 
--prefix=/cm/shared/OpenMPI_1_8_5_CUDA70

Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Project Lead,
Computing Science Innovation Center,
SRA - SV,
Samsung Electronics,
665 Clyde Avenue,
Mountain View, CA 94043,
Work: +1 650-623-2986,
Cell: +1 408-819-4407.




Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-01 Thread Rolf vandeVaart
Hi Stefan (and Steven, who reported this earlier with a CUDA-aware program),

I have managed to observe the leak when running LAMMPS as well.  Note that
this has nothing to do with CUDA-aware features.  I am going to move this
discussion to the Open MPI developer's list to dig deeper into this issue.
Thanks for reporting.

Rolf



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Stefan Paquay
Sent: Wednesday, July 01, 2015 11:43 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi all,
Hopefully this mail gets posted in the right thread...
I have noticed the (I guess same) leak using OpenMPI 1.8.6 with LAMMPS, a 
molecular dynamics program, without any use of CUDA. I am not that familiar 
with how the internal memory management of LAMMPS works, but it does not appear 
CUDA-related.
The symptoms are the same:
OpenMPI 1.8.5: everything is fine
OpenMPI 1.8.6: same setup, pretty large leak
Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone 
git://git.lammps.org/lammps-ro.git lammps)
2. cd src/, compile with openMPI 1.8.6
3. run the example listed in lammps/examples/melt
I would like to help find this bug but I am not sure what would help. LAMMPS 
itself is pretty big so I can imagine you might not want to go through all of 
the code...




Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-06 Thread Rolf vandeVaart
Just an FYI that this issue has been found and fixed and will be available in 
the next release.
https://github.com/open-mpi/ompi-release/pull/357

Rolf

From: Rolf vandeVaart
Sent: Wednesday, July 01, 2015 4:47 PM
To: us...@open-mpi.org
Subject: RE: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak


Hi Stefan (and Steven who reported this earlier with CUDA-aware program)



I have managed to observed the leak when running LAMMPS as well.  Note that 
this has nothing to do with CUDA-aware features.  I am going to move this 
discussion to the Open MPI developer’s list to dig deeper into this issue.  
Thanks for reporting.



Rolf



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Stefan Paquay
Sent: Wednesday, July 01, 2015 11:43 AM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Subject: Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi all,
Hopefully this mail gets posted in the right thread...
I have noticed the (I guess same) leak using OpenMPI 1.8.6 with LAMMPS, a 
molecular dynamics program, without any use of CUDA. I am not that familiar 
with how the internal memory management of LAMMPS works, but it does not appear 
CUDA-related.
The symptoms are the same:
OpenMPI 1.8.5: everything is fine
OpenMPI 1.8.6: same setup, pretty large leak
Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone 
git://git.lammps.org/lammps-ro.git<http://git.lammps.org/lammps-ro.git> lammps)
2. cd src/, compile with openMPI 1.8.6
3. run the example listed in lammps/examples/melt
I would like to help find this bug but I am not sure what would help. LAMMPS 
itself is pretty big so I can imagine you might not want to go through all of 
the code...




Re: [OMPI users] openmpi 1.8.7 build error with cuda support using pgi compiler 15.4

2015-08-04 Thread Rolf vandeVaart
Hi Shahzeb:
I believe another colleague of mine may have helped you with this issue (I was 
not around last week).  However, to help me better understand the issue you are 
seeing, could you send me your config.log file  from when you did the 
configuration?  You can just send to rvandeva...@nvidia.com.
Thanks, Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Shahzeb
>Sent: Thursday, July 30, 2015 9:45 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] openmpi 1.8.7 build error with cuda support using pgi
>compiler 15.4
>
>Hello,
>
>I am getting error in my make and make installl  with building OpenMPI with
>CUDA support using PGI compiler. Please help me fix this problem.
>No clue why it is happening. We are using PGI 15.4
>
>  ./configure --prefix=/usr/global/openmpi/pgi/1.8.7 CC=pgcc CXX=pgCC
>FC=pgfortran --with-cuda=/usr/global/cuda/7.0/include/
>
>
> fi
>make[2]: Leaving directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/sm'
>Making all in mca/common/verbs
>make[2]: Entering directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/verbs'
>if test -z "libmca_common_verbs.la"; then \
>   rm -f "libmca_common_verbs.la"; \
>   ln -s "libmca_common_verbs_noinst.la" "libmca_common_verbs.la"; \
> fi
>make[2]: Leaving directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/verbs'
>Making all in mca/common/cuda
>make[2]: Entering directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/cuda'
>   CC   common_cuda.lo
>PGC-S-0039-Use of undeclared variable
>mca_common_cuda_cumemcpy_async
>(common_cuda.c: 320)
>PGC-S-0039-Use of undeclared variable libcuda_handle (common_cuda.c:
>396)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 396)
>PGC-S-0103-Illegal operand types for comparison operator (common_cuda.c:
>397)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 441)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 441)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 442)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 442)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 443)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 443)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 444)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 444)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 445)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 445)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 446)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 446)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 447)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 447)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 448)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 448)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 449)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 449)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 450)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 450)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 451)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 451)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 452)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 452)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 453)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 453)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 454)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 454)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 455)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 455)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 463)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 463)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 464)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 464)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 465)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 465)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 469)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 469)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 470)
>PGC-W-0155-Pointer value created from a nonlong integral 

Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-11 Thread Rolf vandeVaart
I talked with Jeremia off list and we figured out what was going on.  There is
the ability to use cuMemcpyAsync/cuStreamSynchronize rather than cuMemcpy, but it
was never made the default for the Open MPI 1.8 series.  So, to get that
behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior 
in 1.10 and all future versions.  In addition, he is right about not being able 
to see these variables in the Open MPI 1.8 series.  This was a bug and it has 
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that 
back into 1.10.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of a Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMempcy rather than the asynchornous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw support for async memcpy's are
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_c
>uda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#15 0x2bf4f322 in PMPI_Send () from
>/users/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmpi.so.1
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/08/27424.php


Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-12 Thread Rolf vandeVaart
Hi Geoff:

Our original implementation used cuMemcpy for copying GPU memory into and out 
of host memory.  However, what we learned is that cuMemcpy causes a 
synchronization with all outstanding work on the GPU.  This means that a running 
kernel could not be overlapped very well with communication.  So, now we create 
an internal stream and then use that along with cuMemcpyAsync/cuStreamSynchronize 
for doing the copy.
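
As an illustration only (a sketch of the mechanism rather than the actual Open 
MPI internals; the helper name, buffers, and length are placeholders):

  #include <cuda.h>

  /* Sketch: copy 'len' bytes of device memory into a (preferably page-locked)
   * host buffer without waiting for kernels queued on other streams.
   * Assumes cuInit() has been called and a context is current. */
  static void copy_d2h_on_private_stream(void *hostbuf, CUdeviceptr devbuf,
                                         size_t len)
  {
      CUstream copy_stream;

      cuStreamCreate(&copy_stream, 0);     /* internal stream for the copy  */

      /* cuMemcpy() would synchronize with ALL outstanding GPU work; the
       * async variant only orders against work queued on copy_stream.     */
      cuMemcpyDtoHAsync(hostbuf, devbuf, len, copy_stream);
      cuStreamSynchronize(copy_stream);    /* waits for this copy only      */

      cuStreamDestroy(copy_stream);
  }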

It turns out in Jeremia’s case, he wanted to have a long running kernel and he 
wanted the MPI_Send/MPI_Recv to happen at the same time.  With the use of 
cuMemcpy, the MPI library was waiting for his kernel to complete before doing 
the cuMemcpy.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Geoffrey Paulsen
Sent: Wednesday, August 12, 2015 12:55 PM
To: us...@open-mpi.org
Cc: us...@open-mpi.org; Sameh S Sharkawi
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

I'm confused why this application needs an asynchronous cuMemcpyAsync() in a 
blocking MPI call.  Rolf, could you please explain?

And how is a call to cuMemcpyAsync() followed by a synchronization any 
different from a cuMemcpy() in this use case?

I would still expect that if the MPI_Send / Recv call issued the 
cuMemcpyAsync(), it would be MPI's responsibility to issue the 
synchronization call as well.



---
Geoffrey Paulsen
Software Engineer, IBM Platform MPI
IBM Platform-MPI
Phone: 720-349-2832
Email: gpaul...@us.ibm.com
www.ibm.com


- Original message -
From: Rolf vandeVaart <rvandeva...@nvidia.com>
Sent by: "users" <users-boun...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Cc:
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
Date: Tue, Aug 11, 2015 1:45 PM

I talked with Jeremia off list and we figured out what was going on.  There is 
the ability to use the cuMemcpyAsync/cuStreamSynchronize rather than the 
cuMemcpy but it was never made the default for Open MPI 1.8 series.  So, to get 
that behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior 
in 1.10 and all future versions.  In addition, he is right about not being able 
to see these variables in the Open MPI 1.8 series.  This was a bug and it has 
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that 
back into 1.10.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of a Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMemcpy rather than the asynchronous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw support for async memcpy's are
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_c
>uda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/op

Re: [OMPI users] cuda aware mpi

2015-08-21 Thread Rolf vandeVaart
No, it is not.  You have to use pml ob1, which will pull in the smcuda and 
openib BTLs that have CUDA-aware support built into them.
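
For example, forcing that selection explicitly (the application name is only a 
placeholder):

  mpirun -np 2 --mca pml ob1 --mca btl smcuda,openib,self ./my_app
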
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Subhra Mazumdar
Sent: Friday, August 21, 2015 12:18 AM
To: Open MPI Users
Subject: [OMPI users] cuda aware mpi

Hi,

Is cuda aware mpi supported with pml yalla?

Thanks,
Subhra



Re: [OMPI users] Wrong distance calculations in multi-rail setup?

2015-08-28 Thread Rolf vandeVaart
I am not sure why the distances are being computed as you are seeing. I do not 
have a dual rail card system to reproduce with. However, short term, I think 
you could get what you want by running like the following.  The first argument 
tells the selection logic to ignore locality, so both cards will be available 
to all ranks.  Then, using the application specific notation you can pick the 
exact port for each rank.

Something like:
 mpirun -gmca btl_openib_ignore_locality -np 1 --mca btl_openib_if_include 
mlx4_0:1 a.out : -np 1 --mca btl_openib_if_include mlx4_0:2 a.out : -np 1 --mca 
btl_openib_if_include mlx4_1:1 a.out : --mca btl_openib_if_include mlx4_1:2 
a.out 

Kind of messy, but that is the general idea.

Rolf
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>marcin.krotkiewski
>Sent: Friday, August 28, 2015 10:49 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Wrong distance calculations in multi-rail setup?
>
>I have a 4-socket machine with two dual-port Infiniband cards (devices
>mlx4_0 and mlx4_1). The cards are conneted to PCI slots of different CPUs (I
>hope..), both ports are active on both cards, everything connected to the
>same physical network.
>
>I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks
>bound to the 4 sockets, hoping to use both IB cards (and both ports):
>
> mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self --mca
>btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv
>
>but OpenMPI refuses to use the mlx4_1 device
>
> [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far
>away
> [ the same for other ranks ]
>
>This is confusing, since I have read that OpenMPI automatically uses a closer
>HCA, so at least some (>=one) rank should choose mlx4_1. I use binding by
>socket, here is the reported map:
>
> [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
>[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././.
>/./.]
> [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
>[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././.
>/./.]
> [node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]:
>[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././.
>/./.]
> [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
>[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././.
>/./.]
>
>To check what's going on I have modified btl_openib_component.c to print
>the computed distances.
>
> opal_output_verbose(1, ompi_btl_base_framework.framework_output,
> "[rank=%d] openib: device %d/%d distance %lf",
>ORTE_PROC_MY_NAME->vpid,
> (int)i, (int)num_devs, 
> (double)dev_sorted[i].distance);
>
>Here is what I get:
>
> [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
> [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
> [node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
> [node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
> [node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
> [node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
> [node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
> [node1.local:28268] [rank=3] openib: device 1/2 distance 2.10
>
>So the computed distance for mlx4_0 is 0 on all ranks. I believe this should 
>not
>be so. The distance should be smaller on 1 rank and larger for 3 others, as is
>the case for mlx4_1. Looks like a bug?
>
>Another question is, In my configuration two ranks will have a 'closer'
>IB card, but two others will not. Since the correct distance to both devices 
>will
>likely be equal, which device will they choose, if they do that automatically? 
>I'd
>rather they didn't both choose mlx4_0.. I guess it would be nice if I could by
>hand specify the device/port, which should be used by a given MPI rank. Is
>this (going to be) possible with OpenMPI?
>
>Thanks a lot,
>
>Marcin
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/08/27503.php


Re: [OMPI users] Wrong distance calculations in multi-rail setup?

2015-08-28 Thread Rolf vandeVaart
Let me send you a patch off list that will print out some extra information to 
see if we can figure out where things are going wrong.
We basically depend on the information reported by hwloc so the patch will 
print out some extra information to see if we are getting good data from hwloc.

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Marcin
>Krotkiewski
>Sent: Friday, August 28, 2015 12:13 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] Wrong distance calculations in multi-rail setup?
>
>
>Brilliant! Thank you, Rolf. This works: all ranks have reported using the
>expected port number, and performance is twice of what I was observing
>before :)
>
>I can certainly live with this workaround, but I will be happy to do some
>debugging to find the problem. If you tell me what is needed / where I can
>look, I could help to find the issue.
>
>Thanks a lot!
>
>Marcin
>
>
>On 08/28/2015 05:28 PM, Rolf vandeVaart wrote:
>> I am not sure why the distances are being computed as you are seeing. I do
>not have a dual rail card system to reproduce with. However, short term, I
>think you could get what you want by running like the following.  The first
>argument tells the selection logic to ignore locality, so both cards will be
>available to all ranks.  Then, using the application specific notation you can 
>pick
>the exact port for each rank.
>>
>> Something like:
>>   mpirun -gmca btl_openib_ignore_locality -np 1 --mca
>> btl_openib_if_include mlx4_0:1 a.out : -np 1 --mca
>> btl_openib_if_include mlx4_0:2 a.out : -np 1 --mca
>> btl_openib_if_include mlx4_1:1 a.out : --mca btl_openib_if_include
>> mlx4_1:2 a.out
>>
>> Kind of messy, but that is the general idea.
>>
>> Rolf
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>>> marcin.krotkiewski
>>> Sent: Friday, August 28, 2015 10:49 AM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] Wrong distance calculations in multi-rail setup?
>>>
>>> I have a 4-socket machine with two dual-port Infiniband cards
>>> (devices
>>> mlx4_0 and mlx4_1). The cards are connected to PCI slots of different
>>> CPUs (I hope..), both ports are active on both cards, everything
>>> connected to the same physical network.
>>>
>>> I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks
>>> bound to the 4 sockets, hoping to use both IB cards (and both ports):
>>>
>>>  mpirun --map-by socket --bind-to core -np 4 --mca btl
>>> openib,self --mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1
>>> SendRecv
>>>
>>> but OpenMPI refuses to use the mlx4_1 device
>>>
>>>  [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it
>>> is too far away
>>>  [ the same for other ranks ]
>>>
>>> This is confusing, since I have read that OpenMPI automatically uses
>>> a closer HCA, so at least some (>=one) rank should choose mlx4_1. I
>>> use binding by socket, here is the reported map:
>>>
>>>  [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
>>>
>[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././.
>>> /./.]
>>>  [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
>>>
>[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././.
>>> /./.]
>>>  [node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]:
>>>
>[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././.
>>> /./.]
>>>  [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
>>>
>[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././.
>>> /./.]
>>>
>>> To check what's going on I have modified btl_openib_component.c to
>>> print the computed distances.
>>>
>>>  opal_output_verbose(1,
>ompi_btl_base_framework.framework_output,
>>>  "[rank=%d] openib: device %d/%d distance
>>> %lf", ORTE_PROC_MY_NAME->vpid,
>>>  (int)i, (int)num_devs,
>>> (double)dev_sorted[i].distance);
>>>
>>> Here is what I get:
>>>
>>>  [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
>>>  [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
>

Re: [OMPI users] tracking down what's causing a cuIpcOpenMemHandle error emitted by OpenMPI

2015-09-03 Thread Rolf vandeVaart
Lev:
Can you run with --mca mpi_common_cuda_verbose 100 --mca mpool_rgpusm_verbose 
100 and send me (rvandeva...@nvidia.com) the output of that.
Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Wednesday, September 02, 2015 7:15 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] tracking down what's causing a cuIpcOpenMemHandle
>error emitted by OpenMPI
>
>I recently noticed the following error when running a Python program I'm
>developing that repeatedly performs GPU-to-GPU data transfers via
>OpenMPI:
>
>The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>cannot be used.
>  cuIpcGetMemHandle return value:   1
>  address: 0x602e75000
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>
>The system is running Ubuntu 14.04.3 and contains several Tesla S2050 GPUs.
>I'm using the following software:
>
>- Linux kernel 3.19.0 (backported to Ubuntu 14.04.3 from 15.04)
>- CUDA 7.0 (installed via NVIDIA's deb packages)
>- NVIDIA kernel driver 346.82
>- OpenMPI 1.10.0 (manually compiled with CUDA support)
>- Python 2.7.10
>- pycuda 2015.1.3 (manually compiled against CUDA 7.0)
>- mpi4py (manually compiled git revision 1d8ab22)
>
>OpenMPI, Python, pycuda, and mpi4py are all locally installed in a conda
>environment.
>
>Judging from my program's logs, the error pops up during one of the
>program's first few iterations. The error isn't fatal, however - the program
>continues running to completion after the message appears.  Running
>mpiexec with --mca plm_base_verbose 10 doesn't seem to produce any
>additional debug info of use in tracking this down.  I did notice, though, that
>there are undeleted cuda.shm.* files in /run/shm after the error message
>appears and my program exits. Deleting the files does not prevent the error
>from recurring if I subsequently rerun the program.
>
>Oddly, the above problem doesn't crop up when I run the same code on an
>Ubuntu
>14.04.3 system with the exact same software containing 2 non-Tesla GPUs
>(specifically, a GTX 470 and 750). The error seems to have started occurring
>over the past two weeks, but none of the changes I made to my code over
>that time seem to be related to the problem (i.e., running an older revision
>resulted in the same errors). I also tried running my code using older releases
>of OpenMPI (e.g., 1.8.5) and mpi4py (e.g., from about 4 weeks ago), but the
>error message still occurs. Both Ubuntu systems are 64-bit and have been
>kept up to date with the latest package updates.
>
>Any thoughts as to what could be causing the problem?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/09/27526.php


Re: [OMPI users] How does MPI_Allreduce work?

2015-09-25 Thread Rolf vandeVaart
Hello Yang:
It is not clear to me if you are asking about a CUDA-aware build of Open MPI 
where you do the MPI_Allreduce() on the GPU buffer, or if you are handling the 
staging of the GPU data into host memory yourself and then calling the 
MPI_Allreduce().  Either way, they are somewhat similar.  With CUDA-aware, the 
MPI_Allreduce() of GPU data simply first copies the data into a host buffer and 
then calls the underlying implementation.
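
For what it's worth, with a CUDA-aware build the call site itself can just pass 
device pointers; a minimal sketch (assuming Open MPI was configured with CUDA 
support; the sizes and device selection below are only placeholders):

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      const int n = 1024;
      float *d_local, *d_sum;

      /* Create the CUDA context before MPI_Init. */
      cudaSetDevice(0);
      cudaFree(0);

      MPI_Init(&argc, &argv);

      cudaMalloc((void **)&d_local, n * sizeof(float));
      cudaMalloc((void **)&d_sum,   n * sizeof(float));
      /* ... fill d_local on the GPU ... */

      /* The device pointers are passed directly; internally the data is
       * staged through host buffers to perform the reduction.          */
      MPI_Allreduce(d_local, d_sum, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

      cudaFree(d_local);
      cudaFree(d_sum);
      MPI_Finalize();
      return 0;
  }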

Depending on how you have configured your Open MPI, the underlying 
implementation may vary.  I would suggest you compile a debug version 
(--enable-debug) and then run some tests with --mca coll_base_verbose 100 which 
will give you some insight into what is actually happening under the covers.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang Zhang
>Sent: Thursday, September 24, 2015 11:41 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] How does MPI_Allreduce work?
>
>Hello OpenMPI users,
>
>Is there any document on MPI_Allreduce() implementation? I’m using it to do
>summation on GPU data. I wonder if OpenMPI will first do summation on
>processes in the same node, and then do summation on the intermediate
>results across nodes. This would be preferable since it reduces cross node
>communication and should be faster?
>
>I’m using OpenMPI 1.10.0 and CUDA 7.0. I need to sum 40 million float
>numbers on 6 nodes, each node running 4 processes. The nodes are
>connected via InfiniBand.
>
>Thanks very much!
>
>Best,
>Yang
>
>
>
>Sent by Apple Mail
>
>Yang ZHANG
>
>PhD candidate
>
>Networking and Wide-Area Systems Group
>Computer Science Department
>New York University
>
>715 Broadway Room 705
>New York, NY 10003
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/09/27675.php



Re: [OMPI users] How does MPI_Allreduce work?

2015-09-25 Thread Rolf vandeVaart
In the case of reductions, yes, we copy into host memory so we can do the 
reduction.  For other collectives or point to point communication, then GPU 
Direct RDMA will be used (for smaller messages).

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang Zhang
>Sent: Friday, September 25, 2015 11:37 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] How does MPI_Allreduce work?
>
>Hi Rolf,
>
>Thanks very much for the info! So with CUDA-aware build, OpenMPI still have
>to copy all the data first into host memory, and then do send/recv on the host
>memory? I thought OpenMPI would use GPUdirect and RDMA to send/recv
>GPU memory directly.
>
>I will try a debug build and see what does it say. Thanks!
>
>Best,
>Yang
>
>
>
>Sent by Apple Mail
>
>Yang ZHANG
>
>PhD candidate
>
>Networking and Wide-Area Systems Group
>Computer Science Department
>New York University
>
>715 Broadway Room 705
>New York, NY 10003
>
>> On Sep 25, 2015, at 11:07 AM, Rolf vandeVaart 
>wrote:
>>
>> Hello Yang:
>> It is not clear to me if you are asking about a CUDA-aware build of Open MPI
>where you do the MPI_Allreduce() on the GPU buffer, or if you are handling the
>staging of the GPU data into host memory yourself and then calling the MPI_Allreduce().
>Either way, they are somewhat similar.  With CUDA-aware, the
>MPI_Allreduce() of GPU data simply first copies the data into a host buffer
>and then calls the underlying implementation.
>>
>> Depending on how you have configured your Open MPI, the underlying
>implementation may vary.  I would suggest you compile a debug version (--
>enable-debug) and then run some tests with --mca coll_base_verbose 100
>which will give you some insight into what is actually happening under the
>covers.
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang
>>> Zhang
>>> Sent: Thursday, September 24, 2015 11:41 PM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] How does MPI_Allreduce work?
>>>
>>> Hello OpenMPI users,
>>>
>>> Is there any document on MPI_Allreduce() implementation? I’m using it
>>> to do summation on GPU data. I wonder if OpenMPI will first do
>>> summation on processes in the same node, and then do summation on the
>>> intermediate results across nodes. This would be preferable since it
>>> reduces cross node communication and should be faster?
>>>
>>> I’m using OpenMPI 1.10.0 and CUDA 7.0. I need to sum 40 million float
>>> numbers on 6 nodes, each node running 4 processes. The nodes are
>>> connected via InfiniBand.
>>>
>>> Thanks very much!
>>>
>>> Best,
>>> Yang
>>>
>>> -
>>> ---
>>>
>>> Sent by Apple Mail
>>>
>>> Yang ZHANG
>>>
>>> PhD candidate
>>>
>>> Networking and Wide-Area Systems Group Computer Science Department
>>> New York University
>>>
>>> 715 Broadway Room 705
>>> New York, NY 10003
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-
>>> mpi.org/community/lists/users/2015/09/27675.php
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27678.php
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/09/27679.php


Re: [OMPI users] [Rocks-Discuss] compiling Openmpi on solaris studio express

2010-11-29 Thread Rolf vandeVaart
This problem looks a lot like a thread from earlier today.  Can you look at this 
ticket and see if it helps?  It has a workaround documented in it.

https://svn.open-mpi.org/trac/ompi/ticket/2632

Rolf

On 11/29/10 16:13, Prentice Bisbal wrote:

No, it looks like ld is being called with the option -path, and your
linker doesn't use that switch. Grep your Makefile(s) for the string
"-path". It's probably in a statement defining LDFLAGS somewhere.

When you find it, replace it with the equivalent switch for your
compiler. You may be able to override its value on the configure
command-line, which is usually easiest/best:

./configure LDFLAGS="-notpath ... ... ..."

--
Prentice


Nehemiah Dacres wrote:
  

it may have been that  I didn't set ld_library_path

On Mon, Nov 29, 2010 at 2:36 PM, Nehemiah Dacres <dacre...@slu.edu> wrote:

thank you, you have been doubly helpful, but I am having linking
errors and I do not know what the solaris studio compiler's
preferred linker is. The

the configure statement was

./configure --prefix=/state/partition1/apps/sunmpi/
--enable-mpi-threads --with-sge --enable-static
--enable-sparse-groups CC=/opt/oracle/solstudio12.2/bin/suncc
CXX=/opt/oracle/solstudio12.2/bin/sunCC
F77=/opt/oracle/solstudio12.2/bin/sunf77
FC=/opt/oracle/solstudio12.2/bin/sunf90

   compile statement was

make all install 2>errors


error below is

f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -soname passed to ld, if ld is invoked, ignored
otherwise
/usr/bin/ld: unrecognized option '-path'
/usr/bin/ld: use the --help option for usage information
make[4]: *** [libmpi_f90.la] Error 2
make[3]: *** [all-recursive] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

am I doing this wrong? are any of those configure flags unnecessary
or inappropriate



On Mon, Nov 29, 2010 at 2:06 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

Nehemiah Dacres wrote:

I want to compile openmpi to work with the solaris studio
express  or
solaris studio. This is a different version than is installed on
rockscluster 5.2  and would like to know if there any
gotchas or configure
flags I should use to get it working or portable to nodes on
the cluster.
Software-wise,  it is a fairly homogeneous environment with
only slight
variations on the hardware side which could be isolated
(machinefile flag
and what-not)
Please advise


Hi Nehemiah
I just answered your email to the OpenMPI list.
I want to add that if you build OpenMPI with Torque support,
the machine file for each is not needed, it is provided by Torque.
I believe the same is true for SGE (but I don't use SGE).
Gus Correa




-- 
Nehemiah I. Dacres
System Administrator 
Advanced Technology Group Saint Louis University





--
Nehemiah I. Dacres
System Administrator 
Advanced Technology Group Saint Louis University





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




Re: [OMPI users] [Rocks-Discuss] compiling Openmpi on solaris studio express

2010-11-29 Thread Rolf vandeVaart
No, I do not believe so.  First, I assume you are trying to build either 
1.4 or 1.5, not the trunk.
Secondly, I assume you are building from a tarfile that you have 
downloaded.  Assuming these
two things are true, then (as stated in the bug report), prior to 
running configure, you want to
make the following edits to config/libtool.m4 in all the places you see 
it. ( I think just one place)


FROM:

   *Sun\ F*)
 # Sun Fortran 8.3 passes all unrecognized flags to the linker
 _LT_TAGVAR(lt_prog_compiler_pic, $1)='-KPIC'
 _LT_TAGVAR(lt_prog_compiler_static, $1)='-Bstatic'
 _LT_TAGVAR(lt_prog_compiler_wl, $1)=''
 ;;

TO:

   *Sun\ F*)
 # Sun Fortran 8.3 passes all unrecognized flags to the linker
 _LT_TAGVAR(lt_prog_compiler_pic, $1)='-KPIC'
 _LT_TAGVAR(lt_prog_compiler_static, $1)='-Bstatic'
 _LT_TAGVAR(lt_prog_compiler_wl, $1)='-Qoption ld '
 ;;



Note the difference in the lt_prog_compiler_wl line. 

Then, you need to run ./autogen.sh.  Then, redo your configure but you 
do not need to do anything
with LDFLAGS.  Just use your original flags.  I think this should work, 
but I am only reading

what is in the ticket.
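
Put together, the rebuild sequence would look roughly like this (a sketch only, 
reusing the configure flags from your earlier message; running autogen.sh 
requires the GNU autotools to be installed):

  # after editing config/libtool.m4 as shown above
  ./autogen.sh
  ./configure --prefix=/state/partition1/apps/sunmpi/ \
      --enable-mpi-threads --with-sge --enable-static --enable-sparse-groups \
      CC=/opt/oracle/solstudio12.2/bin/suncc CXX=/opt/oracle/solstudio12.2/bin/sunCC \
      F77=/opt/oracle/solstudio12.2/bin/sunf77 FC=/opt/oracle/solstudio12.2/bin/sunf90
  make all install 2>errors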

Rolf


On 11/29/10 16:26, Nehemiah Dacres wrote:

that looks about right. So the suggestion:

./configure LDFLAGS="-notpath ... ... ..."

-notpath should be replaced by whatever the proper flag should be, in my case -L ? 

  

On Mon, Nov 29, 2010 at 3:16 PM, Rolf vandeVaart 
<rolf.vandeva...@oracle.com> wrote:


This problem looks a lot like a thread from earlier today.  Can
you look at this
ticket and see if it helps?  It has a workaround documented in it.

https://svn.open-mpi.org/trac/ompi/ticket/2632

Rolf


On 11/29/10 16:13, Prentice Bisbal wrote:

No, it looks like ld is being called with the option -path, and your
linker doesn't use that switch. Grep your Makefile(s) for the string
"-path". It's probably in a statement defining LDFLAGS somewhere.

When you find it, replace it with the equivalent switch for your
compiler. You may be able to override its value on the configure
command-line, which is usually easiest/best:

./configure LDFLAGS="-notpath ... ... ..."

--
Prentice


Nehemiah Dacres wrote:
  

it may have been that  I didn't set ld_library_path

On Mon, Nov 29, 2010 at 2:36 PM, Nehemiah Dacres <dacre...@slu.edu> wrote:

thank you, you have been doubly helpful, but I am having linking
errors and I do not know what the solaris studio compiler's
preferred linker is. The

the configure statement was

./configure --prefix=/state/partition1/apps/sunmpi/
--enable-mpi-threads --with-sge --enable-static
--enable-sparse-groups CC=/opt/oracle/solstudio12.2/bin/suncc
CXX=/opt/oracle/solstudio12.2/bin/sunCC
F77=/opt/oracle/solstudio12.2/bin/sunf77
FC=/opt/oracle/solstudio12.2/bin/sunf90

   compile statement was

make all install 2>errors


error below is

f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -soname passed to ld, if ld is invoked, ignored
otherwise
/usr/bin/ld: unrecognized option '-path'
/usr/bin/ld: use the --help option for usage information
make[4]: *** [libmpi_f90.la] Error 2
make[3]: *** [all-recursive] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

am I doing this wrong? are any of those configure flags unnecessary
or inappropriate



On Mon, Nov 29, 2010 at 2:06 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

Nehemiah Dacres wrote:

I want to compile openmpi to work with the solaris studio
express  or
solaris studio. This is a different version than is installed on
rockscluster 5.2  and would like to know if there any
gotchas or configure
flags I should use to get it working or portable to nodes on
the cluster.
Software-wise,  it is a fairly homogeneous environment with
only slight
variations on the hardware side which could be isol

Re: [OMPI users] One-sided datatype errors

2010-12-14 Thread Rolf vandeVaart

Hi James:
I can reproduce the problem on a single node with Open MPI 1.5 and the 
trunk.  I have submitted a ticket with

the information.

https://svn.open-mpi.org/trac/ompi/ticket/2656

Rolf

On 12/13/10 18:44, James Dinan wrote:

Hi,

I'm getting strange behavior using datatypes in a one-sided 
MPI_Accumulate operation.


The attached example performs an accumulate into a patch of a shared 
2d matrix.  It uses indexed datatypes and can be built with 
displacement or absolute indices (hindexed) - both cases fail.  I'm 
seeing data validation errors, hanging, or other erroneous behavior 
under OpenMPI 1.5 on Infiniband.  The example works correctly under 
the current release of MVAPICH on IB and under MPICH on shared memory.


Any help would be greatly appreciated.

Best,
 ~Jim.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart
Hi Brice:
Yes, I have tried OMPI 1.5 with gpudirect and it worked for me.  You definitely 
need the patch or you will see the behavior just as you described, a hang. One 
thing you could try is disabling the large message RDMA in OMPI and see if that 
works.  That can be done by adjusting the openib BTL flags.

--mca btl_openib_flags 304
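
For example (the application name is only a placeholder):

  mpirun -np 2 --mca btl_openib_flags 304 ./gpudirect_pingpong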

Rolf 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Brice Goglin
Sent: Monday, February 28, 2011 11:16 AM
To: us...@open-mpi.org
Subject: [OMPI users] anybody tried OMPI with gpudirect?

Hello,

I am trying to play with nvidia's gpudirect. The test program given with the 
gpudirect tarball just does a basic MPI ping-pong between two process that 
allocated their buffers with cudaHostMalloc instead of malloc. It seems to work 
with Intel MPI but Open MPI 1.5 hangs in the first MPI_Send. Replacing the cuda 
buffer with a normally-malloc'ed buffer makes the program work again. I assume 
that something goes wrong when OMPI tries to register/pin the cuda buffer in 
the IB stack (that's what gpudirect seems to be about), but I don't see why 
Intel MPI would succeed there.

Has anybody ever looked at this?

FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 4.0.0.28 and SLES11 w/ and 
w/o the gpudirect patch.

Thanks
Brice Goglin

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart

For the GPU Direct to work with Infiniband, you need to get some updated OFED 
bits from your Infiniband vendor. 

In terms of checking the driver updates, you can do a grep on the string 
get_driver_pages in the file /proc/kallsyms.  If it is there, then the Linux 
kernel is updated correctly.
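
That is, something along the lines of:

  grep get_driver_pages /proc/kallsyms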

The GPU Direct functioning should be independent of the MPI you are using.

Rolf  


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Brice Goglin
Sent: Monday, February 28, 2011 11:42 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] anybody tried OMPI with gpudirect?

On 28/02/2011 17:30, Rolf vandeVaart wrote:
> Hi Brice:
> Yes, I have tired OMPI 1.5 with gpudirect and it worked for me.  You 
> definitely need the patch or you will see the behavior just as you described, 
> a hang. One thing you could try is disabling the large message RDMA in OMPI 
> and see if that works.  That can be done by adjusting the openib BTL flags.
>
> --mca btl_openib_flags 304
>
> Rolf 
>   

Thanks Rolf. Adding this mca parameter worked-around the hang indeed.
The kernel is supposed to be properly patched for gpudirect. Are you
aware of anything else we might need to make this work? Do we need to
rebuild some OFED kernel modules for instance?

Also, is there any reliable/easy way to check if gpudirect works in our
kernel ? (we had to manually fix the gpudirect patch for SLES11).

Brice

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Brice Goglin
Sent: Monday, February 28, 2011 2:14 PM
To: Open MPI Users
Subject: Re: [OMPI users] anybody tried OMPI with gpudirect? 

On 28/02/2011 19:49, Rolf vandeVaart wrote:
> For the GPU Direct to work with Infiniband, you need to get some updated OFED 
> bits from your Infiniband vendor. 
>
> In terms of checking the driver updates, you can do a grep on the string 
> get_driver_pages in the file/proc/kallsyms.  If it is there, then the Linux 
> kernel is updated correctly.
>   

The kernel looks ok then. But I couldn't find any kernel modules (tried 
nvidia.ko and all ib modules) which references this symbol. So I guess my OFED 
kernel modules aren't ok. I'll check on Mellanox website (we have some very 
recent Mellanox ConnectX QDR boards).

thanks
Brice

--
I have since learned that you can check /sys/module/ib_core/parameters/*  which 
will list a couple of GPU direct files if the driver is installed correctly and 
loaded.
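
For example:

  ls /sys/module/ib_core/parameters/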

Rolf




Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-06 Thread Rolf vandeVaart
Hi Fengguang:

That is odd that you see the problem even when running with the openib flags 
set as Brice indicated.  Just to be extra sure there are no typo errors in your 
flag settings, maybe you can verify with the ompi_info command like this?

ompi_info -mca btl_openib_flags 304 -param btl openib | grep btl_openib_flags

When running with the 304 setting, then all communications travel through a 
regular send/receive protocol on IB.  The message is broken up into a 12K 
fragment, followed by however many 64K fragments it takes to move the message.
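
If you want to confirm those sizes on your own build, they presumably correspond 
to the openib BTL eager/max-send limits, which can be inspected with something 
like:

  ompi_info -param btl openib | egrep "eager_limit|max_send_size"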

I will try and find the time to reproduce the other 1 Mbyte issue that Brice 
reported.

Rolf



PS: Not sure if you are interested, but in the trunk, you can configure in 
support so that you can send and receive GPU buffers directly.  There are still 
many performance issues to be worked out, but just thought I would mention it.


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Fengguang Song
Sent: Sunday, June 05, 2011 9:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Program hangs when using OpenMPI and CUDA

Hi Brice,

Thank you! I saw your previous discussion and actually have tried "--mca 
btl_openib_flags 304".
It didn't solve the problem unfortunately. In our case, the MPI buffer is 
different from the cudaMemcpy buffer and we do manually copy between them. I'm 
still trying to figure out how to configure OpenMPI's mca parameters to solve 
the problem...

Thanks,
Fengguang


On Jun 5, 2011, at 2:20 AM, Brice Goglin wrote:

> On 05/06/2011 00:15, Fengguang Song wrote:
>> Hi,
>> 
>> I'm confronting a problem when using OpenMPI 1.5.1 on a GPU cluster. 
>> My program uses MPI to exchange data between nodes, and uses cudaMemcpyAsync 
>> to exchange data between Host and GPU devices within a node.
>> When the MPI message size is less than 1MB, everything works fine. 
>> However, when the message size is > 1MB, the program hangs (i.e., an MPI 
>> send never reaches its destination based on my trace).
>> 
>> The issue may be related to locked-memory contention between OpenMPI and 
>> CUDA.
>> Does anyone have the experience to solve the problem? Which MCA 
>> parameters should I tune to increase the message size to be > 1MB (to avoid 
>> the program hang)? Any help would be appreciated.
>> 
>> Thanks,
>> Fengguang
> 
> Hello,
> 
> I may have seen the same problem when testing GPU direct. Do you use 
> the same host buffer for copying from/to GPU and for sending/receiving 
> on the network ? If so, you need a GPUDirect enabled kernel and 
> mellanox drivers, but it only helps before 1MB.
> 
> You can work around the problem with one of the following solution:
> * add --mca btl_openib_flags 304 to force OMPI to always send/recv 
> through an intermediate (internal buffer), but it'll decrease 
> performance before 1MB too
> * use different host buffers for the GPU and the network and manually 
> copy between them
> 
> I never got any reply from NVIDIA/Mellanox/here when I reported this 
> problem with GPUDirect and messages larger than 1MB.
> http://www.open-mpi.org/community/lists/users/2011/03/15823.php
> 
> Brice
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Rolf vandeVaart

>> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't
>happen until the third iteration. I take that to mean that the basic
>communication works, but that something is saturating. Is there some notion
>of buffer size somewhere in the MPI system that could explain this?
>
>Hmm.  This is not a good sign; it somewhat indicates a problem with your OS.
>Based on this email and your prior emails, I'm guessing you're using TCP for
>communication, and that the problem is based on inter-node communication
>(e.g., the problem would occur even if you only run 1 process per machine,
>but does not occur if you run all N processes on a single machine, per your #4,
>below).
>

I agree with Jeff here.  Open MPI establishes connections lazily and round-robins 
through the available interfaces.  So, the first few communications could work 
because they happen to use interfaces that can reach the other node, but the 
third iteration uses an interface that for some reason cannot establish a 
connection.

One flag you can use that may help is --mca btl_base_verbose 20, like this:

mpirun --mca btl_base_verbose 20 connectivity_c

It will dump out a bunch of stuff, but there will be a few lines that look like 
this:

[...snip...]
[dt:09880] btl: tcp: attempting to connect() to [[58627,1],1] address 
10.20.14.101 on port 1025
[...snip...]

Rolf





Re: [OMPI users] gpudirect p2p?

2011-10-14 Thread Rolf vandeVaart
>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Chris Cooper
>Sent: Friday, October 14, 2011 1:28 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] gpudirect p2p?
>
>Hi,
>
>Are the recent peer to peer capabilities of cuda leveraged by Open MPI when
>eg you're running a rank per gpu on the one workstation?

Currently, no.  I am actively working on adding that capability. 

>
>It seems in my testing that I only get in the order of about 1GB/s as per
>http://www.open-mpi.org/community/lists/users/2011/03/15823.php,
>whereas nvidia's simpleP2P test indicates ~6 GB/s.
>
>Also, I ran into a problem just trying to test.  It seems you have to do
>cudaSetDevice/cuCtxCreate with the appropriate gpu id which I was wanting
>to derive from the rank.  You don't however know the rank until after
>MPI_Init() and you need to initialize cuda before.  Not sure if there's a
>standard way to do it?  I have a workaround atm.
>

The recommended way is to put the GPU in exclusive mode first.

#nvidia-smi -c 1

Then, have this kind of snippet at the beginning of the program. (this is driver
API, probably should use runtime API)

/* requires #include <cuda.h> and <stdlib.h> */
CUresult res;
CUdevice cuDev;
CUcontext ctx;
int device, cuDevCount;

/* Initialize the driver API before any other CUDA call */
res = cuInit(0);
if (CUDA_SUCCESS != res) {
    exit(1);
}

if (CUDA_SUCCESS != cuDeviceGetCount(&cuDevCount)) {
    exit(2);
}

/* With the GPUs in exclusive mode, walk the devices and keep the first
 * one on which a context can be created. */
for (device = 0; device < cuDevCount; device++) {
    if (CUDA_SUCCESS != (res = cuDeviceGet(&cuDev, device))) {
        exit(3);
    }
    if (CUDA_SUCCESS != cuCtxCreate(&ctx, 0, cuDev)) {
        /* Another process must have grabbed it.  Go to the next one. */
    } else {
        break;
    }
}



>Thanks,
>Chris
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] configure with cuda

2011-10-27 Thread Rolf vandeVaart
Actually, that is not quite right.  From the FAQ:

"This feature currently only exists in the trunk version of the Open MPI 
library."

You need to download and use the trunk version for this to work.

http://www.open-mpi.org/nightly/trunk/

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Thursday, October 27, 2011 11:43 AM
To: Open MPI Users
Subject: Re: [OMPI users] configure with cuda


I'm pretty sure cuda support was never moved to the 1.4 series. You will, 
however, find it in the 1.5 series. I suggest you get the latest tarball from 
there.


On Oct 27, 2011, at 12:38 PM, Peter Wells wrote:



I am attempting to configure OpenMPI 1.4.3 with cuda support on a Redhat 5 box. 
When I try to run configure with the following command:

 ./configure --prefix=/opt/crc/sandbox/pwells2/openmpi/1.4.3/intel-12.0-cuda/ 
FC=ifort F77=ifort CXX=icpc CC=icc --with-sge --disable-dlopen --enable-static 
--enable-shared --disable-openib-connectx-xrc --disable-openib-rdmacm 
--without-openib --with-cuda=/opt/crc/cuda/4.0/cuda 
--with-cuda-libdir=/opt/crc/cuda/4.0/cuda/lib64

I receive the warning that '--with-cuda' and '--with-cuda-libdir' are 
unrecognized options. According to the FAQ these options are supported in this 
version of OpenMPI. I attempted the same thing with v.1.4.4 downloaded directly 
from open-mpi.org with similar results. Attached are the 
results of configure and make on v.1.4.3. Any help would be greatly appreciated.

Peter Wells
HPC Intern
Center for Research Computing
University of Notre Dame
pwel...@nd.edu
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] How "CUDA Init prior to MPI_Init" co-exists with unique GPU for each MPI process?

2011-12-14 Thread Rolf vandeVaart
To add to this, yes, we recommend that the CUDA context exists prior to a call 
to MPI_Init.  That is because during MPI_Init the library attempts to register 
some internal buffers with the CUDA library, and that registration requires an 
already existing CUDA context.  Note that this is only relevant if you plan to 
send and receive CUDA device memory directly from MPI calls.  There is a little 
more about this in the FAQ here.

http://www.open-mpi.org/faq/?category=running#mpi-cuda-support
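
A minimal sketch of that ordering (this assumes the launcher exports 
OMPI_COMM_WORLD_LOCAL_RANK, which recent Open MPI releases do; the 
device-per-rank mapping is only an illustration):

  #include <stdlib.h>
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      int ndev = 1, local_rank = 0;
      char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");

      /* Pick a GPU and force its context into existence before MPI_Init. */
      if (lr != NULL) {
          local_rank = atoi(lr);
      }
      cudaGetDeviceCount(&ndev);
      cudaSetDevice(local_rank % ndev);
      cudaFree(0);                  /* creates the CUDA context */

      MPI_Init(&argc, &argv);       /* internal buffers can now be registered */
      /* ... MPI calls on device memory ... */
      MPI_Finalize();
      return 0;
  }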


Rolf

From: Matthieu Brucher [mailto:matthieu.bruc...@gmail.com]
Sent: Wednesday, December 14, 2011 10:47 AM
To: Open MPI Users
Cc: Rolf vandeVaart
Subject: Re: [OMPI users] How "CUDA Init prior to MPI_Init" co-exists with 
unique GPU for each MPI process?

Hi,

Processes are not spawned by MPI_Init. They are spawned before by some 
applications between your mpirun call and when your program starts. When it 
does, you already have all MPI processes (you can check by adding a sleep or 
something like that), but they are not synchronized and do not know each other. 
This is what MPI_Init is used for.

Matthieu Brucher
2011/12/14 Dmitry N. Mikushin <maemar...@gmail.com>
Dear colleagues,

For GPU Winter School powered by Moscow State University cluster
"Lomonosov", the OpenMPI 1.7 was built to test and popularize CUDA
capabilities of MPI. There is one strange warning I cannot understand:
OpenMPI runtime suggests to initialize CUDA prior to MPI_Init. Sorry,
but how could it be? I thought processes are spawned during MPI_Init,
and such context will be created on the very first root process. Why
do we need existing CUDA context before MPI_Init? I think there was no
such error in previous versions.

Thanks,
- D.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher



Re: [OMPI users] how this is possible?

2007-10-02 Thread Rolf . Vandevaart


My guess is you must have a mismatched MPI_Bcast somewhere
in the code.  Presumably, there is a call to MPI_Bcast on the head
node that broadcasts something larger than 1 MPI_INT and does not
have the matching call on the worker nodes.  Then, when the MPI_Bcast
on the worker nodes is called, they are matching up with the previous
call on the head node.
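
For illustration, a sketch of the kind of mismatch described (this is not the 
program attached to the original message):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, two_ints[2] = {0, 0}, one_int = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Head node broadcasts 2 ints with no matching call elsewhere. */
          MPI_Bcast(two_ints, 2, MPI_INT, 0, MPI_COMM_WORLD);
      }

      /* Every rank now posts a 1-int broadcast.  The workers match the
       * earlier 2-int broadcast from the root and fail with
       * MPI_ERR_TRUNCATE, while the root continues on. */
      MPI_Bcast(&one_int, 1, MPI_INT, 0, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }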

Since the head node is the root of the MPI_Bcast, then it can
blast its data to the other nodes and then continue on.  I have
attached a simple program that shows the same behavior.

And yes, the Send and Recv can work independently from the
Broadcast as they are using different tags to match up their data.

Rolf

PS: Simple program at end of message.

Marcin Skoczylas wrote:


Sorry I forgot to mention: Open MPI version 1.2.4


Marcin Skoczylas wrote:
 


Hello,

After whole day of coding I'm fighting little bit with one small 
fragment which seems strange for me.
For testing I have one head node and two worker nodes on localhost. 
Having this code (with debug stuff added like sleeps, barriers, etc):


void CImageData::SpreadToNodes()
{
   sleep(5);
   logger->debug("CImageData::SpreadToNodes, w=%d h=%d type=%d",
   this->width, this->height, this->type);
  
   logger->debug("head barrier");

   MPI_Barrier(MPI_COMM_WORLD);
   sleep(2);
   MPI_Barrier(MPI_COMM_WORLD);
  
   // debug 'sync' test

   logger->debug("head send SYNC str");
   char buf[5];
   buf[0] = 'S'; buf[1] = 'Y'; buf[2] = 'N'; buf[3] = 'C';
   for (int nodeId = 1; nodeId < g_NumProcesses; nodeId++)
   {   
   MPI_Send(buf, 4, MPI_CHAR, nodeId, DEF_MSG_TAG, MPI_COMM_WORLD);

   }
  
   logger->debug("head bcast width: %d", this->width);

   MPI_Bcast(&(this->width), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("head bcast height: %d", this->height);
   MPI_Bcast(&(this->height), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("head bcast type: %d", this->type);
   MPI_Bcast(&(this->type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
  
   logger->debug("head sleep 10s");

   sleep(10);
  
   logger->debug("finished CImageData::SpreadToNodes");   
}


// this function is decleared static:
CImageData *CImageData::ReceiveFromHead()
{
   sleep(5);

   logger->debug("CImageData::ReceiveFromHead");
   MPI_Status status;
   int _width;
   int _height;
   byte _type;
  
   logger->debug("worker barrier");

   MPI_Barrier(MPI_COMM_WORLD);
   sleep(2);
   MPI_Barrier(MPI_COMM_WORLD);

   char buf[5];
   MPI_Recv(buf, 4, MPI_CHAR, HEAD_NODE, DEF_MSG_TAG, MPI_COMM_WORLD, 
&status);
   logger->debug("worker received sync str: '%c' '%c' '%c' '%c'", 
buf[0], buf[1], buf[2], buf[3]);


   logger->debug("worker bcast width");
   MPI_Bcast(&(_width), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("worker bcast height");
   MPI_Bcast(&(_height), 1, MPI_INT, 0, MPI_COMM_WORLD);
   logger->debug("worker bcast type");
   MPI_Bcast(&(_type), 1, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
  
   logger->debug("width=%d height=%d type=%d", _width, _height, _type);


   // TODO: create CImageData object, return...
   return NULL;
}


That part of code gives me an error:
RANK 0 -> PID 17115
RANK 1 -> PID 17116
RANK 2 -> PID 17117

2007-10-02 19:50:37,829 [17115] DEBUG: CImageData::SpreadToNodes, w=768 
h=576 type=1

2007-10-02 19:50:37,829 [17117] DEBUG: CImageData::ReceiveFromHead
2007-10-02 19:50:37,829 [17115] DEBUG: head barrier
2007-10-02 19:50:37,829 [17116] DEBUG: CImageData::ReceiveFromHead
2007-10-02 19:50:37,829 [17116] DEBUG: worker barrier
2007-10-02 19:50:37,829 [17117] DEBUG: worker barrier
2007-10-02 19:50:39,836 [17115] DEBUG: head send SYNC str
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast width: 768
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast height: 576
2007-10-02 19:50:39,836 [17115] DEBUG: head bcast type: 1
2007-10-02 19:50:39,836 [17115] DEBUG: head sleep 10s
2007-10-02 19:50:39,836 [17116] DEBUG: worker received sync str: 'S' 'Y' 
'N' 'C'

2007-10-02 19:50:39,836 [17116] DEBUG: worker bcast width
[pc801:17116] *** An error occurred in MPI_Bcast
[pc801:17116] *** on communicator MPI_COMM_WORLD
[pc801:17116] *** MPI_ERR_TRUNCATE: message truncated
[pc801:17116] *** MPI_ERRORS_ARE_FATAL (goodbye)
2007-10-02 19:50:39,836 [17117] DEBUG: worker received sync str: 'S' 'Y' 
'N' 'C'

2007-10-02 19:50:39,836 [17117] DEBUG: worker bcast width
[pc801:17117] *** An error occurred in MPI_Bcast
[pc801:17117] *** on communicator MPI_COMM_WORLD
[pc801:17117] *** MPI_ERR_TRUNCATE: message truncated
[pc801:17117] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 17115 on node pc801 exited on 
signal 15 (Terminated).



Could it be that somewhere before this part the data stream was out of 
sync? The project is quite large and I have a lot of communication 
between processes before CImageData::SpreadToNodes() so whole debugging 
could take hours/days, however it seems that data flow before this 
particular fragment is ok. How could it be that MPI_Send/Re

Re: [OMPI users] large number of processes

2007-12-03 Thread Rolf vandeVaart


Hi:
I managed to run a 256 process job on a single node. I ran a simple test 
in which all processes send a message to all others.
This was using Sun's Binary Distribution of Open MPI on Solaris which is 
based on r16572 of the 1.2 branch.  The machine had 8 cores.


burl-ct-v40z-0 49 =>/opt/SUNWhpc/HPC7.1/bin/mpirun --mca 
mpool_sm_max_size 2147483647 -np 256 connectivity_c

Connectivity test on 256 processes PASSED.
burl-ct-v40z-0 50 =>
burl-ct-v40z-0 50 =>/opt/SUNWhpc/HPC7.1/bin/mpirun --mca 
mpool_sm_max_size 2147483647 -np 300 connectivity_c -v

Connectivity test on 300 processes PASSED.

burl-ct-v40z-0 54 =>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize 0 kbytes
vmemoryuse unlimited
descriptors 65536
burl-ct-v40z-0 55 =>



hbtcx...@yahoo.co.jp wrote:

I installed Open MPI 1.2.4 in Red Hat Enterprise Linux 3. It worked
fine in normal usage. I tested executions with extremely many
processes.

$ mpiexec -n 128 --host node0 --mca btl_tcp_if_include eth0 --mca
mpool_sm_max_size 2147483647 ./cpi

$ mpiexec -n 256 --host node0,node1 --mca btl_tcp_if_include eth0 --mca
mpool_sm_max_size 2147483647 ./cpi

If I specified the mpiexec options like this, the execution succeeded
in both cases.

$ mpiexec -n 256 --host node0 --mca btl_tcp_if_include eth0 --mca
mpool_sm_max_size 2147483647 ./cpi

mpiexec noticed that job rank 0 with PID 0 on node node0 exited on
signal 15 (Terminated).
252 additional processes aborted (not shown)

$ mpiexec -n 512 --host node0,node1 --mca btl_tcp_if_include eth0 --mca
mpool_sm_max_size 2147483647 ./cpi

mpiexec noticed that job rank 0 with PID 0 on node node0 exited on
signal 15 (Terminated).
505 additional processes aborted (not shown)

If I increased the number of processes, the executions aborted.
Could I execute 256 processes per node using Open MPI?

We would like to execute as large a number of processes as possible even if
the performance becomes worse.
If we use MPICH, we can execute 256 processes per node,
but the performance of Open MPI may be better.
We understand we must increase the number of open files next because
the current implementation uses many sockets.

SUSUKITA, Ryutaro
Peta-scale System Interconnect Project
Fukuoka Industry, Science & Technology Foundation
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




Re: [OMPI users] how to select a specific network

2008-01-11 Thread Rolf Vandevaart


Hello:
Have you actually tried this and got it to work?  It did not work for me.

 burl-ct-v440-0 50 =>mpirun -host burl-ct-v440-0,burl-ct-v440-1 -np 1 
-mca btl self,sm,tcp -mca btl_tcp_if_include ce0 connectivity_c : -np 1 
-mca btl self,sm,tcp -mca btl_tcp_if_include ce0 connectivity_c

Connectivity test on 2 processes PASSED.
 burl-ct-v440-0 51 =>mpirun -host burl-ct-v440-0,burl-ct-v440-1 -np 1 
-mca btl self,sm,tcp -mca btl_tcp_if_include ibd0 connectivity_c : -np 1 
-mca btl self,sm,tcp -mca btl_tcp_if_include ibd0 connectivity_c

Connectivity test on 2 processes PASSED.
 burl-ct-v440-0 52 =>mpirun -host burl-ct-v440-0,burl-ct-v440-1 -np 1 
-mca btl self,sm,tcp -mca btl_tcp_if_include ce0 connectivity_c : -np 1 
-mca btl self,sm,tcp -mca btl_tcp_if_include ibd0 connectivity_c

--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
 burl-ct-v440-0 53 =>



Aurélien Bouteiller wrote:

Try something similar to this

mpirun -np 1 -mca btl self,tcp -mca btl_tcp_if_include en1 NetPIPE_3.6/ 
NPmpi : -np 1 -mca btl self,tcp -mca btl_tcp_if_include en0   
NetPIPE_3.6/NPmpi


You should then be able to specify a different if_include mask for you  
different processes.


Aurelien

Le 11 janv. 08 à 06:46, Lydia Heck a écrit :


I should have added that the two networks are not routable,
and that they are private class B.


On Fri, 11 Jan 2008, Lydia Heck wrote:


I have a setup which contains one set of machines
with one nge and one e1000g network and of machines
with two e1000g networks configured. I am planning a
large run where all these computers will be occupied
with one job and the mpi communication should only go
over one specific network which is configured over
e1000g0 on the first set of machines and on e1000g1 on the
second set. I cannot use - for obvious reasons to either
include all of e1000g or to exclude part of e1000g - if that is
possible.
So I have to exclude or include on the internet number range.

Is there an obvious flag - which I have not yet found - to tell
mpirun to use one specific network?

Lydia

--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Dr. Aurélien Bouteiller
Sr. Research Associate - Innovative Computing Laboratory
Suite 350, 1122 Volunteer Boulevard
Knox

Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread Rolf Vandevaart


This worked for me although I am not sure how extensive our 32/64 
interoperability support is.  I tested on Solaris using the TCP 
interconnect and a 1.2.5 version of Open MPI.  Also, we configure with 
the --enable-heterogeneous flag which may make a difference here.  Also 
this did not work for me over the sm btl.


By the way, can you run a simple /bin/hostname across the two nodes?


 burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o 
simple.32
 burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o 
simple.64
 burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca 
btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 
simple.32 : -host burl-ct-v20z-5 -np 3 simple.64

[burl-ct-v20z-4]I am #0/6 before the barrier
[burl-ct-v20z-5]I am #3/6 before the barrier
[burl-ct-v20z-5]I am #4/6 before the barrier
[burl-ct-v20z-4]I am #1/6 before the barrier
[burl-ct-v20z-4]I am #2/6 before the barrier
[burl-ct-v20z-5]I am #5/6 before the barrier
[burl-ct-v20z-5]I am #3/6 after the barrier
[burl-ct-v20z-4]I am #1/6 after the barrier
[burl-ct-v20z-5]I am #5/6 after the barrier
[burl-ct-v20z-5]I am #4/6 after the barrier
[burl-ct-v20z-4]I am #2/6 after the barrier
[burl-ct-v20z-4]I am #0/6 after the barrier
 burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
mpirun (Open MPI) 1.2.5r16572


Report bugs to http://www.open-mpi.org/community/help/
 burl-ct-v20z-4 65 =>


jody wrote:

i narrowed it down:
The majority of processes get stuck in MPI_Barrier.
My Test application looks like this:

#include 
#include 
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
int iResult = 0;
int iRank1;
int iNum1;

char sName[256];
gethostname(sName, 255);

MPI_Init(&iArgC, &apArgV);

MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
MPI_Barrier(MPI_COMM_WORLD);
printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

MPI_Finalize();

return iResult;
}


If i make this call:
mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in a xterm for each process)
Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize,
all other processes get stuck in PMPI_Barrier,
Process 1 (on aim-plankton) displays the message
   
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Process 2 on (aim-plankton) displays the same message twice.

Any ideas?

  Thanks Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody  wrote:

Hi
 Using a more realistic application than a simple "Hello, world"
 even the --host version doesn't work correctly
 Called this way

 mpirun -np 3 --host aim-plankton ./QHGLauncher
 --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt

 the application starts but seems to hang after a while.

 Running the application in gdb:

 mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
 --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
 -o bruzlopf -n 12
 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim

 i can see that the processes on aim-fanta4 have indeed gotten stuck
 after a few initial outputs,
 and the processes on aim-plankton all have a messsage:

 
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
 connect() failed with errno=113

 If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs
 as expected.

 BTW: i'm, using open MPI 1.2.2

 Thanks
  Jody


On Thu, Apr 10, 2008 at 12:40 PM, jody  wrote:
 > HI
 >  In my network i have some 32 bit machines and some 64 bit machines.
 >  With --host i successfully call my application:
 >   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
 >  -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
 >  (MPITest64 has the same code as MPITest, but was compiled on the 64 bit 
machine)
 >
 >  But when i use hostfiles:
 >   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
 >  -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
 >  all 6 processes are started on the 64 bit machine aim-fanta4.
 >
 >  hosts32:
 >aim-plankton slots=3
 >  hosts64
 >   aim-fanta4 slots
 >
 >  Is this a bug or a feature?  ;)
 >
 >  Jody
 >


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread Rolf Vandevaart


On a CentOS Linux box, I see the following:

> grep 113 /usr/include/asm-i386/errno.h
#define EHOSTUNREACH113 /* No route to host */

I have also seen folks do this to figure out the errno.

> perl -e 'die$!=113'
No route to host at -e line 1.
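
Or, a couple of lines of C will do the same thing (just a quick sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* print the C library's text for errno 113 */
    printf("errno 113: %s\n", strerror(113));
    return 0;
}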

I am not sure why this is happening, but you could also check the Open 
MPI User's Mailing List Archives where there are other examples of 
people running into this error.  A search of "113" had a few hits.


http://www.open-mpi.org/community/lists/users

Also, I assume you would see this problem with or without the 
MPI_Barrier if you add this parameter to your mpirun line:


--mca mpi_preconnect_all 1

The MPI_Barrier is causing the bad behavior because, by default, 
connections are set up lazily. Therefore, only when the MPI_Barrier 
call is made and we start communicating and establishing connections do 
we start seeing the communication problems.
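
For example, using the hosts and binaries from your earlier mails (untested):

mpirun --mca mpi_preconnect_all 1 -np 3 --host aim-plankton ./MPITest32 : -np 3 --host aim-fanta4 ./MPITest64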


Rolf

jody wrote:

Rolf,
I was able to run hostname on the two noes that way,
and also a simplified version of my testprogram (without a barrier)
works. Only MPI_Barrier shows bad behaviour.

Do you know what this message means?
[aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Does it give an idea what could be the problem?

Jody

On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
 wrote:

 This worked for me although I am not sure how extensive our 32/64
 interoperability support is.  I tested on Solaris using the TCP
 interconnect and a 1.2.5 version of Open MPI.  Also, we configure with
 the --enable-heterogeneous flag which may make a difference here.  Also
 this did not work for me over the sm btl.

 By the way, can you run a simple /bin/hostname across the two nodes?


  burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o
 simple.32
  burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o
 simple.64
  burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca
 btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3
 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
 [burl-ct-v20z-4]I am #0/6 before the barrier
 [burl-ct-v20z-5]I am #3/6 before the barrier
 [burl-ct-v20z-5]I am #4/6 before the barrier
 [burl-ct-v20z-4]I am #1/6 before the barrier
 [burl-ct-v20z-4]I am #2/6 before the barrier
 [burl-ct-v20z-5]I am #5/6 before the barrier
 [burl-ct-v20z-5]I am #3/6 after the barrier
 [burl-ct-v20z-4]I am #1/6 after the barrier
 [burl-ct-v20z-5]I am #5/6 after the barrier
 [burl-ct-v20z-5]I am #4/6 after the barrier
 [burl-ct-v20z-4]I am #2/6 after the barrier
 [burl-ct-v20z-4]I am #0/6 after the barrier
  burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V mpirun (Open
 MPI) 1.2.5r16572

 Report bugs to http://www.open-mpi.org/community/help/
  burl-ct-v20z-4 65 =>




 jody wrote:
 > i narrowed it down:
 > The majority of processes get stuck in MPI_Barrier.
 > My Test application looks like this:
 >
 > #include 
 > #include 
 > #include "mpi.h"
 >
 > int main(int iArgC, char *apArgV[]) {
 > int iResult = 0;
 > int iRank1;
 > int iNum1;
 >
 > char sName[256];
 > gethostname(sName, 255);
 >
 > MPI_Init(&iArgC, &apArgV);
 >
 > MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
 > MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
 >
 > printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
 > MPI_Barrier(MPI_COMM_WORLD);
 > printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
 >
 > MPI_Finalize();
 >
 > return iResult;
 > }
 >
 >
 > If i make this call:
 > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
 > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
 > ./run_gdb.sh ./MPITest64
 >
 > (run_gdb.sh is a script which starts gdb in a xterm for each process)
 > Process 0 (on aim-plankton) passes the barrier and gets stuck in 
PMPI_Finalize,
 > all other processes get stuck in PMPI_Barrier,
 > Process 1 (on aim-plankton) displays the message
 >
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
 > connect() failed with errno=113
 > Process 2 on (aim-plankton) displays the same message twice.
 >
 > Any ideas?
 >
 >   Thanks Jody
 >
 > On Thu, Apr 10, 2008 at 1:05 PM, jody  wrote:
 >> Hi
 >>  Using a more realistic application than a simple "Hello, world"
 >>  even the --host version doesn't work correctly
 >>  Called this way
 >>
 >>  mpirun -np 3 --host aim-plankton ./QHGLauncher
 >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 >>  ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
 >>
 >>  the application starts but seems to h

Re: [OMPI users] vprotocol pessimist

2008-07-15 Thread Rolf vandeVaart
And if you want to stop seeing it in the short term, you have at least 
two choices I know of.


At configure time, add this to your configure line.

--enable-mca-no-build=vprotocol

This will prevent that component from being built, and will eliminate 
the message. 

If it is in there, you can force the component to not be opened (and 
therefore eliminate the message) at run time with this mca parameter.


--mca pml ^v

Rolf

Tom Riddle wrote:
Thanks! I hadn't seen any adverse effects on anything I was running, 
but I thought I'd ask anyway. Tom


--- On *Tue, 7/15/08, Jeff Squyres //* wrote:

From: Jeff Squyres 
Subject: Re: [OMPI users] vprotocol pessimist
To: "Open MPI Users" 
Cc: rarebit...@yahoo.com
Date: Tuesday, July 15, 2008, 8:52 AM

It's also only on the developer trunk (and v1.3) branch; we have an  
open ticket:


  https://svn.open-mpi.org/trac/ompi/ticket/1328

I'll add you into the CC list so that you can see what happens as we  
fix it.



On Jul 15, 2008, at 11:49 AM, Aurélien Bouteiller wrote:

> This is a harmless error, related to
 fault tolerance component. If  
> you don't need FT, you can safely ignore it. It will disappear soon.

>
> Aurelien
>
>
> Le 15 juil. 08 à 11:22, Tom Riddle a écrit :
>
>> Hi,
>>
>> I wonder if anyone can shed some light on what exactly this is  
>> referring to? At the start of any mpirun program I get the following

>>
>> mca: base: component_find: unable to open vprotocol pessimist: file  
>> not found (ignored)

>>
>> I combed the FAQs and the forum archives but didn't see anything  
>> that jumped out. I am running the latest version from the openmpi  
>> trunk. TIA, Tom

>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
 ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres

Cisco Systems




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




Re: [OMPI users] How to cease the process triggered by OPENMPI

2008-07-28 Thread Rolf Vandevaart


One other option which should kill of processes and cleanup is the 
orte-clean command.  In your case, you could do the following:


mpirun -hostfile ~/hostfile --pernode orte-clean

There is a man page for it also.

Rolf

Brock Palen wrote:
You would be much better off to not use nohup, and then just kill the 
mpirun.


What I mean is a batch system 
(http://www.clusterresources.com/pages/products/torque-resource-manager.php). 
 Most batch systems have a launching system that lets you kill all the 
remote processes when you kill the job.  

Look at how MPI works.  When you are starting the way you are starting 
MPI (without a batch system) you are using either ssh or rsh to start the 
remote processes.  Once these are started, the user has no control over 
the remote processes. 

Try killing your mpirun, not your orted or pw.x.  You will be much 
happier with a batch system. 
Or make a script that ssh'es to every host in your hostfile and kills pw.x on all of them.
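Something along these lines would do it (a rough sketch; it assumes the hostnames
are in the first column of ~/hostfile):

for h in `awk '{print $1}' ~/hostfile`; do
    ssh $h "pkill pw.x"
done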


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu 
(734)936-1985



On Jul 27, 2008, at 2:04 PM, vega lew wrote:

Dear Brock Palen,

Thank you for your responding.

My linux is redhat enterprise 4. My compiler is 10.1.015 version of 
intel fortran and intel c.


You said 'when the job is killed all the children are also'

But I started my OPENMPI job using the nohup command to put the job 
background like this,

" nohup mpirun -hostfile ~/hostfile -np 64 pw.x < input > output  & ".

When I killed one of the processes named pw.x, the others didn't stop.
When I killed the process named orted, the pw.x process on the same 
node stopped immediately,

but the job in the other node were still running.

Do you think there is something wrong with my cluster or openmpi or 
the software named pw.x?


Is there a command for Open MPI to force all the processes to stop on the 
whole cluster or on a list of nodes? 


Vega Lew (weijia liu)
PH.D Candidate in Chemical Engineering
State Key Laboratory of Materials-oriented Chemical Engineering
College of Chemistry and Chemical Engineering
Nanjing University of Technology, 210009, Nanjing, Jiangsu, China


From: bro...@umich.edu 
Date: Sat, 26 Jul 2008 12:52:08 -0400
To: us...@open-mpi.org 
Subject: Re: [OMPI users] How to cease the process triggered by OPENMPI

Does the cluster your using use a batch system?  Like SLURM, PBS or other?

If so many have native ways to launch jobs that OMPI can use.  SO that 
when the job is killed all the children are also.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu 
(734)936-1985



On Jul 26, 2008, at 12:25 PM, vega lew wrote:

Dear all,

I have enjoyed the openmpi a couple of days. With the help of
openmpi I could run ESPRESSO efficiently.

I started the mpi-job by the openmpi command like this,

" nohup mpirun -hostfile ~/hostfile -np 64 pw.x < input > output &".

When I want to stop the job before it has finished, I find it is not easy
to stop all the processes manually. When I kill the process
on one node of the cluster, the processes on the other nodes are
still running. So I must ssh to every node, find the
process id and kill the process. If there are 100 processes or
more for one MPI job, the situation is even worse.

Is there a command for Open MPI to force all the processes to stop on
the whole cluster or on a list of nodes? 


vega

Vega Lew (weijia liu)
PH.D Candidate in Chemical Engineering
State Key Laboratory of Materials-oriented Chemical Engineering
College of Chemistry and Chemical Engineering
Nanjing University of Technology, 210009, Nanjing, Jiangsu, China

Explore the seven wonders of the world Learn more!

___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users




Get news, entertainment and everything you care about at Live.com. 
Check it out! 

___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] problem with alltoall with ppn=8

2008-08-18 Thread Rolf Vandevaart

Ashley Pittman wrote:

On Sat, 2008-08-16 at 08:03 -0400, Jeff Squyres wrote:
- large all to all operations are very stressful on the network, even  
if you have very low latency / high bandwidth networking such as DDR IB


- if you only have 1 IB HCA in a machine with 8 cores, the problem  
becomes even more difficult because all 8 of your MPI processes will  
be hammering the HCA with read and write requests; it's a simple I/O  
resource contention issue


That alone doesn't explain the sudden jump (drop) in performance
figures.

- there are several different algorithms in Open MPI for performing  
alltoall, but they were not tuned for ppn>4 (honestly, they were tuned  
for ppn=1, but they still usually work "well enough" for ppn<=4).  In  
Open MPI v1.3, we introduce the "hierarch" collective module, which  
should greatly help with ppn>4 kinds of scenarios for collectives  
(including, at least to some degree, all to all)


Is there a way to tell or influence which algorithm is used in the
current case?  Looking through the code I can see several but cannot see
how to tune the thresholds.

Ashley.




The answer is sort of.  By default, Open MPI uses precompiled thresholds 
to select different algorithms.


However, if you want to experiment with the different algorithms within 
the tuned component, you can tell Open MPI which one you want to use. 
This algorithm is then used for all calls to that collective.


For example, to tell it to use "pairwise alltoall", you would do this.

> mpirun -np 2 --mca coll_tuned_use_dynamic_rules 1 --mca 
coll_tuned_alltoall_algorithm 2 a.out


To see the different algorithms, you can look through the code or try 
and glean it from a call to ompi_info.


> ompi_info -all -mca coll_tuned_use_dynamic_rules 1 | grep alltoall

You can also create a file that can change the thresholds if you decide 
you want to change the precompiled ones.  I have only lightly tested 
that feature but it should work.


Rolf

--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Continuous poll/select using btl sm (svn 1.4a1r18899)

2008-08-29 Thread Rolf Vandevaart


I have submitted a ticket on this issue.

https://svn.open-mpi.org/trac/ompi/ticket/1468

Rolf

On 08/18/08 18:27, Mostyn Lewis wrote:

George,

I'm glad you changed the scheduling and my program seems to work.
Thank you.

However, to stress it a bit more I changed

#define NUM_ITERS 1000

to

#define NUM_ITERS 10

and it glues up at around ~30k

Please try it and see.

Regards,
Mostyn
Mostyn,

There was a problem with the SM BTL. Please upgrade to at least 19315 
and [hopefully] your application will run to completion.


  Thanks,
george.


On Jul 24, 2008, at 3:39 AM, Mostyn Lewis wrote:


Hello,

Using a very recent svn version (1.4a1r18899) I'm getting a 
non-terminating

condition if I use the sm btl with tcp,self or with openib,self.
The program is not finishing a reduce operation. It works if the sm btl
is left out.

Using 2 4 core nodes.
Program is:
- 


/*
* Copyright (c) 2001-2002 The Trustees of Indiana University.
* All rights reserved.
* Copyright (c) 1998-2001 University of Notre Dame.
* All rights reserved.
* Copyright (c) 1994-1998 The Ohio State University.
* All rights reserved.
*
* This file is part of the LAM/MPI software package.  For license
* information, see the LICENSE file in the top level directory of the
* LAM/MPI source distribution.
*
* $HEADER$
*
* $Id: cpi.c,v 1.4 2002/11/23 04:06:58 jsquyres Exp $
*
* Portions taken from the MPICH distribution example cpi.c.
*
* Example program to calculate the value of pi by integrating f(x) =
* 4 / (1 + x^2).
*/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <unistd.h>
#include <sys/types.h>
#include <mpi.h>


/* Constant for how many values we'll estimate */

#define NUM_ITERS 1000


/* Prototype the function that we'll use below. */

static double f(double);


int
main(int argc, char *argv[])
{
 int iter, rank, size, i;
 double PI25DT = 3.141592653589793238462643;
 double mypi, pi, h, sum, x;
 double startwtime = 0.0, endwtime;
 int namelen;
 char processor_name[MPI_MAX_PROCESSOR_NAME];
 pid_t pid;
 pid = getpid();
 char foo[200];

 /* Normal MPI startup */

 MPI_Init(&argc, &argv);
 MPI_Comm_size(MPI_COMM_WORLD, &size);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 MPI_Get_processor_name(processor_name, &namelen);

/*  if(rank == 7){ */
 printf("Process %d of %d on %s\n", rank, size, processor_name);
 /* system("set"); */
/*  sprintf(foo,"%s %d","/tools/linux/bin/cpu -o -p ",pid);
 system(foo); */ /*  } */

 /* Do approximations for 1 to 100 points */

 /* sleep(5); */
 for (iter = 2; iter < NUM_ITERS; ++iter) {
   h = 1.0 / (double) iter;
   sum = 0.0;

   /* A slightly better approach starts from large i and works back */

   if (rank == 0)
 startwtime = MPI_Wtime();

   for (i = rank + 1; i <= iter; i += size) {
 x = h * ((double) i - 0.5);
 sum += f(x);
   }
   mypi = h * sum;

   MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
/*if (rank == 0) {
MPI_Reduce(MPI_IN_PLACE, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, 
MPI_COMM_WORLD);

   } else {
MPI_Reduce(&mypi, NULL, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   } */

/*if (rank == 0) {
 printf("%d points: pi is approximately %.16f, error = %.16f\n",
iter, pi, fabs(pi - PI25DT));
 endwtime = MPI_Wtime();
 printf("wall clock time = %f\n", endwtime - startwtime);
 fflush(stdout);
   } */
 }

 /* All done */

 MPI_Finalize();
 return 0;
}


static double
f(double a)
{
 return (4.0 / (1.0 + a * a));
} 
- 

A "good" non sm btl run (--mca btl tcp,self --mca opal_event_include 
select) gives

Process 1 of 8 on r4150_18
Process 3 of 8 on r4150_18
Process 5 of 8 on r4150_17
Process 4 of 8 on r4150_17
Process 7 of 8 on r4150_17
Process 6 of 8 on r4150_17
Process 0 of 8 on r4150_18
Process 2 of 8 on r4150_18

A "bad" sm btl run (--mca btl sm,tcp,self --mca opal_event_include 
select)

When using gdb to attach a non-terminating process shows:
(gdb) where
#0  0x00366d4c78d3 in __select_nocancel () from /lib64/libc.so.6
#1  0x2b076546 in select_dispatch (base=0x278bc00, 
arg=0x278bb50, tv=0x7fff8bd179a0)

   at ../../.././opal/event/select.c:176
#2  0x2b073308 in opal_event_base_loop (base=0x278bc00, flags=2)
   at ../../.././opal/event/event.c:803
#3  0x2b073004 in opal_event_loop (flags=2) at 
../../.././opal/event/event.c:726
#4  0x2b0636a4 in opal_progress () at 
../.././opal/runtime/opal_progress.c:189
#5  0x2aaf9c23 in opal_condition_wait (c=0x2add7bc0, 
m=0x2add7c20)

   at ../.././opal/threads/condition.h:100
#6  0x2aafa1bd in ompi_request_default_wait_all (count=1, 
requests=0x7fff8bd17b50,

   statuses=0x0) at ../.././ompi/request/req_wait.c:262
#7  0x2f0f913e in ompi_coll_tuned_reduce_generic 
(sendbuf=0x7fff8bd180d0,
   recvbuf=0x7fff8bd180c8, original_count=1, datatype=0x6012

Re: [OMPI users] Problems with compilig of OpenMPI 1.2.7

2008-08-29 Thread Rolf Vandevaart

Hi Paul:

I can comment on why you are seeing the mpicxx problem, but I am not 
sure what to do about it.


In the file mpicxx.cc there is a declaration near the bottom that looks 
like this.


const int LOCK_SHARED = MPI_LOCK_SHARED;

The preprocessor is going through that file and replacing LOCK_SHARED 
with 0x01.  Then when it tries to compile it you are trying to compile a 
line that looks like this.


const int 0x01 = 2;

That is why you see the error.

In our case I noticed that newer versions of Solaris have an include 
file called /usr/include/sys/synch.h with a define for LOCK_SHARED.


Here it is from the file.

#define LOCK_SHARED 0x01  /* same as USYNC_PROCESS */

So, what to do about it?  I am not sure yet.  I will check around a little.

Rolf


On 08/29/08 12:03, Paul Kapinos wrote:

Hi all,

I just tried to install the 1.2.7 version of OpenMPI alongside our 
existing 1.2.5 and 1.2.6.


Because we use gcc, intel, studio and pgi compilers on three OSes 
(Linux, Solaris/SPARC, Solaris/Opteron), we have at least 15 versions to 
compile (not all compilers are available everywhere). Some versions are 
configured and compiled well (so, I don't check them yet, but they 
compiled and installed...)


But 8 of the versions break the compilation with some crude error messages. 
Mostly the logs looked like


"/usr/include/infiniband/kern-abi.h", line 144: zero-sized struct/union
"btl_openib_component.c", line 293: warning: syntax error:  empty 
declaration



Six of 8 broken versions stopped the compilation on the file 
./ompi/mpi/cxx/mpicxx.cc


+
..
Making all in cxx
gmake[3]: Entering directory 
`/rwthfs/rz/cluster/home/pk224850/OpenMPI/openmpi-1.2.7_SolX86_studio/ompi/mpi/cxx' 


source='mpicxx.cc' object='mpicxx.lo' libtool=yes \
DEPDIR=.deps depmode=none /bin/bash ../../../config/depcomp \
/bin/bash ../../../libtool --tag=CXX   --mode=compile CC 
-DHAVE_CONFIG_H -I. -I../../../opal/include -I../../../orte/include 
-I../../../ompi/include  -DOMPI_BUILDING_CXX_BINDINGS_LIBRARY=1 
-DOMPI_SKIP_MPICXX=1 -I../../..-DNDEBUG -O2 -m64 -mt -c -o mpicxx.lo 
mpicxx.cc
libtool: compile:  CC -DHAVE_CONFIG_H -I. -I../../../opal/include 
-I../../../orte/include -I../../../ompi/include 
-DOMPI_BUILDING_CXX_BINDINGS_LIBRARY=1 -DOMPI_SKIP_MPICXX=1 -I../../.. 
-DNDEBUG -O2 -m64 -mt -c mpicxx.cc  -KPIC -DPIC -o .libs/mpicxx.o
"mpicxx.cc", line 293: Error: A declaration does not specify a tag or an 
identifier.

"mpicxx.cc", line 293: Error: Use ";" to terminate declarations.
"mpicxx.cc", line 293: Error: A declaration was expected instead of "0x01".
3 Error(s) detected
...
+


So, it seems to me that there is something nasty with one or more 
declarations somewhere...


Does somebody have any idea what I am doing wrong? Or maybe there is a 
bug in 1.2.7?



Best regards
Paul Kapinos


P.S. I used a quite straightforward configuration like the following:

./configure --enable-static --with-devel-headers CFLAGS="-O2 -m64" 
CXXFLAGS="-O2 -m64" F77=f77 FFLAGS="-O2 -m64" FCFLAGS="-O2 -m64" 
LDFLAGS="-O2 -m64" --prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.7/solx8664/studio



Of course the other versions (32-bit, other compilers) are configured with 
adapted settings ;-) The very same settings applied to 1.2.5 and 1.2.6 
did not produce any problems in the past.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Problems with compilig of OpenMPI 1.2.7

2008-08-29 Thread Rolf Vandevaart


Hi Again Paul:

A workaround for your issue is to add the following to your configure line.

--without-threads

This will prevent the synch.h header from being included, as it is not needed 
anyway.  We will have to figure out a better solution, but for now I 
think that will get you past your mpicxx.cc issue.
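
For example, taking the configure line from your mail, it would become something 
like (wrapped for readability):

./configure --without-threads --enable-static --with-devel-headers \
    CFLAGS="-O2 -m64" CXXFLAGS="-O2 -m64" F77=f77 FFLAGS="-O2 -m64" \
    FCFLAGS="-O2 -m64" LDFLAGS="-O2 -m64" \
    --prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.7/solx8664/studio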


Rolf

On 08/29/08 13:48, Rolf Vandevaart wrote:

Hi Paul:

I can comment on why you are seeing the mpicxx problem, but I am not 
sure what to do about it.


In the file mpicxx.cc there is a declaration near the bottom that looks 
like this.


const int LOCK_SHARED = MPI_LOCK_SHARED;

The preprocessor is going through that file and replacing LOCK_SHARED 
with 0x01.  Then when it tries to compile it you are trying to compile a 
line that looks like this.


const int 0x01 = 2;

That is why you see the error.

In our case I noticed that newer versions of Solaris have an include 
file called /usr/include/sys/synch.h with a define for LOCK_SHARED.


Here it is from the file.

#define LOCK_SHARED 0x01  /* same as USYNC_PROCESS */

So, what to do about it?  I am not sure yet.  I will check around a little.

Rolf


On 08/29/08 12:03, Paul Kapinos wrote:

Hi all,

I just tried to install the 1.2.7 version of OpenMPI alongside our 
existing 1.2.5 and 1.2.6.


Because we use gcc, intel, studio and pgi compilers on three OSes 
(Linux, Solaris/SPARC, Solaris/Opteron), we have at least 15 versions 
to compile (not all compilers are available everywhere). Some versions 
are configured and compiled well (so, I don't check them yet, but they 
compiled and installed...)


But 8 of the versions break the compilation with some crude error 
messages. Mostly the logs looked like


"/usr/include/infiniband/kern-abi.h", line 144: zero-sized struct/union
"btl_openib_component.c", line 293: warning: syntax error:  empty 
declaration



Six of 8 broken versions stopped the compilation on the file 
./ompi/mpi/cxx/mpicxx.cc


+
..
Making all in cxx
gmake[3]: Entering directory 
`/rwthfs/rz/cluster/home/pk224850/OpenMPI/openmpi-1.2.7_SolX86_studio/ompi/mpi/cxx' 


source='mpicxx.cc' object='mpicxx.lo' libtool=yes \
DEPDIR=.deps depmode=none /bin/bash ../../../config/depcomp \
/bin/bash ../../../libtool --tag=CXX   --mode=compile CC 
-DHAVE_CONFIG_H -I. -I../../../opal/include -I../../../orte/include 
-I../../../ompi/include  -DOMPI_BUILDING_CXX_BINDINGS_LIBRARY=1 
-DOMPI_SKIP_MPICXX=1 -I../../..-DNDEBUG -O2 -m64 -mt -c -o 
mpicxx.lo mpicxx.cc
libtool: compile:  CC -DHAVE_CONFIG_H -I. -I../../../opal/include 
-I../../../orte/include -I../../../ompi/include 
-DOMPI_BUILDING_CXX_BINDINGS_LIBRARY=1 -DOMPI_SKIP_MPICXX=1 -I../../.. 
-DNDEBUG -O2 -m64 -mt -c mpicxx.cc  -KPIC -DPIC -o .libs/mpicxx.o
"mpicxx.cc", line 293: Error: A declaration does not specify a tag or 
an identifier.

"mpicxx.cc", line 293: Error: Use ";" to terminate declarations.
"mpicxx.cc", line 293: Error: A declaration was expected instead of 
"0x01".

3 Error(s) detected
...
+


So, it seems to me that there is something nasty with one or more 
declarations somewhere...


Does somebody have any idea what I am doing wrong? Or maybe there is 
a bug in 1.2.7?



Best regards
Paul Kapinos


P.S. I used a quite straightforward configuration like the following:

./configure --enable-static --with-devel-headers CFLAGS="-O2 -m64" 
CXXFLAGS="-O2 -m64" F77=f77 FFLAGS="-O2 -m64" FCFLAGS="-O2 -m64" 
LDFLAGS="-O2 -m64" 
--prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.7/solx8664/studio



Of course the other versions (32-bit, other compilers) are configured with 
adapted settings ;-) The very same settings applied to 1.2.5 and 1.2.6 
did not produce any problems in the past.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Problems with compilig of OpenMPI 1.2.7

2008-09-02 Thread Rolf Vandevaart

On 08/29/08 19:27, Jeff Squyres wrote:

On Aug 29, 2008, at 10:48 AM, Rolf Vandevaart wrote:

In the file mpicxx.cc there is a declaration near the bottom that 
looks like this.


const int LOCK_SHARED = MPI_LOCK_SHARED;

The preprocessor is going through that file and replacing LOCK_SHARED 
with 0x01.  Then when it tries to compile it you are trying to compile 
a line that looks like this.


const int 0x01 = 2;

That is why you see the error.


Hmm.  This hasn't changed in mpicxx.cc for a long time.  What made it 
get activated now?




I think I touched upon this in my earlier post.  There was a change in 
/usr/include/sys/synch.h Solaris header file. And one of the changes was 
adding the following line.


#define LOCK_SHARED 0x01/* same as USYNC_PROCESS */

Therefore, we are seeing it on later versions of Solaris.

Rolf

--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Why compilig in global paths (only) for configuretion files?

2008-09-17 Thread Rolf vandeVaart

Paul Kapinos wrote:

Hi Jeff again!


(update) it works with a "plain" OpenMPI build, but it does *not* work with Sun 
Cluster Tools 8.0 (which is also an OpenMPI). So, it seems to be a Sun 
problem and not a general problem of OpenMPI. Sorry for falsely attributing 
the problem.


Ah, gotcha.  I guess my Sun colleagues on this list will need to 
address that.  ;-)


I hope!






The only trouble we have now are the error messages like

-- 


Sorry!  You were supposed to get help about:
   no hca params found
from the file:
   help-mpi-btl-openib.txt
But I couldn't find any file matching that name.  Sorry!
-- 



(the job still runs without problems! :o)

These appear when running OpenMPI from the new location after the old location has 
been removed. (If the old location is still present, there is no 
error, so it seems to be an attempt to access a file on the old path.)


Doh; that's weird.

Maybe we have to explicitly pass the OPAL_PREFIX environment 
variable to all processes?


Hmm.  I don't need to do this in my 1.2.7 installation.  I do 
something like this (I assume you're using rsh/ssh as a launcher?):


We use zsh as the login shell, ssh as the communication protocol, and a 
wrapper around mpiexec which produces a command something like



/opt/MPI/openmpi-1.2.7/linux64/intel/bin/mpiexec -x LD_LIBRARY_PATH -x 
PATH -x MPI_NAME --hostfile 
/tmp/pk224850/26654@linuxhtc01/hostfile3564 -n 2 MPI_FastTest.exe



(hostfiles are generated temporarily by our wrapper for load 
balancing, and /opt/MPI/openmpi-1.2.7/linux64/intel/ is the path to 
our local installation of OpenMPI... )



You can see that we also explicitly tell OpenMPI to export the environment 
variables PATH and LD_LIBRARY_PATH.


If we add an " -x OPAL_PREFIX " flag, and through forces explicitly 
forwarding of this environment variable, the error was not occured. So 
we mean that this variable is needed to be exported across *all* 
systhems in cluster.
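
I.e., with the flag added, our wrapper effectively runs:

/opt/MPI/openmpi-1.2.7/linux64/intel/bin/mpiexec -x OPAL_PREFIX -x LD_LIBRARY_PATH -x 
PATH -x MPI_NAME --hostfile /tmp/pk224850/26654@linuxhtc01/hostfile3564 -n 2 MPI_FastTest.exe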


It seems, the variable OPAL_PREFIX  will *NOT* be automatically 
exported to new processes on the local and remote nodes.


Maybe the FAQ in 
http://www.open-mpi.org/faq/?category=building#installdirs should be 
extended to mention this?





Did you (or anyone reading this message) have any contact with Sun 
developers to point out this circumstance? *Why* do they use 
hard-coded paths? :o)


I don't know -- this sounds like an issue with the Sun CT 8 build 
process.  It could also be a by-product of using the combined 32/64 
feature...?  I haven't used that in forever and I don't remember the 
restrictions.  Terry/Rolf -- can you comment?



I will write a separate e-mail to ct-feedb...@sun.com


Hi Paul:
Yes, there are Sun people on this list!  We originally put those 
hardcoded paths in to make everything work correctly out of the box and 
our install process ensured that everything would be at 
/opt/SUNWhpc/HPC8.0.  However, let us take a look at everything that was 
just discussed here and see what we can do.  We will get back to you 
shortly.


Rolf


Re: [OMPI users] openmpi+sge

2008-10-02 Thread Rolf Vandevaart

On 10/02/08 11:18, Reuti wrote:

Am 02.10.2008 um 16:51 schrieb Jaime Perea:


Hi

builtin, do I have to change them to ssh and sshd as in sge 6.1?


I always used only rsh, as ssh doesn't provide a Tight Integration with 
correct accounting (unless you compiled SGE with -tigth-ssh on your own).


But it would be worth a try with either the rsh or ssh stuff, as the 
builtin starter is a new feature of SGE 6.2.


-- Reuti


As was mentioned, SGE 6.2 has a new Integrated Job Starter so that rsh 
and ssh do not need to be used to start jobs on remote nodes.  This is 
the recommended way of starting as it is faster than ssh and more 
scalable than rsh.  And, you do not need to do any hacks for proper job 
accounting, as was needed for ssh.


Under the covers, Open MPI uses qrsh to start the MPI jobs on all the 
nodes.


Not sure if that helps, but just wanted to mention that information.

Rolf







Thanks again

--
Jaime Perea


El Jueves, 2 de Octubre de 2008, Reuti escribió:

Am 02.10.2008 um 16:12 schrieb Jaime Perea:

Hi again, thanks for the answer

Actually I took the definition of the pe from the openmpi
webpage, in my case

qconf -sp orte
pe_nameorte
slots  24
user_lists NONE
xuser_listsNONE
start_proc_args/bin/true
stop_proc_args /bin/true
allocation_rule$round_robin
control_slaves TRUE
job_is_first_task  TRUE
urgency_slots  min
accounting_summary FALSE

Our sge is version 6.2 and openmpi was configured with
the --with-sge switch of course.


In SGE 6.2 two types of remote startup are implemented. Which one are
you using (builtin or the former settings for each command) in the
SGE configuration?

-- Reuti


Regards

--
Jaime Perea

El Jueves, 2 de Octubre de 2008, Reuti escribió:

Hi,

Am 02.10.2008 um 15:37 schrieb Jaime Perea:

Hello,

I am having some problems with a combination of openmpi+sge6.2

Currently I'm working with the 1.3a1r19666 openmpi release and the


AFAIK, you have to enable SGE support in Open MPI 1.3 during its
compilation.


myrinet gm libraries (2.1.19)  but the problem was the same with the
prior 1.3 version. In short, I'm able to send jobs to a que via
qrsh,
more or less this way,

qrsh -cwd -V -q para -pe orte 6 mpirun -np 6 ctiming


It should also work without specifying the number of slots a second
time, i.e.:

qrsh -cwd -V -q para -pe orte 6 mpirun ctiming


ctiming is a small test program and in this way it works, but if I
try to
send the same task by using qsub on a script like this one

#!/bin/sh
#$ -pe orte 6


This PE has just /bin/true for start-/stop_proc_args?


#$ -q para
#$ -cwd
#
mpirun -np $NSLOTS  /model/jaime/ctiming


mpirun /model/jaime/ctiming


It fails with a message like this,
..

error reading job context from "qlogin_starter"


qlogin_starter should of course only be started with a qlogin command
in SGE.



--

A daemon (pid 11207) died unexpectedly with status 1 while
attempting
to launch so we are aborting.

There may be more information reported by the environment (see
above).

This may be because the daemon was unable to find all the needed
shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to
have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.

.

I know that LD_LIBRARY_PATH is not the problem, since I checked
that all
the environment is present. Any idea?

For previous releases of SGE and Open MPI I was able to make them
work
together with a few wrappers,


Which version of SGE are you using?

-- Reuti


but now the integration looks much better!
This happens only when submitting Open MPI jobs.

Thanks and all the best

---

   Jaime D. Perea Duarte. 
 Linux registered user #10472

   Dep. Astrofisica Extragalactica.
   Instituto de Astrofisica de Andalucia (CSIC)
   Apdo. 3004, 18080 Granada, Spain.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] OMPI link error with petsc 2.3.3

2008-10-07 Thread Rolf Vandevaart


This is strange.  We need to look into this a little more.  However, you 
may be OK as the warning says it is taking the value from libmpi.so 
which I believe is the correct one.  Does your program run OK?


Rolf

On 10/07/08 10:57, Doug Reeder wrote:

Yann,

It looks like somehow the libmpi and libmpi_f90 have different values 
for the variable mpi_fortran_status_ignore. It sounds like a configure 
problem. You might check the mpi include files to see if you can see 
where the different values are coming from.


Doug Reeder
On Oct 7, 2008, at 7:55 AM, Yann JOBIC wrote:


Hello,

I'm using openmpi 1.3r19400 (ClusterTools 8.0), with sun studio 12, 
and solaris 10u5


I've got this error when linking a PETSc code :
ld: warning: symbol `mpi_fortran_status_ignore_' has differing sizes:
   (file /opt/SUNWhpc/HPC8.0/lib/amd64/libmpi.so value=0x8; file 
/opt/SUNWhpc/HPC8.0/lib/amd64/libmpi_f90.so value=0x14);

   /opt/SUNWhpc/HPC8.0/lib/amd64/libmpi.so definition taken


Isn't it very strange ?

Have you got any idea on the way to solve it ?

Many thanks,

Yann
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Problem running an mpi applicatio​n on nodes with more than one interface

2012-02-17 Thread Rolf vandeVaart
Open MPI cannot handle having two interfaces on a node on the same subnet.  I 
believe it has to do with our matching code when we try to match up a 
connection.
The result is a hang as you observe.  I also believe it is not good practice to 
have two interfaces on the same subnet.
If you put them on different subnets, things will work fine and communication 
will stripe over the two of them.
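
For example, something along these lines (addresses and interface names here are 
purely illustrative):

  node1:  eth0 10.1.1.1/24   eth1 10.1.2.1/24
  node2:  eth0 10.1.1.2/24   eth1 10.1.2.2/24

so that the interfaces that talk to each other share a subnet, but no single node 
has two interfaces on the same subnet.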

Rolf


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Richard Bardwell
Sent: Friday, February 17, 2012 5:37 AM
To: Open MPI Users
Subject: Re: [OMPI users] Problem running an mpi applicatio​n on nodes with 
more than one interface

I had exactly the same problem.
Trying to run mpi between 2 separate machines, with each machine having
2 ethernet ports, causes really weird behaviour on the most basic code.
I had to disable one of the ethernet ports on each of the machines
and it worked just fine after that. No idea why though !

- Original Message -
From: Jingcha Joba
To: us...@open-mpi.org
Sent: Thursday, February 16, 2012 8:43 PM
Subject: [OMPI users] Problem running an mpi applicatio​n on nodes with more 
than one interface

Hello Everyone,
This is my 1st post in open-mpi forum.
I am trying to run a simple program which does Sendrecv between two nodes 
having 2 interface cards on each of two nodes.
Both the nodes are running RHEL6, with open-mpi 1.4.4 on a 8 core Xeon 
processor.
What I noticed was that when using two or more interfaces on both of the nodes, 
MPI "hangs" attempting to connect.
These details might help,
Node 1 - Denver has a single port "A" card (eth21 - 25.192.xx.xx - which I use 
to ssh to that machine), and a double port "B" card (eth23 - 10.3.1.1 & eth24 - 
10.3.1.2).
Node 2 - Chicago also the same single port A card (eth19 - 25.192.xx.xx - again 
uses for ssh) and a double port B card ( eth29 - 10.3.1.3 & eth30 - 10.3.1.4).
My /etc/host looks like
25.192.xx.xx denver.xxx.com denver
10.3.1.1 denver.xxx.com denver
10.3.1.2 denver.xxx.com denver
25.192.xx.xx chicago.xxx.com chicago
10.3.1.3 chicago.xxx.com chicago
10.3.1.4 chicago.xxx.com chicago
...
...
...
This is how I run,
mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude 
eth21,eth19,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
I get bunch of things from both chicago and denver, which says its has found 
components like tcp, sm, self and stuffs, and then hangs at
[denver.xxx.com:21682] btl: tcp: attempting to 
connect() to address 10.3.1.3 on port 4
[denver.xxx.com:21682] btl: tcp: attempting to 
connect() to address 10.3.1.4 on port 4
However, if I run the same program by excluding eth29 or eth30, then it works 
fine. Something like this:
mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude 
eth21,eth19,eth29,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
My hostfile looks like this
[sshuser@denver Sendrecv]$ cat host1
denver slots=2
chicago slots=2
I am not sure if I have to provide something else. If I do, please 
feel free to ask me.
thanks,
--
Joba

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] Open MPI 1.4.5 and CUDA support

2012-04-17 Thread Rolf vandeVaart
Yes, they are supported in the sense that they can work together.  However, if 
you want to have the ability to send/receive GPU buffers directly via MPI 
calls, then I recommend you get CUDA 4.1 and use the Open MPI trunk.

http://www.open-mpi.org/faq/?category=building#build-cuda
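
In short, that means configuring the trunk with something like this (the CUDA 
install path is just an example; point it at wherever CUDA 4.1 is installed):

./configure --with-cuda=/usr/local/cuda ...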

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rohan Deshpande
Sent: Tuesday, April 17, 2012 2:13 AM
To: Open MPI Users
Subject: [OMPI users] Open MPI 1.4.5 and CUDA support

Hi,

I am using Open MPI 1.4.5 and I have CUDA 3.2 installed.

Anyone knows whether CUDA 3.2 is supported by OpenMPI?

Thanks




---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] MPI and CUDA

2012-04-24 Thread Rolf vandeVaart
I am not sure about everything that is going wrong, but there are at least two 
issues I found.
First, you are skipping the first line that you read from integers.txt.  Maybe 
something like this instead.

  while (fgets(line, sizeof line, fp) != NULL) {
      sscanf(line, "%d", &data[k]);
      sum = sum + data[k]; // calculate sum to verify later on
      k++;
  }

Secondly, your function run_kernel is returning a pointer to an integer, but 
you are treating it as an integer.
A quick hack fix is:

mysumptr = run_kernel(...)
mysum = *mysumptr;
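
Spelled out a little more (just a sketch), including freeing the buffer that 
run_kernel() malloc()ed:

int *mysumptr;
int  mysum;

mysumptr = run_kernel(&data[offset], chunksize);  /* run_kernel() returns int*, not int */
mysum = *mysumptr;                                /* dereference to get the value */
free(mysumptr);                                   /* the result was malloc()ed inside run_kernel() */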

I would suggest adding lots of printfs or walking through a debugger to find 
out other places there might be problems.

Rolf

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
Rohan Deshpande [rohan...@gmail.com]
Sent: Tuesday, April 24, 2012 3:35 AM
To: Open MPI Users
Subject: [OMPI users] MPI and CUDA

I am combining MPI and CUDA, trying to find the sum of array elements using 
CUDA and using MPI to distribute the array.

my cuda code

#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *devarray, int *devsum)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
*devsum = *devsum + devarray[index];
}

extern "C"

int * run_kernel(int array[],int nelements)
{
int  *devarray, *sum, *devsum;
sum =(int *) malloc(1 * sizeof(int));

printf("\nrun_kernel called..");


cudaMalloc((void**) &devarray, sizeof(int)*nelements);
cudaMalloc((void**) &devsum, sizeof(int));
cudaMemcpy(devarray, array, sizeof(int)*nelements, 
cudaMemcpyHostToDevice);

//cudaMemcpy(devsum, sum, sizeof(int), cudaMemcpyHostToDevice);
add<<<2, 3>>>(devarray, devsum);
  //  printf("\ndevsum is %d", devsum);

cudaMemcpy(sum, devsum, sizeof(int), cudaMemcpyDeviceToHost);

printf(" \nthe sum is %d\n", *sum);
cudaFree(devarray);

cudaFree(devsum);
return sum;

}



#include "mpi.h"

#include <stdio.h>
#include <stdlib.h>

#include <string.h>

#define  ARRAYSIZE  2000

#define  MASTER 0
int  data[ARRAYSIZE];


int main(int argc, char* argv[])
{


int   numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize, 
namelen;

int mysum;
long sum;
int update(int myoffset, int chunk, int myid);

char myname[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;
double start = 0.0, stop = 0.0, time = 0.0;

double totaltime;
FILE *fp;
char line[128];

char element;
int n;
int k=0;


/* Initializations */

MPI_Init(&argc, &argv);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);

MPI_Get_processor_name(myname, &namelen);
printf ("MPI task %d has started on host %s...\n", taskid, myname);

chunksize = (ARRAYSIZE / numtasks);
tag2 = 1;

tag1 = 2;

/* Master task only **/


if (taskid == MASTER){

  fp=fopen("integers.txt", "r");

  if(fp != NULL){
   sum = 0;

   while(fgets(line, sizeof line, fp)!= NULL){

fscanf(fp,"%d",&data[k]);
sum = sum + data[k]; // calculate sum to verify later on

k++;
   }
  }


printf("Initialized array sum %d\n", sum);


  /* Send each task its portion of the array - master keeps 1st part */

  offset = chunksize;
  for (dest=1; dest MASTER) {


  /* Receive my portion of array from the master task */

  start= MPI_Wtime();
  source = MASTER;

  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);

  MPI_Recv(&data[offset], chunksize, MPI_INT, source, tag2,MPI_COMM_WORLD, 
&status);

  mysum = run_kernel(&data[offset], chunksize);
  printf("\nKernel returns sum %d ", mysum);


// mysum = update(offset, chunksize, taskid);

  stop = MPI_Wtime();
  time = stop -start;

  printf("time taken by process %d to recieve elements and caluclate own sum is 
= %lf seconds \n", taskid, time);

 // totaltime = totaltime + time;




  /* Send my results back to the master task */

  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);

  MPI_Send(&data[offset], chunksize, MPI_INT, MASTER, tag2, MPI_COMM_WORLD);

  MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, MASTER, MPI_COMM_WORLD);


  } /* end of non-master */


 MPI_Finalize();
}

here is the output of above code -

MPI task 2 has started on host 4
MPI task 3 has started on host 4
MPI task 0 has started on host 4
MPI task 1 has started on host 4

Initialized array sum 9061
Sent 500 elements to task 1 offset= 500
Sent 500 elements to task 2 offset= 1000
Sent 500 elements to task 3 offset= 1500



run_kernel called..
the sum is 10

Kernel returns sum 159300360 time taken by process 2 to recieve elements and 
caluclate own sum is = 0.290016 seconds
run_kernel called..
the sum is 268452367
run_kernel called..
the sum is 10

Kernel returns sum 145185544 time taken by process 3 to recieve elements and 
caluclate own sum is = 0.293579 seconds
run_kernel called..
the sum is 1048

Kernel returns sum 156969736 time taken by process 1 to recieve elements and 
c

Re: [OMPI users] MPI over tcp

2012-05-03 Thread Rolf vandeVaart
I tried your program on a single node and it worked fine.  Yes, TCP message 
passing in Open MPI has been working well for some time.
I have a few suggestions.
1. Can you run something like hostname successfully (mpirun -np 10 -hostfile 
yourhostfile hostname)
2. If that works, then you can also run with a debug switch to see what 
connections are being made by MPI.

I would suggest reading through here for some ideas and for the debug switch.

http://www.open-mpi.org/faq/?category=tcp
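
For example, something like this will show the connection attempts being made:

mpirun -np 10 -hostfile yourhostfile --mca btl tcp,self --mca btl_base_verbose 30 ./your_program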

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Don Armstrong
>Sent: Thursday, May 03, 2012 2:51 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] MPI over tcp
>
>I'm attempting to use MPI over tcp; the attached (rather trivial) code gets
>stuck in MPI_Send. Looking at TCP dumps indicates that the TCP connection is
>made successfully to the right port, but the actual data doesn't appear to be
>sent.
>
>I'm beginning to suspect that there's some basic problem with my
>configuration, or an underlying bug in TCP message passing in MPI. Any
>suggestions to try (or a response indicating that MPI over TCP actually works,
>and that it's some problem with my setup) appreciated.
>
>The relevant portion of the hostfile looks like this:
>
>archimedes.int.donarmstrong.com slots=2
>krel.int.donarmstrong.com slots=8
>
>and the output of the run and tcpdump is attached.
>
>Thanks in advance.
>
>
>Don Armstrong
>
>--
>[T]he question of whether Machines Can Think, [...] is about as relevant as the
>question of whether Submarines Can Swim.
> -- Edsger W. Dijkstra "The threats to computing science"
>
>http://www.donarmstrong.com  http://rzlab.ucr.edu

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---



Re: [OMPI users] MPI over tcp

2012-05-04 Thread Rolf vandeVaart

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Don Armstrong
>Sent: Thursday, May 03, 2012 5:43 PM
>To: us...@open-mpi.org
>Subject: Re: [OMPI users] MPI over tcp
>
>On Thu, 03 May 2012, Rolf vandeVaart wrote:
>> I tried your program on a single node and it worked fine.
>
>It works fine on a single node, but deadlocks when it communicates in
>between nodes. Single node communication doesn't use tcp by default.
>
>> Yes, TCP message passing in Open MPI has been working well for some
>> time.
>
>Ok. Which version(s) of openmpi are you using successfully? [I'm assuming
>that this is in an environment which doesn't use IB.]

I was using a trunk version from a month or so ago.  However, TCP has not 
changed too much over the years, so I would expect all versions to work just 
fine.

>
>> 1. Can you run something like hostname successfully (mpirun -np 10
>> -hostfile yourhostfile hostname)
>
>Yes, but this only shows that processes start and output is returned, which
>doesn't utilize the in-band message passing at all.

Yes, I agree. But it at least shows that TCP connections can work between the 
machines.  We typically first make sure that something like hostname works.
Then we try something like the connectivity_c.c program in the examples 
directory to test out MPI communication.
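A hypothetical run of that test, once it is built in the examples directory 
(hostfile name is a placeholder):

  mpirun -np 2 -hostfile yourhostfile examples/connectivity_c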

>
>> 2. If that works, then you can also run with a debug switch to see
>> what connections are being made by MPI.
>
>You can see the connections being made in the attached log:
>
>[archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address
>138.23.141.162 on port 2001

Yes, I missed that.  So, can we simplify the problem.  Can you run with np=2 
and one process on each node?
Also, maybe you can send the ifconfig output from each node.  We sometimes see 
this type of hanging when
a node has two different interfaces on the same subnet.  

Assuming there are multiple interfaces, can you experiment with the runtime 
flags outlined here?
http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Maybe by restricting to specific interfaces you can figure out which network is 
the problem.
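As a sketch, the kind of flags I mean look like this (the interface names here 
are only examples; substitute the ones ifconfig reports on your nodes):

  mpirun -np 2 -hostfile yourhostfile --mca btl tcp,self --mca btl_tcp_if_include eth0 ./your_program
  mpirun -np 2 -hostfile yourhostfile --mca btl tcp,self --mca btl_tcp_if_exclude lo,virbr0 ./your_program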

>
>> I would suggest reading through here for some ideas and for the debug
>> switch.
>
>Thanks. I checked the FAQ, and didn't see anything that shed any light,
>unfortunately.
>
>
>Don Armstrong
>
>--
>Fate and Temperament are two words for one and the same concept.
> -- Novalis [Hermann Hesse _Demian_]
>
>http://www.donarmstrong.com  http://rzlab.ucr.edu
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---



Re: [OMPI users] GPU and CPU timing - OpenMPI and Thrust

2012-05-08 Thread Rolf vandeVaart
You should be running with one GPU per MPI process.  If I understand correctly, 
you have a 3 node cluster and each node has a GPU so you should run with np=3.
Maybe you can try that and see if your numbers come out better.
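A minimal sketch of the one-GPU-per-rank pattern (assumes each node runs one 
rank and has at least one visible device; error checking omitted):

  int rank, ndev;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);   /* bind this rank to its own GPU before doing any CUDA work */

With np=3 and one rank per node, each rank then drives exactly one GPU.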


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rohan Deshpande
Sent: Monday, May 07, 2012 9:38 PM
To: Open MPI Users
Subject: [OMPI users] GPU and CPU timing - OpenMPI and Thrust

 I am running MPI and Thrust code on a cluster and measuring time for 
calculations.

My MPI code -

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define  MASTER 0
#define ARRAYSIZE 2000

int 
*masterarray,*onearray,*twoarray,*threearray,*fourarray,*fivearray,*sixarray,*sevenarray,*eightarray,*ninearray;
   int main(int argc, char* argv[])
{
  int   numtasks, taskid,chunksize, namelen;
  int mysum,one,two,three,four,five,six,seven,eight,nine;

  char myname[MPI_MAX_PROCESSOR_NAME];
  MPI_Status status;
  int a,b,c,d,e,f,g,h,i,j;

/* Initializations */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
MPI_Get_processor_name(myname, &namelen);
printf ("MPI task %d has started on host %s...\n", taskid, myname);

masterarray= malloc(ARRAYSIZE * sizeof(int));
onearray= malloc(ARRAYSIZE * sizeof(int));
twoarray= malloc(ARRAYSIZE * sizeof(int));
threearray= malloc(ARRAYSIZE * sizeof(int));
fourarray= malloc(ARRAYSIZE * sizeof(int));
fivearray= malloc(ARRAYSIZE * sizeof(int));
sixarray= malloc(ARRAYSIZE * sizeof(int));
sevenarray= malloc(ARRAYSIZE * sizeof(int));
eightarray= malloc(ARRAYSIZE * sizeof(int));
ninearray= malloc(ARRAYSIZE * sizeof(int));

/* Master task only **/
if (taskid == MASTER){
   for(a=0; a < ARRAYSIZE; a++){
 masterarray[a] = 1;

}
   mysum = run_kernel0(masterarray,ARRAYSIZE,taskid, myname);

 }  /* end of master section */

  if (taskid > MASTER) {

 if(taskid == 1){
for(b=0; b < ARRAYSIZE; b++){
  onearray[b] = 1;
}
mysum = run_kernel0(onearray, ARRAYSIZE, taskid, myname);
 }
 /* ...analogous blocks for tasks 2 through 9 and the rest of main() follow... */

My kernel code is -

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/extrema.h>

  extern "C"
 int run_kernel0( int array[], int nelements, int taskid, char hostname[])
 {

   float elapsedTime;
int result = 0;
int threshold = 2500;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
thrust::device_vector<int> gpuarray;
int *begin = array;
int *end = array + nelements;
while(begin != end)
{
   int chunk_size = thrust::min(threshold,end - begin);
   gpuarray.assign(begin, begin + chunk_size);
 result += thrust::reduce(gpuarray.begin(), gpuarray.end());
   begin += chunk_size;
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

printf(" Task %d on has sum (on GPU): %ld Time for the kernel: %f ms 
\n", taskid, result, elapsedTime);

return result;
}

I also calculate the sum using CPU and the code is as below -

  struct timespec time1, time2, temp_time;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
  int i;
  int cpu_sum = 0;
  long diff = 0;

  for (i = 0; i < nelements; i++) {
cpu_sum += array[i];
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time2);
  temp_time.tv_sec = time2.tv_sec - time1.tv_sec;
  temp_time.tv_nsec = time2.tv_nsec - time1.tv_nsec;
  diff = temp_time.tv_sec * 1000000000 + temp_time.tv_nsec;
  printf("Task %d calculated sum: %d using CPU in %lf ms \n", taskid, cpu_sum, 
(double) diff/1000000);
  return cpu_sum;

Now when I run the job on cluster with 10 MPI tasks and compare the timings of 
CPU and GPU, I get weird results where GPU time is much much higher than CPU 
time.
But the case should be the opposite, shouldn't it?

The CPU time is almost same for all the task but GPU time increases.

Just wondering what might be the cause of this or are these results correct? 
Anything wrong with MPI code?

My cluster has 3 machines. 4 MPI tasks run on 2 machine and 2 Tasks run on 1 
machine.
Each machine has 1 GPU - GForce 9500 GT with 512 MB memory.

Can anyone please help me with this?

Thanks
--




---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does not take arguments

2012-06-18 Thread Rolf vandeVaart
Hi Dmitry:
Let me look into this.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Dmitry N. Mikushin
Sent: Monday, June 18, 2012 10:56 AM
To: Open MPI Users
Cc: Олег Рябков
Subject: Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does 
not take arguments

Yeah, definitely. Thank you, Jeff.

- D.
2012/6/18 Jeff Squyres mailto:jsquy...@cisco.com>>
On Jun 18, 2012, at 10:41 AM, Dmitry N. Mikushin wrote:

> No, I'm configuring with gcc, and for openmpi-1.6 it works with nvcc without 
> a problem.
Then I think Rolf (from Nvidia) should figure this out; I don't have access to 
nvcc.  :-)

> Actually, nvcc always meant to be more or less compatible with gcc, as far as 
> I know. I'm guessing in case of trunk nvcc is the source of the issue.
>
> And with ./configure CC=nvcc etc. it won't build:
> /home/dmikushin/forge/openmpi-trunk/opal/mca/event/libevent2019/libevent/include/event2/util.h:126:2:
>  error: #error "No way to define ev_uint64_t"
You should complain to Nvidia about that.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does not take arguments

2012-06-18 Thread Rolf vandeVaart
Dmitry:

It turns out that by default in Open MPI 1.7, configure enables warnings for 
deprecated MPI functionality.  In Open MPI 1.6, these warnings were disabled by 
default.
That explains why you would not see this issue in the earlier versions of Open 
MPI.

I assume that gcc must have added support for __attribute__((__deprecated__)) 
and then later on __attribute__((__deprecated__(msg))) and your version of gcc 
supports both of these.  (My version of gcc, 4.5.1 does not support the msg in 
the attribute)

The version of nvcc you have does not support the "msg" argument so everything 
blows up.
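For illustration, the two attribute forms in question are plain gcc syntax, 
nothing Open MPI specific (hypothetical function names):

  /* accepted by older gcc and by this nvcc front end */
  void old_api(void)  __attribute__((__deprecated__));
  /* accepted only by compilers that understand the message argument */
  void old_api2(void) __attribute__((__deprecated__("use new_api instead")));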

I suggest you configure with --disable-mpi-interface-warning which will prevent 
any of the deprecated attributes from being used and then things should work 
fine.
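A hypothetical configure line with that flag (prefix and compilers are 
placeholders):

  ./configure --prefix=$HOME/openmpi-trunk-install CC=gcc CXX=g++ --disable-mpi-interface-warning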

Let me know if this fixes your problem.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rolf vandeVaart
Sent: Monday, June 18, 2012 11:00 AM
To: Open MPI Users
Cc: Олег Рябков
Subject: Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does 
not take arguments

Hi Dmitry:
Let me look into this.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Dmitry N. Mikushin
Sent: Monday, June 18, 2012 10:56 AM
To: Open MPI Users
Cc: Олег Рябков
Subject: Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does 
not take arguments

Yeah, definitely. Thank you, Jeff.

- D.
2012/6/18 Jeff Squyres mailto:jsquy...@cisco.com>>
On Jun 18, 2012, at 10:41 AM, Dmitry N. Mikushin wrote:

> No, I'm configuring with gcc, and for openmpi-1.6 it works with nvcc without 
> a problem.
Then I think Rolf (from Nvidia) should figure this out; I don't have access to 
nvcc.  :-)

> Actually, nvcc always meant to be more or less compatible with gcc, as far as 
> I know. I'm guessing in case of trunk nvcc is the source of the issue.
>
> And with ./configure CC=nvcc etc. it won't build:
> /home/dmikushin/forge/openmpi-trunk/opal/mca/event/libevent2019/libevent/include/event2/util.h:126:2:
>  error: #error "No way to define ev_uint64_t"
You should complain to Nvidia about that.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users


This email message is for the sole use of the intended recipient(s) and may 
contain confidential information.  Any unauthorized review, use, disclosure or 
distribution is prohibited.  If you are not the intended recipient, please 
contact the sender by reply email and destroy all copies of the original 
message.



Re: [OMPI users] gpudirect p2p (again)?

2012-07-09 Thread Rolf vandeVaart
Yes, this feature is in Open MPI 1.7.  It is implemented in the "smcuda" btl.  
If you configure as outlined in the FAQ, then things should just work.  The 
smcuda btl will be selected and P2P will be used between GPUs on the same node. 
 This is only utilized on transfers of buffers that are larger than 4K in size.
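For reference, a sketch of the build and run combination I have in mind 
(install prefix, CUDA path and program name are placeholders):

  ./configure --prefix=$HOME/ompi-1.7-install --with-cuda=/usr/local/cuda
  mpirun -np 2 --mca btl smcuda,self ./your_cuda_app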

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Crni Gorac
>Sent: Monday, July 09, 2012 1:25 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] gpudirect p2p (again)?
>
>Trying to examine CUDA support in OpenMPI, using OpenMPI current feature
>series (v1.7).  There was a question on this mailing list back in October 2011
>(http://www.open-mpi.org/community/lists/users/2011/10/17539.php),
>about OpenMPI being able to use P2P transfers in case when two MPI
>processed involved in the transfer operation happens to execute on the same
>machine, and the answer was that this feature is being implemented.  So my
>question is - what is the current status here, is this feature supported now?
>
>Thanks.
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---



Re: [OMPI users] bug in CUDA support for dual-processor systems?

2012-07-31 Thread Rolf vandeVaart
The current implementation does assume that the GPUs are on the same IOH and 
therefore can use the IPC features of the CUDA library for communication.
One of the initial motivations for this was that to be able to detect whether 
GPUs can talk to one another, the CUDA library has to be initialized and the 
GPUs have to be selected by each rank.  It is at that point that we can 
determine whether the IPC will work between the GPUs.However, this means 
that the GPUs need to be selected by each rank prior to the call to MPI_Init as 
that is where we determine whether IPC is possible, and we were trying to avoid 
that requirement.
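As a sketch of what that requirement would look like in user code (assuming one 
rank per GPU and using the local rank environment variable that mpirun exports):

  #include <stdlib.h>
  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      /* choose the GPU before MPI_Init so IPC can be tested between the
         devices the ranks will actually use */
      const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
      cudaSetDevice(lr ? atoi(lr) : 0);
      MPI_Init(&argc, &argv);
      /* ... application ... */
      MPI_Finalize();
      return 0;
  }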

I will submit a ticket against this and see if we can improve this.

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Zbigniew Koza
>Sent: Tuesday, July 31, 2012 12:38 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] bug in CUDA support for dual-processor systems?
>
>Hi,
>
>I wrote a simple program to see if OpenMPI can really handle cuda pointers as
>promised in the FAQ and how efficiently.
>The program (see below) breaks if MPI communication is to be performed
>between two devices that are on the same node but under different IOHs in a
>dual-processor Intel machine.
>Note that  cudaMemCpy works for such devices, although not as efficiently as
>for the devices on the same IOH and GPUDirect enabled.
>
>Here's the output from my program:
>
>===
>
> >  mpirun -n 6 ./a.out
>Init
>Init
>Init
>Init
>Init
>Init
>rank: 1, size: 6
>rank: 2, size: 6
>rank: 3, size: 6
>rank: 4, size: 6
>rank: 5, size: 6
>rank: 0, size: 6
>device 3 is set
>Process 3 is on typhoon1
>Using regular memory
>device 0 is set
>Process 0 is on typhoon1
>Using regular memory
>device 4 is set
>Process 4 is on typhoon1
>Using regular memory
>device 1 is set
>Process 1 is on typhoon1
>Using regular memory
>device 5 is set
>Process 5 is on typhoon1
>Using regular memory
>device 2 is set
>Process 2 is on typhoon1
>Using regular memory
>^C^[[A^C
>zkoza@typhoon1:~/multigpu$
>zkoza@typhoon1:~/multigpu$ vim cudamussings.c
>zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart
>-L/usr/local/cuda/lib64 -I/usr/local/cuda/include
>zkoza@typhoon1:~/multigpu$ vim cudamussings.c
>zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart
>-L/usr/local/cuda/lib64 -I/usr/local/cuda/include
>zkoza@typhoon1:~/multigpu$ mpirun -n 6 ./a.out
>Process 1 of 6 is on typhoon1
>Process 2 of 6 is on typhoon1
>Process 0 of 6 is on typhoon1
>Process 4 of 6 is on typhoon1
>Process 5 of 6 is on typhoon1
>Process 3 of 6 is on typhoon1
>device 2 is set
>device 1 is set
>device 0 is set
>Using regular memory
>device 5 is set
>device 3 is set
>device 4 is set
>Host->device bandwidth for processor 1: 1587.993499 MB/sec
>Host->device bandwidth for processor 2: 1570.275316 MB/sec
>Host->device bandwidth for processor 3: 1569.890751 MB/sec
>Host->device bandwidth for processor 5: 1483.637702 MB/sec
>Host->device bandwidth for processor 0: 1480.888029 MB/sec
>Host->device bandwidth for processor 4: 1476.241371 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [1] bandwidth: 3338.57 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [1] bandwidth: 420.85 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[1] bandwidth: 362.13 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Device[1] bandwidth: 6552.35 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [2] bandwidth: 3238.88 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [2] bandwidth: 418.18 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[2] bandwidth: 362.06 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Device[2] bandwidth: 5022.82 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [3] bandwidth: 3295.32 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [3] bandwidth: 418.90 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[3] bandwidth: 359.16 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Device[3] bandwidth: 5019.89 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [4] bandwidth: 4619.55 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [4] bandwidth: 419.24 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[4] bandwidth: 364.52 MB/sec
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>   cuIpcOpenMemHandle return value:   205
>   address: 0x20020
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>[typhoon1:06098] Failed to register remote memory, rc=-1 [typhoon1:06098]
>[[33788,1],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 465
>
>
>
>
>
>Comment:
>In my machine there are 2 six-core intel processors with HT on, yielding
>24 virtual processors, and  6 Tesla C2070s.
>The dev

Re: [OMPI users] CUDA in v1.7? (was: Compilation of OpenMPI 1.5.4 & 1.6.X fail for PGI compiler...)

2012-08-09 Thread Rolf vandeVaart
>-Original Message-
>From: Jeff Squyres [mailto:jsquy...@cisco.com]
>Sent: Thursday, August 09, 2012 9:45 AM
>To: Open MPI Users
>Cc: Rolf vandeVaart
>Subject: CUDA in v1.7? (was: Compilation of OpenMPI 1.5.4 & 1.6.X fail for PGI
>compiler...)
>
>On Aug 9, 2012, at 9:37 AM, ESCOBAR Juan wrote:
>
>> ... but as I'am also interested in testing the Open-MPI/CUDA feature (
>> with potentially pgi-acc or open-acc directive ) I've 'googled' and finish 
>> in the
>the Open-MPI  'trunck' .
>>
>> => this Open-MPI/CUDA feature will be only in the 1.9 serie or also on
>1.7/1.8 ?
>
>Good question.
>
>Rolf -- do you plan to bring cuda stuff to v1.7.x?
>
>--

Yes, there currently is support in Open MPI 1.7 for CUDA.  What is missing is 
some improvements for internode transfers over IB which I still plan to check 
into the trunk and then move over to 1.7.  
Hopefully within the next month or so.

Rolf
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---



Re: [OMPI users] RDMA GPUDirect CUDA...

2012-08-14 Thread Rolf vandeVaart
To answer the original questions, Open MPI will look at taking advantage of the 
RDMA CUDA when it is available.  Obviously, work needs to be done to figure out 
the best way to integrate into the library.  Much like there are a variety of 
protocols under the hood to support host transfer of data via IB, we will have 
to see what works  best for transferring GPU buffers.

It is unclear how this will affect the send/receive latency.

Lastly, the support will be for Kepler -class Quadro and Tesla devices.

Rolf


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Durga Choudhury
Sent: Tuesday, August 14, 2012 4:46 PM
To: Open MPI Users
Subject: Re: [OMPI users] RDMA GPUDirect CUDA...

Dear OpenMPI developers

I'd like to add my 2 cents that this would be a very desirable feature 
enhancement for me as well (and perhaps others).

Best regards
Durga

On Tue, Aug 14, 2012 at 4:29 PM, Zbigniew Koza 
mailto:zzk...@gmail.com>> wrote:
Hi,

I've just found this information on  nVidia's plans regarding enhanced support 
for MPI in their CUDA toolkit:
http://developer.nvidia.com/cuda/nvidia-gpudirect

The idea that two GPUs can talk to each other via network cards without CPU as 
a middleman looks very promising.
This technology is supposed to be revealed and released in September.

My questions:

1. Will OpenMPI include   RDMA support in its CUDA interface?
2. Any idea how much can this technology reduce the CUDA Send/Recv latency?
3. Any idea whether this technology will be available for Fermi-class Tesla 
devices or only for Keplers?

Regards,

Z Koza



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] ompi-clean on single executable

2012-10-24 Thread Rolf vandeVaart
And just to give a little context, ompi-clean was created initially to "clean" 
up a node, not for cleaning up a specific job.  It was for the case where MPI 
jobs would leave some files behind or leave some processes running.  (I do not 
believe this happens much at all anymore.)  But, as was said, no reason it 
could not be modified.

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Jeff Squyres
>Sent: Wednesday, October 24, 2012 12:56 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] ompi-clean on single executable
>
>...but patches would be greatly appreciated.  :-)
>
>On Oct 24, 2012, at 12:24 PM, Ralph Castain wrote:
>
>> All things are possible, including what you describe. Not sure when we
>would get to it, though.
>>
>>
>> On Oct 24, 2012, at 4:01 AM, Nicolas Deladerriere
> wrote:
>>
>>> Reuti,
>>>
>>> The problem I am facing is a small small part of our production
>>> system, and I cannot modify our mpirun submission system. This is why
>>> i am looking at solution using only ompi-clean of mpirun command
>>> specification.
>>>
>>> Thanks,
>>> Nicolas
>>>
>>> 2012/10/24, Reuti :
 Am 24.10.2012 um 11:33 schrieb Nicolas Deladerriere:

> Reuti,
>
> Thanks for your comments,
>
> In our case, we are currently running different mpirun commands on
> clusters sharing the same frontend. Basically we use a wrapper to
> run the mpirun command and to run an ompi-clean command to clean
>up
> the mpi job if required.
> Using ompi-clean like this just kills all other mpi jobs running on
> same frontend. I cannot use queuing system

 Why? Using it on a single machine was only one possible setup. Its
 purpose is to distribute jobs to slave hosts. If you have already
 one frontend as login-machine it fits perfect: the qmaster (in case
 of SGE) can run there and the execd on the nodes.

 -- Reuti


> as you have suggested this
> is why I was wondering a option or other solution associated to
> ompi-clean command to avoid this general mpi jobs cleaning.
>
> Cheers
> Nicolas
>
> 2012/10/24, Reuti :
>> Hi,
>>
>> Am 24.10.2012 um 09:36 schrieb Nicolas Deladerriere:
>>
>>> I am having issue running ompi-clean which clean up (this is
>>> normal) session associated to a user which means it kills all
>>> running jobs assoicated to this session (this is also normal).
>>> But I would like to be able to clean up session associated to a
>>> job (a not user).
>>>
>>> Here is my point:
>>>
>>> I am running two executable :
>>>
>>> % mpirun -np 2 myexec1
>>> --> run with PID 2399 ...
>>> % mpirun -np 2 myexec2
>>> --> run with PID 2402 ...
>>>
>>> When I run orte-clean I got this result :
>>> % orte-clean -v
>>> orte-clean: cleaning session dir tree
>>> openmpi-sessions-ndelader@myhost_0
>>> orte-clean: killing any lingering procs
>>> orte-clean: found potential rogue orterun process
>>> (pid=2399,user=ndelader), sending SIGKILL...
>>> orte-clean: found potential rogue orterun process
>>> (pid=2402,user=ndelader), sending SIGKILL...
>>>
>>> Which means that both jobs have been killed :-( Basically I would
>>> like to perform orte-clean using executable name or PID or
>>> whatever that identify which job I want to stop an clean. It
>>> seems I would need to create an openmpi session per job. Does it
>make sense ?
>>> And
>>> I would like to be able to do something like following command
>>> and get following result :
>>>
>>> % orte-clean -v myexec1
>>> orte-clean: cleaning session dir tree
>>> openmpi-sessions-ndelader@myhost_0
>>> orte-clean: killing any lingering procs
>>> orte-clean: found potential rogue orterun process
>>> (pid=2399,user=ndelader), sending SIGKILL...
>>>
>>>
>>> Does it make sense ? Is there a way to perform this kind of
>>> selection in cleaning process ?
>>
>> How many jobs are you starting on how many nodes at one time? This
>> requirement could be a point to start to use a queuing system,
>> where can remove job individually and also serialize your
>> workflow. In fact: we use GridEngine also local on workstations
>> for this purpose.
>>
>> -- Reuti
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users

>>> ___
>>> users m

Re: [OMPI users] mpi_leave_pinned is dangerous

2012-11-08 Thread Rolf vandeVaart
Not sure.  I will look into this.   And thank you for the feedback Jens!
Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Jeff Squyres
>Sent: Thursday, November 08, 2012 8:49 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] mpi_leave_pinned is dangerous
>
>On Nov 7, 2012, at 7:21 PM, Jens Glaser wrote:
>
>> With the help of MVAPICH2 developer S. Potluri the problem was isolated
>and fixed.
>
>Sorry about not replying; we're all (literally) very swamped trying to prepare
>for the Supercomputing trade show/conference next week.  I know I'm
>wy behind on OMPI user mails; sorry folks.  :-(
>
>> It was, as expected, due to the library not intercepting the
>> cudaHostAlloc() and cudaFreeHost() calls to register pinned memory, as
>would be required for the registration cache to work.
>
>Rolf/NVIDIA -- what's the chance of getting that to be intercepted properly?
>Do you guys have good hooks for this?  (HINT HINT :-) )
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---



Re: [OMPI users] Stream interactions in CUDA

2012-12-13 Thread Rolf vandeVaart
Hi Justin:

I assume you are running on a single node.  In that case, Open MPI is supposed 
to take advantage of the CUDA IPC support.  This will be used only when 
messages are larger than 4K,  which yours are.  In that case, I would have 
expected that the library would exchange some messages and then do a DMA copy 
of all the GPU data.  For messages smaller than that, the library does 
synchronous copies through host memory.  (cuMemcpy).
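If you want to experiment with where that cutover happens, the smcuda eager 
limit can be changed on the command line; a hypothetical invocation (program 
name is a placeholder):

  mpirun -np 2 --mca btl smcuda,self --mca btl_smcuda_eager_limit 4096 ./your_app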

Let me try out your application and see what I see.

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Jens Glaser
>Sent: Wednesday, December 12, 2012 8:12 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] Stream interactions in CUDA
>
>Hi Justin
>
>from looking at your code it seems you are receiving more bytes from the
>processors then you send (I assume MAX_RECV_SIZE_PER_PE >
>send_sizes[p]).
>I don't think this is valid. Your transfers should have matched sizes on the
>sending and receiving side. To achieve this, either communicate the message
>size before exchanging the actual data (a simple MPI_Isend/MPI_Irecv pair
>with one MPI_INT will do), or use a mechanism provided by the MPI library for
>this. I believe MPI_Probe is made for this purpose.
>
>As to why the transfers occur, my wild guess would be: you have set
>MAX_RECV_SIZE_PER_PE to something large, which would explain the size
>and number of the H2D transfers.
>I am just guessing, but maybe OMPI divides the data into chunks. Unless you
>are using intra-node Peer2Peer (smcuda), all MPI traffic has to go through the
>host, therefore the copies.
>I don't know what causes the D2H transfers to be of the same size, the library
>might be doing something strange here, given that you have potentially asked
>it to receive more data then you send - don't do that. Your third loop actually
>does not exchange the data, as you wrote, it just does an extra copying of
>data which in principle you could avoid by sending the message sizes first.
>
>Concerning your question about asynchronous copying. If you are using
>device buffers (and it seems you do) for MPI, then you will have to rely on the
>library to do asynchronous copying of the buffers (cudaMemcpyAsync) for
>you. I don't know if OpenMPI does this, you could check the source. I think
>MVAPICH2 does. If you really want control over the streams, you have to the
>D2H/H2D copying yourself, which is fine unless you are relying on peer-to-
>peer capability - but it seems you don't. If you are manually copying the data
>you can give any stream parameter to the cudaMemcpyAsync calls you prefer.
>
>My general experiences can be summarized as: achieving true async MPI
>computation is hard if using the CUDA support of the library, but very easy if
>you are using only the host routines of MPI. Since your kernel calls are async
>with respect to host already, all you have to do is asynchronously copy the
>data between host and device.
>
>Jens
>
>On Dec 12, 2012, at 6:30 PM, Justin Luitjens wrote:
>
>> Hello,
>>
>> I'm working on an application using OpenMPI with CUDA and GPUDirect.  I
>would like to get the MPI transfers to overlap with computation on the CUDA
>device.  To do this I need to ensure that all memory transfers do not go to
>stream 0.  In this application I have one step that performs an MPI_Alltoall
>operation.  Ideally I would like this Alltoall operation to be asynchronous.  
>Thus
>I have implemented my own Alltoall using Isend and Irecv.  Which can be
>found at the bottom of this email.
>>
>> The profiler shows that this operation has some very odd PCI-E traffic that I
>was hoping someone could explain and help me eliminate.  In this example
>NPES=2 and each process has its own M2090 GPU.  I am using cuda 5.0 and
>OpenMPI-1.7rc5.  The behavior I am seeing is the following.  Once the Isend
>loop occurs there is a sequence of DtoH followed by HtoD transfers.  These
>transfers are 256K in size and there are 28 of them that occur.  Each of these
>transfers are placed in stream0.  After this there are a few more small
>transfers also placed in stream0.  Finally when the 3rd loop occurs there are 2
>DtoD transfers (this is the actual data being exchanged).
>>
>> Can anyone explain what all of the traffic ping-ponging back and forth
>between the host and device is?  Is this traffic necessary?
>>
>> Thanks,
>> Justin
>>
>>
>> uint64_t scatter_gather( uint128 * input_buffer, uint128
>> *output_buffer, uint128 *recv_buckets, int* send_sizes, int
>> MAX_RECV_SIZE_PER_PE) {
>>
>>  std::vector<MPI_Request> srequest(NPES), rrequest(NPES);
>>
>>  //Start receives
>>  for(int p=0;p<NPES;p++){
>>MPI_Irecv(recv_buckets+MAX_RECV_SIZE_PER_PE*p,MAX_RECV_SIZE_PER_PE,MPI_INT_128,p,0,MPI_COMM_WORLD,&rrequest[p]);
>>  }
>>
>>  //Start sends
>>  int send_count=0;
>>  for(int p=0;p<NPES;p++){
>>MPI_Isend(input_buffer+send_count,send_sizes[p],MPI_INT_128,p,0,MPI_COMM_WORLD,&srequest[p]);
>>send_count+=send_sizes[p];
>>  }
>>
>>  //Process out

Re: [OMPI users] status of cuda across multiple IO hubs?

2013-03-11 Thread Rolf vandeVaart
Yes, unfortunately, that issue is still unfixed.  I just created the ticket and 
included a possible workaround.

https://svn.open-mpi.org/trac/ompi/ticket/3531

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Russell Power
>Sent: Monday, March 11, 2013 11:28 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] status of cuda across multiple IO hubs?
>
>I'm running into issues when trying to use GPUs in a multiprocessor system
>when using the latest release candidate (1.7rc8).
>Specifically, it looks like the OpenMPI code is still assuming that all GPUs 
>are
>on the same IOH, as in this message from a few months
>ago:
>
>http://www.open-mpi.org/community/lists/users/2012/07/19879.php
>
>I couldn't determine what happened to the ticket mentioned in that thread.
>
>For the moment, I'm just constraining myself to using the GPUs attached to
>one processor, but obviously that's less then ideal :).
>
>Curiously, the eager send path doesn't seem to have the same issue - if I
>adjust btl_smcuda_eager_limit up, sends work up to that threshold.
>Unfortunately, if I increase it beyond 10 megabytes I start seeing bus errors.
>
>I can manually breakup my own sends to be below the eager limit, but that
>seems non-optimal.
>
>Any other recommendations?
>
>Thanks,
>
>R
>
>The testing code and output is pasted below.
>
>---
>
>#include <cstdio>
>#include <cstdlib>
>#include <mpi.h>
>#include <cuda_runtime.h>
>
>
>#  define CUDA_SAFE_CALL( call) {\
>cudaError err = call;\
>if( cudaSuccess != err) {\
>fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",\
>__FILE__, __LINE__, cudaGetErrorString( err) );  \
>exit(EXIT_FAILURE);  \
>} }
>
>int recv(int src) {
>  cudaSetDevice(0);
>  for (int bSize = 1; bSize < 100e6; bSize *= 2) {
>fprintf(stderr, "Recv: %d\n", bSize);
>void* buffer;
>CUDA_SAFE_CALL(cudaMalloc(&buffer, bSize));
>auto world = MPI::COMM_WORLD;
>world.Recv(buffer, bSize, MPI::BYTE, src, 0);
>CUDA_SAFE_CALL(cudaFree(buffer))
>  }
>}
>
>int send(int dst) {
>  cudaSetDevice(2);
>  for (int bSize = 1; bSize < 100e6; bSize *= 2) {
>fprintf(stderr, "Send: %d\n", bSize);
>void* buffer;
>CUDA_SAFE_CALL(cudaMalloc(&buffer, bSize));
>auto world = MPI::COMM_WORLD;
>world.Send(buffer, bSize, MPI::BYTE, dst, 0);
>CUDA_SAFE_CALL(cudaFree(buffer))
>  }
>}
>
>void checkPeerAccess() {
>  fprintf(stderr, "Access capability: gpu -> gpu\n");
>  for (int a = 0; a < 3; ++a) {
>for (int b = a; b < 3; ++b) {
>  if (a == b) { continue; }
>  int res;
>  cudaDeviceCanAccessPeer(&res, a, b);
>  fprintf(stderr, "%d <-> %d: %d\n", a, b, res);
>}
>  }
>}
>
>int main() {
>  MPI::Init_thread(MPI::THREAD_MULTIPLE);
>  if (MPI::COMM_WORLD.Get_rank() == 0) {
>checkPeerAccess();
>recv(1);
>  } else {
>send(0);
>  }
>  MPI::Finalize();
>}
>
>output from running:
>mpirun -mca btl_smcuda_eager_limit 64 -n 2 ./a.out Access capability: gpu ->
>gpu
>0 <-> 1: 1
>0 <-> 2: 0
>1 <-> 2: 0
>Send: 1
>Recv: 1
>Send: 2
>Recv: 2
>Send: 4
>Recv: 4
>Send: 8
>Recv: 8
>Send: 16
>Recv: 16
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   217
>  address: 0x230020
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


