[OMPI devel] Launch windows nodes from linux

2012-01-20 Thread Alex.Burton
Hi developers,

I can see in the code that the part that launches processes on remote
Windows machines is not compiled on other platforms because it uses
COM.

Is there another way of launching processes on Windows from non-Windows
machines?

What would I need to do to write a daemon similar to MPICH2's smpd, which
runs as a Windows service?
It looks like it would only have to handle authentication and launch the
ORTE process.
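
Conceptually, I imagine the daemon would look something like the rough
sketch below. This is purely illustrative and not Open MPI code; the port
number, the shared-secret check, and the orted command line are all guesses
on my part (link against ws2_32):

#include <winsock2.h>
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    WSADATA wsa;
    SOCKET listener, conn;
    struct sockaddr_in addr;
    char token[64];

    WSAStartup(MAKEWORD(2, 2), &wsa);

    listener = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(10010);              /* made-up port */
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 1);

    for (;;) {
        conn = accept(listener, NULL, NULL);
        if (conn == INVALID_SOCKET)
            continue;

        /* trivial "authentication": expect a shared secret from the launcher */
        int n = recv(conn, token, (int)sizeof(token) - 1, 0);
        if (n > 0) {
            token[n] = '\0';
            if (strcmp(token, "expected-secret") == 0) {
                /* launch the ORTE daemon; real arguments would come from mpirun */
                STARTUPINFOA si;
                PROCESS_INFORMATION pi;
                char cmd[] = "orted.exe";
                ZeroMemory(&si, sizeof(si));
                si.cb = sizeof(si);
                if (CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0,
                                   NULL, NULL, &si, &pi)) {
                    CloseHandle(pi.hProcess);
                    CloseHandle(pi.hThread);
                }
            }
        }
        closesocket(conn);
    }
    /* not reached */
}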

I would use MPICH2, but it appears to not work with a heterogeneous
network.

Alex

Alex Burton
Research Engineer NSEC
CSIRO Energy Technology
Box 330 Newcastle NSW 2300
+61 2 49 606 110
alex.bur...@csiro.au



[OMPI devel] 1.4.5rc2 Solaris results [libtool problem]

2012-01-20 Thread Paul H. Hargrove

As promised earlier today, here are results from my Solaris platforms.
Note that there are libtool-related failures below that may be worth 
pursuing.
If necessary, access to most of my machines can be arranged for 
qualified persons.


== GNU compilers with {C,CXX,F77,FC}FLAGS=-mcpu=v9 on SPARCs, and -m64 
on amd64


PASS:
solaris-10 s10_69/sun4u (w/ g77, no FC)
solaris-10 Generic_142901-03/i386 (w/ Sun's f77 and f95, both dated 
April 2009)
solaris-11 snv_151a/amd64 [including ofud, openib and dapl] (w/ 
g77, no FC)


FAIL:
solaris-10 Generic_137111-07/sun4v with default GNU compilers
Using system default gcc, which is actually Sun's gccfss-4.0.4, there 
are assertion failures seen in the atomics in "make check".  I can 
provide details if anybody cares, but I know from past experience that 
support for gcc-style inline asm is marginal in this compiler.


== Sun Studio 12.2 compilers w/ {C,CXX,F77,FC}FLAGS=-m64 on SPARCs and amd64

Both of my SPARC systems appear to have an out-of-date libmtsk.so, which 
both prevents the Sun f77 and f90 compilers from running at all, and 
additionally leads to failures like the following when building OpenMP 
support in VT:
/bin/bash ../../libtool --tag=CXX --mode=link sunCC -xopenmp 
-DVT_OMP  -m64 -xopenmp  -o vtfilter vtfilter-vt_filter.o  
vtfilter-vt_filthandler.o  vtfilter-vt_otfhandler.o  
vtfilter-vt_tracefilter.o ../../util/util.o  -L../../extlib/otf/otflib 
-L../../extlib/otf/otflib/.libs -lotf  -lz -lsocket -lnsl  -lrt -lm 
-lthread
libtool: link: sunCC -xopenmp -DVT_OMP -m64 -xopenmp -o vtfilter 
vtfilter-vt_filter.o vtfilter-vt_filthandler.o 
vtfilter-vt_otfhandler.o vtfilter-vt_tracefilter.o ../../util/util.o  
-L/home/hargrove/OMPI/openmpi-1.4.5rc2-solaris10-sparcT2-ss12u2/BLD/ompi/contrib/vt/vt/extlib/otf/otflib/.libs 
-L/home/hargrove/OMPI/openmpi-1.4.5rc2-solaris10-sparcT2-ss12u2/BLD/ompi/contrib/vt/vt/extlib/otf/otflib 
/home/hargrove/OMPI/openmpi-1.4.5rc2-solaris10-sparcT2-ss12u2/BLD/ompi/contrib/vt/vt/extlib/otf/otflib/.libs/libotf.a 
-lz -lsocket -lnsl -lrt -lm -lthread
CC: Warning: Optimizer level changed from 0 to 3 to support 
parallelized code.

Undefined                       first referenced
 symbol                             in file
__mt_MasterFunction_cxt_            vtfilter-vt_tracefilter.o
ld: fatal: Symbol referencing errors. No output written to vtfilter
*** Error code 2
This is due to a lack of required Solaris patches and is NOT an ompi or vt 
problem to be solved.

However, as a result my two SPARC platforms are configured with
   --disable-mpi-f77 --disable-mpi-f90 
--with-contrib-vt-flags="--disable-omp --disable-hyb"
[It took a bit of work to figure out how to disable just OMP rather than 
VT in its entirety.]
I report this info just to note that my SPARC testing is "narrower" than 
on my x86 and amd64 machines.


The one "real" problem I found appears to be libtool related and 
impacted all 4 platforms:

solaris-10 s10_69/sun4u
solaris-10 Generic_142901-03/i386
solaris-11 snv_151a/amd64 [including ofud, openib and dapl]
solaris-10 Generic_137111-07/sun4v
No problem with "make all" or with "make check", but "make install" 
fails with:

Making install in mpi/cxx
make[2]: Entering directory 
`/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/ompi/mpi/cxx'
make[3]: Entering directory 
`/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/ompi/mpi/cxx'
test -z 
"/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/INST/lib" 
|| /usr/gnu/bin/mkdir -p 
"/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/INST/lib"
 /bin/sh ../../../libtool   --mode=install /usr/bin/ginstall -c  
'libmpi_cxx.la' 
'/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/INST/lib/libmpi_cxx.la'

libtool: install: warning: relinking `libmpi_cxx.la'
libtool: install: (cd 
/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/ompi/mpi/cxx; 
/bin/sh 
/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/libtool  
--tag CXX --mode=relink sunCC -O -DNDEBUG -m64 -version-info 0:1:0 
-export-dynamic -o libmpi_cxx.la -rpath 
/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/INST/lib 
mpicxx.lo intercepts.lo comm.lo datatype.lo win.lo file.lo 
../../../ompi/libmpi.la -lsocket -lnsl -lm -lthread )

mv: cannot stat `libmpi_cxx.so.0.0.1': No such file or directory
libtool: install: error: relink `libmpi_cxx.la' with the above command 
before installing it

make[3]: *** [install-libLTLIBRARIES] Error 1
make[3]: Leaving directory 
`/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/ompi/mpi/cxx'

make[2]: *** [install-am] Error 2
make[2]: Leaving directory 
`/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/ompi/mpi/cxx'

make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory 
`/home/phargrov/OMPI/openmpi-1.4.5rc2-solaris11-x64-ib-suncc/BLD/ompi'

make: *** [install-recursive] Error 1

No such problem was seen w/ the GNU compilers on the same 4 platforms.

[OMPI devel] Violating standard in MPI_Close_port

2012-01-20 Thread Y.MATSUMOTO
Dear All,

My next question is about "MPI_Close_port".
According to the MPI-2.2 standard, 
the "port_name" argument of
MPI_Close_port() is marked as 'IN'.
But, in Open MPI (both trunk and 1.4.x), the content of
"port_name" is updated in MPI_Close_port().
It seems to violate the MPI standard.

The following is the suspicious part.
---ompi/mca/dpm/orte/dpm_orte.c---
919 static int close_port(char *port_name)
920 {
921 /* the port name is a pointer to an array - DO NOT FREE IT! */
922 memset(port_name, 0, MPI_MAX_PORT_NAME);
923 return OMPI_SUCCESS;
924 }
---ompi/mca/dpm/orte/dpm_orte.c---

This memset effectively makes "port_name" an INOUT argument.
Could you tell me why this memset is needed?
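
For reference, the behaviour can be seen with a small test like the one
below (just an illustrative sketch, not something from our test suite);
per MPI-2.2 the buffer should compare equal before and after the close:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME], saved[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);
    memcpy(saved, port, MPI_MAX_PORT_NAME);   /* keep a copy of the IN argument */

    MPI_Close_port(port);

    if (memcmp(saved, port, MPI_MAX_PORT_NAME) != 0)
        printf("port_name was modified by MPI_Close_port (treated as INOUT)\n");
    else
        printf("port_name unchanged (IN semantics respected)\n");

    MPI_Finalize();
    return 0;
}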

Best regards,
Yuki MATSUMOTO
MPI development team,
Fujitsu



Re: [OMPI devel] 1.4.5rc2 now released

2012-01-20 Thread TERRY DONTJE



On 1/19/2012 5:22 PM, Paul H. Hargrove wrote:
Minor documentation nit, which might apply to the 1.5 branch as well 
(didn't check).


README says:

- Open MPI does not support the Sparc v8 CPU target, which is the
  default on Sun Solaris.  The v8plus (32 bit) or v9 (64 bit)
  targets must be used to build Open MPI on Solaris.  This can be
  done by including a flag in CFLAGS, CXXFLAGS, FFLAGS, and FCFLAGS,
  -xarch=v8plus for the Sun compilers, -mcpu=v9 for GCC.


However, following that instruction w/ Sun Studio 12 Update 2 yields:

cc: Warning: -xarch=v8plus is deprecated, use -m32 -xarch=sparc instead

for every single compilation.

I vaguely recall noting this once before, perhaps 2 years or so ago.

Additionally, it appears that the "Sun" example is for the 32-bit ABI 
and the "GCC" example for the 64-bit ABI.
Actually I think the whole comment is incorrect (at least from Solaris 
Studio 12u2 on) in that the default is no longer the SPARC v8 target and 
that one can actually specify just -m32 and -m64 without the -xarch 
option.  So I wonder if we should just strike that whole block of text 
from the README.


Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com





Re: [OMPI devel] Violating standard in MPI_Close_port

2012-01-20 Thread Ralph Castain
No reason for doing so comes to mind - I suspect the original author probably 
started out doing a "free", then discovered that the overlying MPI code was 
passing in an array and so just converted it to a memset. Either way, it really 
should be the responsibility of the user's code to deal with the memory.

I'll remove it. Thanks for pointing it out!
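
For clarity, the change amounts to just dropping that memset (a sketch of
the intended result, not the actual commit):

static int close_port(char *port_name)
{
    /* the port name points into the caller's array - do not free it,
     * and do not modify it either, so that it stays a pure IN argument */
    return OMPI_SUCCESS;
}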

On Jan 20, 2012, at 1:28 AM, Y.MATSUMOTO wrote:

> Dear All,
> 
> My next question is about "MPI_Close_port".
> According to the MPI-2.2 standard, 
> the "port_name" argument of
> MPI_Close_port() is marked as 'IN'.
> But, in Open MPI (both trunk and 1.4.x), the content of
> "port_name" is updated in MPI_Close_port().
> It seems to violate the MPI standard.
> 
> The following is the suspicious part.
> ---ompi/mca/dpm/orte/dpm_orte.c---
>919 static int close_port(char *port_name)
>920 {
>921 /* the port name is a pointer to an array - DO NOT FREE IT! */
>922 memset(port_name, 0, MPI_MAX_PORT_NAME);
>923 return OMPI_SUCCESS;
>924 }
> ---ompi/mca/dpm/orte/dpm_orte.c---
> 
> This memset effectively makes "port_name" an INOUT argument.
> Could you tell me why this memset is needed?
> 
> Best regards,
> Yuki MATSUMOTO
> MPI development team,
> Fujitsu
> 




Re: [OMPI devel] Launch windows nodes from linux

2012-01-20 Thread Ralph Castain
Guess I'm confused. The launcher is running on a Linux machine, so it has to 
use a Linux service to launch the remote daemon. Can you use ssh to launch the 
daemons onto the Windows machines? In other words, can you have the Windows 
machine support an ssh connection?

I did a quick search and found a number of options for supporting ssh 
connections to Windows. Here is one article that describes how to do it:

http://www.windowsnetworking.com/articles_tutorials/install-ssh-server-windows-server-2008.html

Once the daemon is started on the Windows machine, it will automatically select 
the Windows options for starting its local procs - so that shouldn't be an 
issue. The issue will be figuring out a way to get the daemon started.


On Jan 19, 2012, at 10:25 PM, Alex Burton wrote:

> Hi developers,
>  
> I can see in the code that the part that launches processes on remote
> Windows machines is not compiled on other platforms because it uses
> COM.
>  
> Is there another way of launching processes on Windows from non-Windows
> machines?
>  
> What would I need to do to write a daemon similar to MPICH2's smpd, which
> runs as a Windows service?
> It looks like it would only have to handle authentication and launch the
> ORTE process.
>  
> I would use MPICH2, but it appears to not work with a heterogeneous
> network.
>  
> Alex
>  
> Alex Burton
> Research Engineer NSEC
> CSIRO Energy Technology
> Box 330 Newcastle NSW 2300
> +61 2 49 606 110
> alex.bur...@csiro.au
>  



Re: [OMPI devel] GPUDirect v1 issues

2012-01-20 Thread Sebastian Rinke
With 

* MLNX OFED stack tailored for GPUDirect
* RHEL + kernel patch 
* MVAPICH2 

it is possible to monitor GPUDirect v1 activities by means of observing changes 
to values in

* /sys/module/ib_core/parameters/gpu_direct_pages
* /sys/module/ib_core/parameters/gpu_direct_shares

After setting CUDA_NIC_INTEROP=1, these values no longer change.
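
For reference, the kind of before/after check I have been doing boils down
to something like this rough sketch (it assumes each parameter file contains
a single integer, and the files only exist with the GPUDirect-patched
ib_core module):

#include <stdio.h>

static long read_param(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");

    if (f != NULL) {
        if (fscanf(f, "%ld", &value) != 1)
            value = -1;
        fclose(f);
    }
    return value;
}

int main(void)
{
    /* sample these before and after an MPI transfer and compare */
    printf("gpu_direct_pages  = %ld\n",
           read_param("/sys/module/ib_core/parameters/gpu_direct_pages"));
    printf("gpu_direct_shares = %ld\n",
           read_param("/sys/module/ib_core/parameters/gpu_direct_shares"));
    return 0;
}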

Is there a different way now to monitor if GPUDirect actually works?

Sebastian.

On Jan 18, 2012, at 5:06 PM, Kenneth Lloyd wrote:

> It is documented in 
> http://developer.download.nvidia.com/compute/cuda/4_0/docs/GPUDirect_Technology_Overview.pdf
> set CUDA_NIC_INTEROP=1
>  
>  
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of Sebastian Rinke
> Sent: Wednesday, January 18, 2012 8:15 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>  
> Setting the environment variable fixed the problem for Open MPI with CUDA 
> 4.0. Thanks!
>  
> However, I'm wondering why this is not documented in the NVIDIA GPUDirect 
> package.
>  
> Sebastian.
>  
> On Jan 18, 2012, at 1:28 AM, Rolf vandeVaart wrote:
> 
> 
> Yes, the step outlined in your second bullet is no longer necessary. 
>  
> Rolf
>  
>  
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 5:22 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>  
> Thank you very much. I will try setting the environment variable and if 
> required also use the 4.1 RC2 version.
> 
> To clarify things a little bit for me, to set up my machine with GPUDirect v1 
> I did the following:
> 
> * Install RHEL 5.4
> * Use the kernel with GPUDirect support
> * Use the MLNX OFED stack with GPUDirect support
> * Install the CUDA developer driver
> 
> Does using CUDA  >= 4.0  make one of the above steps  redundant?
> 
> I.e., RHEL or different kernel or MLNX OFED stack with GPUDirect support is  
> not needed any more?
> 
> Sebastian.
> 
> Rolf vandeVaart wrote:
> I ran your test case against Open MPI 1.4.2 and CUDA 4.1 RC2 and it worked 
> fine.  I do not have a machine right now where I can load CUDA 4.0 drivers.
> Any chance you can try CUDA 4.1 RC2?  There were some improvements in the 
> support (you do not need to set an environment variable for one)
>  http://developer.nvidia.com/cuda-toolkit-41
>  
> There is also a chance that setting the environment variable as outlined in 
> this link may help you.
> http://forums.nvidia.com/index.php?showtopic=200629
>  
> However, I cannot explain why MVAPICH would work and Open MPI would not.  
>  
> Rolf
>  
>   
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
> On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 12:08 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>  
> I use CUDA 4.0 with MVAPICH2 1.5.1p1 and Open MPI 1.4.2.
>  
> Attached you find a little test case which is based on the GPUDirect v1 test
> case (mpi_pinned.c).
> In that program the sender splits a message into chunks and sends them
> separately to the receiver which posts the corresponding recvs. It is a kind of
> pipelining.
>  
> In mpi_pinned.c:141 the offsets into the recv buffer are set.
> For the correct offsets, i.e. increasing them, it blocks with Open MPI.
>  
> Using line 142 instead (offset = 0) works.
>  
> The tarball attached contains a Makefile where you will have to adjust
>  
> * CUDA_INC_DIR
> * CUDA_LIB_DIR
>  
> Sebastian
>  
> On Jan 17, 2012, at 4:16 PM, Kenneth A. Lloyd wrote:
>  
> 
> Also, which version of MVAPICH2 did you use?
>  
> I've been pouring over Rolf's OpenMPI CUDA RDMA 3 (using CUDA 4.1 r2)
> vis MVAPICH-GPU on a small 3 node cluster. These are wickedly interesting.
>  
> Ken
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
> On Behalf Of Rolf vandeVaart
> Sent: Tuesday, January 17, 2012 7:54 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>  
> I am not aware of any issues.  Can you send me a test program and I
> can try it out?
> Which version of CUDA are you using?
>  
> Rolf
>  
>   
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
> On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 8:50 AM
> To: Open MPI Developers
> Subject: [OMPI devel] GPUDirect v1 issues
>  
> Dear all,
>  
> I'm using GPUDirect v1 with Open MPI 1.4.3 and I am seeing blocking
> MPI_SEND/RECV calls block forever.
>  
> For two subsequent MPI_RECVs, it hangs if the recv buffer pointer of
> the second recv points somewhere other than the beginning of the
> recv buffer (previously allocated with cudaMallocHost()).
>  
> I tried the same with MVAPICH2 and did not see the problem.
>  
> Does anybody know about issues with GPUDirect v1 using Open MPI?
>  
> Thanks for 
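
For context, the chunked/pipelined pattern described in the quoted messages
above (the attached mpi_pinned.c is not reproduced here) boils down to
roughly the following sketch; the chunk count, chunk size, and tags are
made up:

#include <mpi.h>
#include <cuda_runtime.h>

#define NCHUNK     4
#define CHUNK_SIZE (1 << 20)                 /* 1 MiB per chunk (assumed) */

int main(int argc, char **argv)
{
    int rank, i;
    char *buf = NULL;
    size_t total = (size_t)NCHUNK * CHUNK_SIZE;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* pinned host memory, as in the GPUDirect v1 test case */
    cudaMallocHost((void **)&buf, total);

    if (rank == 0) {
        for (i = 0; i < NCHUNK; i++)
            MPI_Send(buf + (size_t)i * CHUNK_SIZE, CHUNK_SIZE, MPI_CHAR,
                     1, i, MPI_COMM_WORLD);
    } else if (rank == 1) {
        for (i = 0; i < NCHUNK; i++) {
            /* the "increasing offset" case that reportedly hangs;
             * using offset = 0 for every chunk reportedly works */
            size_t offset = (size_t)i * CHUNK_SIZE;
            MPI_Recv(buf + offset, CHUNK_SIZE, MPI_CHAR,
                     0, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    cudaFreeHost(buf);
    MPI_Finalize();
    return 0;
}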

Re: [OMPI devel] GPUDirect v1 issues

2012-01-20 Thread Rolf vandeVaart
You can tell it is working because your program does not hang anymore :)  
Otherwise, there is not a way that I am aware of.

Rolf

PS: And I assume you mean Open MPI under your third bullet below.

From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf 
Of Sebastian Rinke
Sent: Friday, January 20, 2012 12:21 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] GPUDirect v1 issues

With

* MLNX OFED stack tailored for GPUDirect
* RHEL + kernel patch
* MVAPICH2

it is possible to monitor GPUDirect v1 activities by means of observing changes 
to values in

* /sys/module/ib_core/parameters/gpu_direct_pages
* /sys/module/ib_core/parameters/gpu_direct_shares

After setting CUDA_NIC_INTEROP=1, these values no longer change.

Is there a different way now to monitor if GPUDirect actually works?

Sebastian.
