[OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

2014-08-26 Thread Christoph Winter
Hey all,

to test the performance of my application, I duplicated the call to the
function that issues the computation on the two GPUs five times. During
the 4th and 5th runs, however, the algorithm yields different results
(9 clusters instead of 20):

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters
identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 *820* *9*
121.* 1000 *820* *9*

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both
compiled with CUDA awareness. The CUDA Toolkit version is 6.0.
Both GPUs are attached to a single CPU, so CUDA IPC can be used
(no QPI link has to be traversed).
Running the application with "mpirun -np 2 --mca
btl_smcuda_cuda_ipc_verbose 100" shows that IPC is indeed used.

I tracked the problem down to an MPI_Allgather that seems not to work:
the first GPU identifies 9 clusters and the second GPU identifies 11
clusters (20 clusters in total). Debugging the application shows that
all clusters are identified correctly; only the exchange of the
identified clusters fails. Each MPI process stores its identified
clusters in a buffer dec that holds columns entries per rank (each
rank's own block at offset columns * rank), and both processes exchange
that buffer using MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
d_dec, columns, MPI_DOUBLE, communicator);

I later discovered that if I introduce a temporary host buffer that
receives the results of both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
MPI_Allgather(d_dec + columns*comm.rank(), columns, MPI_DOUBLE,
              &h_dec[0], columns, MPI_DOUBLE, communicator);
dec = h_dec; //copy results back from host to device

This led me to the conclusion that Open MPI's CUDA IPC path causes the
problem (a synchronisation and/or fail-silent error), and indeed,
disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca
btl_smcuda_use_cuda_ipc_same_gpu 0 -np 2 ./double_test
../data/similarities2.double.-300 ex.2.double.2.gpus 1000 1000 0.9

will calculate correct results:

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters
identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 *807 20*
121.* 1000 *807 20*

Surprisingly, the wrong results _always_ occur during the 4th and 5th
runs. Is there a way to force synchronisation (I tried MPI_Barrier()
without success)? Has anybody observed similar problems?
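For clarity, what I would like to force is device-side completion before
the collective. A minimal sketch of that idea, assuming the kernels that
write dec run on the default stream (names as in the snippets above):

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
// wait until all pending GPU work that writes dec has finished before
// handing the device pointer to the CUDA-aware MPI_Allgather
cudaDeviceSynchronize();   // or check the returned cudaError_t
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);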

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph


[OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Syed Ahsan Ali
I have problems compiling openmpi-1.8.1 on a Linux machine. Kindly see
the attached logs.


configure.bz2
Description: BZip2 compressed data


Re: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

2014-08-26 Thread Rolf vandeVaart
Hi Christoph:
I will try and reproduce this issue and will let you know what I find.  There 
may be an issue with CUDA IPC support with certain traffic patterns.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem



Re: [OMPI users] long initialization

2014-08-26 Thread Timur Ismagilov

Hello!
Here are my timing results:
$time mpirun -n 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 21, 
2014 (nightly snapshot tarball), 146)
real 1m3.985s
user 0m0.031s
sys 0m0.083s


Fri, 22 Aug 2014 07:43:03 -0700 от Ralph Castain :
>I'm also puzzled by your timing statement - I can't replicate it:
>
>07:41:43  $ time mpirun -n 1 ./hello_c
>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer copy, 
>125)
>
>real 0m0.547s
>user 0m0.043s
>sys 0m0.046s
>
>The entire thing ran in 0.5 seconds
>
>
>On Aug 22, 2014, at 6:33 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>Hi,
>>The default delimiter is ";" . You can change delimiter with 
>>mca_base_env_list_delimiter.
>>
>>
>>
>>On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>wrote:
>>>Hello!
>>>If i use latest night snapshot:
>>>$ ompi_info -V
>>>Open MPI v1.9a1r32570
>>>*  In programm hello_c initialization takes ~1 min
>>>In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less)
>>>*  if i use 
>>>$mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
>>>--map-by slot:pe=8 -np 1 ./hello_c
>>>i got error 
>>>config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>>>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>>but with -x all works fine (but with warn)
>>>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>>WARNING: The mechanism by which environment variables are explicitly
>>>..
>>>..
>>>..
>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>>21, 2014 (nightly snapshot tarball), 146)
>>>Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain < r...@open-mpi.org >:
Not sure I understand. The problem has been fixed in both the trunk and the 
1.8 branch now, so you should be able to work with either of those nightly 
builds.

On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>Have i I any opportunity to run mpi jobs?
>
>
>Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain < r...@open-mpi.org >:
>>yes, i know - it is cmr'd
>>
>>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > 
>>wrote:
>>>btw, we get same error in v1.8 branch as well.
>>>
>>>
>>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain   < r...@open-mpi.org >   
>>>wrote:
It was not yet fixed - but should be now.

On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > 
wrote:
>Hello!
>
>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still 
>have the problem
>
>a)
>$ mpirun  -np 1 ./hello_c
>--
>An ORTE daemon has unexpectedly failed after launch and before
>communicating back to mpirun. This could be caused by a number
>of factors, including an inability to create a connection back
>to mpirun due to a lack of common network interfaces and/or no
>route found between them. Please check network connectivity
>(including firewalls and network routing requirements).
>--
>b)
>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>--
>An ORTE daemon has unexpectedly failed after launch and before
>communicating back to mpirun. This could be caused by a number
>of factors, including an inability to create a connection back
>to mpirun due to a lack of common network interfaces and/or no
>route found between them. Please check network connectivity
>(including firewalls and network routing requirements).
>--
>
>c)
>
>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca 
>plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 
>-np 1 ./hello_c
>[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>[compiler-2:14673] mca:base:select:( plm) Query of component 
>[isolated] set priority to 0
>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>priority to 10
>[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] 
>set pr

Re: [OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Ralph Castain
Looks like there is something wrong with your gfortran install:

*** Fortran compiler
checking for gfortran... gfortran
checking whether we are using the GNU Fortran compiler... yes
checking whether gfortran accepts -g... yes
checking whether ln -s works... yes
checking if Fortran compiler works... no
**
* It appears that your Fortran compiler is unable to produce working
* executables.  A simple test application failed to properly
* execute.  Note that this is likely not a problem with Open MPI,
* but a problem with the local compiler installation.  More
* information (including exactly what command was given to the
* compiler and what error resulted when the command was executed) is
* available in the config.log file in the Open MPI build directory.
**
configure: error: Could not run a simple Fortran program.  Aborting.


FWIW: I can compile and run on my CentOS6.5 system just fine. I have gfortran 
4.4.7 installed on it

On Aug 26, 2014, at 2:59 AM, Syed Ahsan Ali  wrote:

> 
> 
> I have problems in compilation of openmpi-1.8.1 on Linux machine. Kindly see 
> the logs attached.
>  
>  
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25147.php



Re: [OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Jeff Squyres (jsquyres)
Just to elaborate: as the error message implies, this error message was put 
there specifically to ensure that the Fortran compiler works before continuing 
any further.  If the Fortran compiler is busted, configure exits with this help 
message.

You can either fix your Fortran compiler, or use --disable-mpi-fortran to 
disable all Fortran support from Open MPI (and therefore this "test whether the 
Fortran compiler works" test will be skipped).
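(I.e., roughly:

./configure --disable-mpi-fortran ...

plus whatever other options you normally pass to configure.)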

Here's the specific log section showing the failure:

-
configure:32389: checking if Fortran compiler works
configure:32418: gfortran -o conftest  conftest.f  >&5
configure:32418: $? = 0
configure:32418: ./conftest
./conftest: error while loading shared libraries: libquadmath.so.0: wrong ELF 
class: ELFCLASS32
configure:32418: $? = 127
configure: program exited with status 127
configure: failed program was:
|   program main
|
|   end
configure:32434: result: no
configure:32448: error: Could not run a simple Fortran program.  Aborting.
-
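(As an aside, running "file" on the candidate libquadmath.so.0 libraries
shows their ELF class, i.e. whether they are 32-bit or 64-bit objects; the
paths below are only examples:

file /usr/lib/libquadmath.so.0 /usr/lib64/libquadmath.so.0 )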


On Aug 26, 2014, at 10:10 AM, Ralph Castain  wrote:

> Looks like there is something wrong with your gfortran install:
> 
> *** Fortran compiler
> checking for gfortran... gfortran
> checking whether we are using the GNU Fortran compiler... yes
> checking whether gfortran accepts -g... yes
> checking whether ln -s works... yes
> checking if Fortran compiler works... no
> **
> * It appears that your Fortran compiler is unable to produce working
> * executables.  A simple test application failed to properly
> * execute.  Note that this is likely not a problem with Open MPI,
> * but a problem with the local compiler installation.  More
> * information (including exactly what command was given to the
> * compiler and what error resulted when the command was executed) is
> * available in the config.log file in the Open MPI build directory.
> **
> configure: error: Could not run a simple Fortran program.  Aborting.
> 
> 
> FWIW: I can compile and run on my CentOS6.5 system just fine. I have gfortran 
> 4.4.7 installed on it
> 
> On Aug 26, 2014, at 2:59 AM, Syed Ahsan Ali  wrote:
> 
>> 
>> 
>> I have problems in compilation of openmpi-1.8.1 on Linux machine. Kindly see 
>> the logs attached.
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/08/25147.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25150.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] OpenMPI Remote Execution Problem (Application does not start)

2014-08-26 Thread Benjamin Giehle
Hello,

I have a problem running my MPI application on a remote machine.
If I start the application via ssh, everything works just fine, but if I use
mpirun the application won't start.
Starting the application on the local machine with mpirun works as well.

ssh myhost ./myapp   <- works
mpirun -np 2 ./myapp  <- works
mpirun -np 2 --host myhost  ./myapp<- does not work

I have already configured ssh so that I don't have to enter a password.
I am using Open MPI version 1.8.1 on both machines.
I uploaded all required files; I hope you can help me...

Regards

Benjamin Giehle

config.log.bz2
Description: config.log.bz2
[superuser@CUDAServer ~]$ mpirun -d -np 2 --host 192.168.54.137 ./MAFMpi > 
/mnt/nas/cudaPraktikant/mpirun.txt
[CUDAServer:24592] procdir: 
/tmp/openmpi-sessions-superuser@CUDAServer_0/61346/0/0
[CUDAServer:24592] jobdir: /tmp/openmpi-sessions-superuser@CUDAServer_0/61346/0
[CUDAServer:24592] top: openmpi-sessions-superuser@CUDAServer_0
[CUDAServer:24592] tmp: /tmp
[CUDAServer:24592] sess_dir_cleanup: job session dir does not exist
[CUDAServer:24592] procdir: 
/tmp/openmpi-sessions-superuser@CUDAServer_0/61346/0/0
[CUDAServer:24592] jobdir: /tmp/openmpi-sessions-superuser@CUDAServer_0/61346/0
[CUDAServer:24592] top: openmpi-sessions-superuser@CUDAServer_0
[CUDAServer:24592] tmp: /tmp
[localhost.localdomain:08464] procdir: 
/tmp/openmpi-sessions-superuser@localhost_0/61346/0/1
[localhost.localdomain:08464] jobdir: 
/tmp/openmpi-sessions-superuser@localhost_0/61346/0
[localhost.localdomain:08464] top: openmpi-sessions-superuser@localhost_0
[localhost.localdomain:08464] tmp: /tmp
[localhost.localdomain:08464] sess_dir_cleanup: job session dir does not exist
[localhost.localdomain:08464] procdir: 
/tmp/openmpi-sessions-superuser@localhost_0/61346/0/1
[localhost.localdomain:08464] jobdir: 
/tmp/openmpi-sessions-superuser@localhost_0/61346/0
[localhost.localdomain:08464] top: openmpi-sessions-superuser@localhost_0
[localhost.localdomain:08464] tmp: /tmp
[localhost.localdomain:08464] sess_dir_cleanup: job session dir does not exist
exiting with status 1
[CUDAServer:24592] sess_dir_cleanup: job session dir does not exist
exiting with status 1



ompiinfo.txt.bz2
Description: ompiinfo.txt.bz2


ompiparseable.txt.bz2
Description: ompiparseable.txt.bz2


Re: [OMPI users] OpenMPI Remote Execution Problem (Application does not start)

2014-08-26 Thread Ralph Castain
Add --enable-debug to your configure, then re-run the --host test with
"--leave-session-attached -mca plm_base_verbose 5 -mca oob_base_verbose 5"
added, and let's see what's going on.
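In other words, roughly (reusing your command line from above):

mpirun -np 2 --host myhost --leave-session-attached \
    -mca plm_base_verbose 5 -mca oob_base_verbose 5 ./myapp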


On Aug 26, 2014, at 7:31 AM, Benjamin Giehle  wrote:

> Hello,
> 
> i have a problem with running my mpi application on a remote machine.
> If I start the application via ssh everything works just fine, but if I use 
> mpirun the application won't start.
> If I start the application on the local machine with mpi it works too.
> 
> ssh myhost ./myapp   <- works
> mpirun -np 2 ./myapp  <- works
> mpirun -np 2 --host myhost  ./myapp<- does not work
> 
> I already configured ssh, so that I don't have to enter a password.
> I am using the OpenMPI Version 1.8.1 on both machines.
> I uploaded all required files, I hope you can help me...
> 
> Regards
> 
> Benjamin 
> Giehle___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25152.php



Re: [OMPI users] long initialization

2014-08-26 Thread Timur Ismagilov

I'm using slurm 2.5.6

$salloc -N8 --exclusive -J ompi -p test
$ srun hostname
node1-128-21
node1-128-24
node1-128-22
node1-128-26
node1-128-27
node1-128-20
node1-128-25
node1-128-23
$ time mpirun -np 1 --host node1-128-21 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 21, 
2014 (nightly snapshot tarball), 146)
real 1m3.932s
user 0m0.035s
sys 0m0.072s


Tue, 26 Aug 2014 07:03:58 -0700 от Ralph Castain :
>hmmmwhat is your allocation like? do you have a large hostfile, for 
>example?
>
>if you add a --host argument that contains just the local host, what is the 
>time for that scenario?
>
>On Aug 26, 2014, at 6:27 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Hello!
>>Here is my time results:
>>$time mpirun -n 1 ./hello_c
>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>21, 2014 (nightly snapshot tarball), 146)
>>real 1m3.985s
>>user 0m0.031s
>>sys 0m0.083s
>>
>>
>>Fri, 22 Aug 2014 07:43:03 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>I'm also puzzled by your timing statement - I can't replicate it:
>>>
>>>07:41:43    $ time mpirun -n 1 ./hello_c
>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>>>Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer 
>>>copy, 125)
>>>
>>>real 0m0.547s
>>>user 0m0.043s
>>>sys 0m0.046s
>>>
>>>The entire thing ran in 0.5 seconds
>>>
>>>
>>>On Aug 22, 2014, at 6:33 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
Hi,
The default delimiter is ";" . You can change delimiter with 
mca_base_env_list_delimiter.



On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov   < tismagi...@mail.ru >   
wrote:
>Hello!
>If i use latest night snapshot:
>$ ompi_info -V
>Open MPI v1.9a1r32570
>*  In programm hello_c initialization takes ~1 min
>In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less)
>*  if i use 
>$mpirun  --mca mca_base_env_list 
>'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 
>./hello_c
>i got error 
>config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>but with -x all works fine (but with warn)
>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>WARNING: The mechanism by which environment variables are explicitly
>..
>..
>..
>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>21, 2014 (nightly snapshot tarball), 146)
>Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain < r...@open-mpi.org >:
>>Not sure I understand. The problem has been fixed in both the trunk and 
>>the 1.8 branch now, so you should be able to work with either of those 
>>nightly builds.
>>
>>On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > 
>>wrote:
>>>Have i I any opportunity to run mpi jobs?
>>>
>>>
>>>Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain < r...@open-mpi.org >:
yes, i know - it is cmr'd

On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > 
wrote:
>btw, we get same error in v1.8 branch as well.
>
>
>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain   < r...@open-mpi.org > 
>  wrote:
>>It was not yet fixed - but should be now.
>>
>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > 
>>wrote:
>>>Hello!
>>>
>>>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i 
>>>still have the problem
>>>
>>>a)
>>>$ mpirun  -np 1 ./hello_c
>>>--
>>>An ORTE daemon has unexpectedly failed after launch and before
>>>communicating back to mpirun. This could be caused by a number
>>>of factors, including an inability to create a connection back
>>>to mpirun due to a lack of common network interfaces and/or no
>>>route found between them. Please check network connectivity
>>>(including firewalls and network routing requirements).
>>>--
>>>b)
>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>--
>>>An ORTE daemon has unexpectedly failed after launch and before
>>>communicating back to mpirun. This could be caused by a number
>>>of factors, including an inability to create a connection back
>>>to mpirun du

Re: [OMPI users] A daemon on node cl231 failed to start as expected

2014-08-26 Thread Pengcheng Wang
Hi Reuti,

Thanks a lot for your help.

The 'Openmp' PE in our clusters has the allocation rule 'pe_slots', but I
guess I can only use a limited number of slots for my job under this PE...

The command 'qacct -j jobID' gives the information below. Exit status 137
corresponds to the job being killed with SIGKILL (128 + 9), so the job
apparently exceeded its memory allocation. After setting a larger h_vmem
(5G), it works now.

$ qacct -j jobID
...
failed       100 : assumedly after job
exit_status  137
...
maxvmem      4.003G
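(That just means requesting the PE together with a larger per-slot memory
limit at submission time, something along the lines of the following; the
exact PE name and limit depend on the cluster:

qsub -pe Openmp 8 -l h_vmem=5G job.sh )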

However, in this case the number of slots my job can use is still limited.
For example, in one cluster the job runs for a few seconds with 10 slots;
then the job state (qstat) becomes 'dr' and the job is deleted without any
error message. In another cluster, the error message below appears if I
request more than 8 slots.





[cl093:30366] mca_btl_mx_init: mx_open_endpoint() failed with status=20
--
[0,3,0]: Myrinet/MX on host cl093 was unable to find any endpoints.
Another transport will be used instead, although this may result in
lower performance.

Anyway, it works with 8 slots for now, especially when those 8 slots happen
to be on the same machine, which allows a larger virtual memory limit. It
would be better if it could run with more slots to save computation time.

Best regards,
Pengcheng


On Mon, Aug 25, 2014 at 1:00 PM,  wrote:

>
> Today's Topics:
>
>1. Re: Running a hybrid MPI+openMP program (Reuti)
>2. Re: A daemon on node cl231 failed to start as expected
>   (Pengcheng) (Pengcheng Wang)
>3. Re: A daemon on node cl231 failed to start as expected
>   (Pengcheng) (Reuti)
>
>
> --
>
> Message: 1
> Date: Mon, 25 Aug 2014 11:51:35 +0200
> From: Reuti 
> To: Open MPI Users 
> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> Message-ID:
> <9eae85f0-5479-45af-a8f1-14519216b...@staff.uni-marburg.de>
> Content-Type: text/plain; charset=us-ascii
>
> Am 21.08.2014 um 16:50 schrieb Reuti:
>
> > Am 21.08.2014 um 16:00 schrieb Ralph Castain:
> >
> >>
> >> On Aug 21, 2014, at 6:54 AM, Reuti  wrote:
> >>
> >>> Am 21.08.2014 um 15:45 schrieb Ralph Castain:
> >>>
>  On Aug 21, 2014, at 2:51 AM, Reuti 
> wrote:
> 
> > Am 20.08.2014 um 23:16 schrieb Ralph Castain:
> >
> >>
> >> On Aug 20, 2014, at 11:16 AM, Reuti 
> wrote:
> >>
> >>> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
> >>>
> > 
> > Aha, this is quite interesting - how do you do this: scanning
> the /proc//status or alike? What happens if you don't find enough free
> cores as they are used up by other applications already?
> >
> 
>  Remember, when you use mpirun to launch, we launch our own
> daemons using the native launcher (e.g., qsub). So the external RM will
> bind our daemons to the specified cores on each node. We use hwloc to
> determine what cores our daemons are bound to, and then bind our own child
> processes to cores within that range.
> >>>
> >>> Thx for reminding me of this. Indeed, I mixed up two different
> aspects in this discussion.
> >>>
> >>> a) What will happen in case no binding was done by the RM (hence
> Open MPI could use all cores) and two Open MPI jobs (or something
> completely different besides one Open MPI job) are running on the same node
> (due to the Tight Integration with two different Open MPI directories in
> /tmp and two `orted`, unique for each job)? Will the second Open MPI job
> know what the first Open MPI job used up already? Or will both use the same
> set of cores as "-bind-to none" can't be set in the given `mpiexec` command
> because of "-map-by slot:pe=$OMP_NUM_THREADS" was used - which triggers
> "-bind-to core" indispensable and can't be switched off? I see the same
> cores being used for both jobs.
> >>
> >> Yeah, each mpirun executes completely independently of the other,
> so they have no idea what the other is doing. So the cores will be
> overloaded. Multi-pe's requires bind-to-core otherwise there is no way to
> implement the request
> >
> > Yep, and so it's no option in a mixed cluster. Why would it hurt to
> allow "-bind-to none" here?
> 
>  Guess I'm confused here - what does pe=N mean if we bind-to none?? If
> you are running on a mixed cluster and don't want binding, then just say
> bind-to none and leave the pe argument out entire

Re: [OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Syed Ahsan Ali
Hi Jeff and Ralph

I could have figured out the issue myself, but the problem was that I could
not find the exact error line in config.log that you identified. The shared
library libquadmath is present in the lib64 directory, so adding that path
to the environment removed the error.
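(That is, something along these lines before re-running configure; the exact
path depends on where the 64-bit libquadmath lives on the system:

export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH )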

Thank you guys for helping me :)



On Tue, Aug 26, 2014 at 7:29 PM, Jeff Squyres (jsquyres)  wrote:

> Just to elaborate: as the error message implies, this error message was
> put there specifically to ensure that the Fortran compiler works before
> continuing any further.  If the Fortran compiler is busted, configure exits
> with this help message.
>
> You can either fix your Fortran compiler, or use --disable-mpi-fortran to
> disable all Fortran support from Open MPI (and therefore this "test whether
> the Fortran compiler works" test will be skipped).
>
> Here's the specific log section showing the failure:
>
> -
> configure:32389: checking if Fortran compiler works
> configure:32418: gfortran -o conftestconftest.f  >&5
> configure:32418: $? = 0
> configure:32418: ./conftest
> ./conftest: error while loading shared libraries: libquadmath.so.0: wrong
> ELF class: ELFCLASS32
> configure:32418: $? = 127
> configure: program exited with status 127
> configure: failed program was:
> |   program main
> |
> |   end
> configure:32434: result: no
> configure:32448: error: Could not run a simple Fortran program.  Aborting.
> -
>
>
> On Aug 26, 2014, at 10:10 AM, Ralph Castain  wrote:
>
> > Looks like there is something wrong with your gfortran install:
> >
> > *** Fortran compiler
> > checking for gfortran... gfortran
> > checking whether we are using the GNU Fortran compiler... yes
> > checking whether gfortran accepts -g... yes
> > checking whether ln -s works... yes
> > checking if Fortran compiler works... no
> > **
> > * It appears that your Fortran compiler is unable to produce working
> > * executables.  A simple test application failed to properly
> > * execute.  Note that this is likely not a problem with Open MPI,
> > * but a problem with the local compiler installation.  More
> > * information (including exactly what command was given to the
> > * compiler and what error resulted when the command was executed) is
> > * available in the config.log file in the Open MPI build directory.
> > **
> > configure: error: Could not run a simple Fortran program.  Aborting.
> >
> >
> > FWIW: I can compile and run on my CentOS6.5 system just fine. I have
> gfortran 4.4.7 installed on it
> >
> > On Aug 26, 2014, at 2:59 AM, Syed Ahsan Ali 
> wrote:
> >
> >>
> >>
> >> I have problems in compilation of openmpi-1.8.1 on Linux machine.
> Kindly see the logs attached.
> >>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25147.php
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25150.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25151.php
>



-- 
Syed Ahsan Ali Bokhari
Electronic Engineer (EE)

Research & Development Division
Pakistan Meteorological Department H-8/4, Islamabad.
Phone # off  +92518358714
Cell # +923155145014


Re: [OMPI users] long initialization

2014-08-26 Thread Ralph Castain
I think something may be messed up with your installation. I went ahead and 
tested this on a Slurm 2.5.4 cluster, and got the following:

$ time mpirun -np 1 --host bend001 ./hello
Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12

real0m0.086s
user0m0.039s
sys 0m0.046s

$ time mpirun -np 1 --host bend002 ./hello
Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12

real0m0.528s
user0m0.021s
sys 0m0.023s

Which is what I would have expected. With --host set to the local host, no 
daemons are being launched and so the time is quite short (just spent mapping 
and fork/exec). With --host set to a single remote host, you have the time it 
takes Slurm to launch our daemon on the remote host, so you get about half of a 
second.

IIRC, you were having some problems with the OOB setup. If you specify the TCP 
interface to use, does your time come down?


On Aug 26, 2014, at 8:32 AM, Timur Ismagilov  wrote:

> I'm using slurm 2.5.6
> 
> $salloc -N8 --exclusive -J ompi -p test
> 
> $ srun hostname
> node1-128-21
> node1-128-24
> node1-128-22
> node1-128-26
> node1-128-27
> node1-128-20
> node1-128-25
> node1-128-23
> 
> $ time mpirun -np 1 --host node1-128-21 ./hello_c
> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
> 21, 2014 (nightly snapshot tarball), 146)
> 
> real 1m3.932s
> user 0m0.035s
> sys 0m0.072s
> 
> 
> 
> 
> Tue, 26 Aug 2014 07:03:58 -0700 от Ralph Castain :
> hmmmwhat is your allocation like? do you have a large hostfile, for 
> example?
> 
> if you add a --host argument that contains just the local host, what is the 
> time for that scenario?
> 
> On Aug 26, 2014, at 6:27 AM, Timur Ismagilov  wrote:
> 
>> Hello!
>> Here is my time results:
>> 
>> $time mpirun -n 1 ./hello_c
>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>> 21, 2014 (nightly snapshot tarball), 146)
>> 
>> real 1m3.985s
>> user 0m0.031s
>> sys 0m0.083s
>> 
>> 
>> 
>> 
>> Fri, 22 Aug 2014 07:43:03 -0700 от Ralph Castain :
>> I'm also puzzled by your timing statement - I can't replicate it:
>> 
>> 07:41:43  $ time mpirun -n 1 ./hello_c
>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>> Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer 
>> copy, 125)
>> 
>> real 0m0.547s
>> user 0m0.043s
>> sys  0m0.046s
>> 
>> The entire thing ran in 0.5 seconds
>> 
>> 
>> On Aug 22, 2014, at 6:33 AM, Mike Dubman  wrote:
>> 
>>> Hi,
>>> The default delimiter is ";" . You can change delimiter with 
>>> mca_base_env_list_delimiter.
>>> 
>>> 
>>> 
>>> On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  wrote:
>>> Hello!
>>> If i use latest night snapshot:
>>> $ ompi_info -V
>>> Open MPI v1.9a1r32570
>>> 
>>> In programm hello_c initialization takes ~1 min
>>> In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less)
>>> if i use 
>>> $mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
>>> --map-by slot:pe=8 -np 1 ./hello_c
>>> i got error 
>>> config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>>> 'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>> but with -x all works fine (but with warn)
>>> $mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>> WARNING: The mechanism by which environment variables are explicitly
>>> ..
>>> ..
>>> ..
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>> 21, 2014 (nightly snapshot tarball), 146)
>>> 
>>> 
>>> Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain :
>>> Not sure I understand. The problem has been fixed in both the trunk and the 
>>> 1.8 branch now, so you should be able to work with either of those nightly 
>>> builds.
>>> 
>>> On Aug 21, 2014, at 12:02 AM, Timur Ismagilov  wrote:
>>> 
 Have i I any opportunity to run mpi jobs?
 
 
 Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain :
 yes, i know - it is cmr'd
 
 On Aug 20, 2014, at 10:26 AM, Mike Dubman  wrote:
 
> btw, we get same error in v1.8 branch as well.
> 
> 
> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:
> It was not yet fixed - but should be now.
> 
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
> 
>> Hello!
>> 
>> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still 
>> have the problem
>> 
>> a)
>> $ mpirun  -np 1 ./hello_c
>> 
>> --
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back