Hello List!

It turned out that the file /etc/openmpi/openmpi-mca-params.conf on node green
was the only one in the cluster to contain the line

btl_tcp_port_min_v4 = 49152

Once this line was commented out, the tests suggested below, and the sbatch script
previously emailed,
work.
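
For what it is worth, the value that each node actually picks up can be checked with ompi_info, run on the node in question (just a rough check; the grep pattern below is only illustrative):

~$ ompi_info --param btl tcp | grep btl_tcp_port

which should list btl_tcp_port_min_v4 (and btl_tcp_port_range_v4) as seen by that node.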

Now, if I put the above line, namely,

btl_tcp_port_min_v4 = 49152

in /etc/openmpi/openmpi-mca-params.conf on every node, then:

orterun -np 2 -H orange phello

gives

[orange][[42511,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
 bind() failed: Permission denied (13)
[orange][[42511,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
 bind() failed: Permission denied (13)
Hello world! I am 0 of 2 and my name is `orange'
Hello world! I am 1 of 2 and my name is `orange'

whereas

orterun -np 2 -H orange,yellow phello

gives

[orange][[42561,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
 bind() failed: Permission denied (13)
[yellow][[42561,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
 bind() failed: Permission denied (13)
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

 Process 1 ([[42561,1],0]) is on host: orange
 Process 2 ([[42561,1],1]) is on host: yellow
 BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[orange:9702] Abort before MPI_INIT completed successfully; not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[yellow:9704] Abort before MPI_INIT completed successfully; not able to 
guarantee that all other processes were killed!
--------------------------------------------------------------------------
orterun has exited due to process rank 0 with PID 9702 on
node orange exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
[rainbow:07504] 1 more process has sent help message help-mca-bml-r2.txt / 
unreachable proc
[rainbow:07504] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
[rainbow:07504] 1 more process has sent help message help-mpi-runtime / 
mpi_init:startup:internal-failure
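
(For testing purposes, I believe the same parameter can also be passed on the orterun command line rather than through openmpi-mca-params.conf, e.g.

~$ orterun --mca btl_tcp_port_min_v4 49152 -np 2 -H orange phello

which should reproduce the same bind() failure if the parameter value itself is the problem.)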


I would like to know what is to blame:
the (recent) btl_tcp_port_min_v4 feature?
or the local SLURM setup?
If the local SLURM setup is at fault, what may be wrong?
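
(If it helps to narrow this down, I can also rerun with the tcp BTL forced and some verbosity, e.g.

~$ orterun --mca btl tcp,self --mca btl_base_verbose 30 -np 2 -H orange,yellow phello

and post the output.)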

Thanks in advance,
Jerome

Jerome BENOIT wrote:
Hi !

Dirk Eddelbuettel wrote:
On 3 April 2009 at 03:33, Jerome BENOIT wrote:
| The above submission works the same on my clusters.
| But in fact, my issue involves interconnection between the nodes of the cluster:
| the above examples involve no connection between nodes.
|
| My cluster is a cluster of quadcore computers:
| if in the sbatch script
|
| #SBATCH --nodes=7
| #SBATCH --ntasks=15
|
| is replaced by
|
| #SBATCH --nodes=1
| #SBATCH --ntasks=4
|
| everything is fine, as no interconnection is involved.
|
| Can you test the interconnection part of the story?

Again, think about it in terms of layers. You have a problem with slurm on top
of Open MPI.


So before blaming Open MPI, I would try something like this:

~$ orterun -np 2 -H abc,xyz /tmp/jerome_hw
Hello world! I am 1 of 2 and my name is `abc'
Hello world! I am 0 of 2 and my name is `xyz'
~$

I got it: I am very new to Open MPI.
It is working on every node except one (`green'):
I have to blame my cluster.

I will try to fix it soon.

Thank you very much for your help,
Jerome


i.e. whether the simple MPI example can be launched successfully on two nodes or not.

Dirk

