Hello List!
It turned out that the file /etc/openmpi/openmpi-mca-params.conf on node green
was the only one in the cluster to contain the line

btl_tcp_port_min_v4 = 49152

Once this line was commented out, the tests suggested below, and the sbatch
script previously emailed, both work.
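For the record, the relevant fragment of /etc/openmpi/openmpi-mca-params.conf on green now looks like this (the `#` comment marker is my assumption about how the line was disabled):

```ini
# /etc/openmpi/openmpi-mca-params.conf (node green)
# Disabled: forcing the BTL/TCP port range broke inter-node runs here.
# btl_tcp_port_min_v4 = 49152
```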
Now, if I put the above line back, namely

btl_tcp_port_min_v4 = 49152

in each node's /etc/openmpi/openmpi-mca-params.conf, then:
orterun -np 2 -H orange phello
gives
[orange][[42511,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
bind() failed: Permission denied (13)
[orange][[42511,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
bind() failed: Permission denied (13)
Hello world! I am 0 of 2 and my name is `orange'
Hello world! I am 1 of 2 and my name is `orange'
whereas
orterun -np 2 -H orange,yellow phello
gives
[orange][[42561,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
bind() failed: Permission denied (13)
[yellow][[42561,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_component.c:596:mca_btl_tcp_component_create_listen]
bind() failed: Permission denied (13)
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[42561,1],0]) is on host: orange
Process 2 ([[42561,1],1]) is on host: yellow
BTLs attempted: self sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[orange:9702] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[yellow:9704] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed!
--------------------------------------------------------------------------
orterun has exited due to process rank 0 with PID 9702 on
node orange exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
[rainbow:07504] 1 more process has sent help message help-mca-bml-r2.txt /
unreachable proc
[rainbow:07504] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
[rainbow:07504] 1 more process has sent help message help-mpi-runtime /
mpi_init:startup:internal-failure
I would like to know which is to blame:
the (recent) btl_tcp_port_min_v4 feature?
or the local SLURM setup?
If the local SLURM setup is at fault, what might be wrong?
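As a quick sanity check on the nodes themselves, one could verify that the port range starting at 49152 is bindable at all by an unprivileged process. This is a hypothetical diagnostic of mine, not part of Open MPI; a persistent EACCES here would point at the node (e.g. a security policy), not at the MCA parameter:

```python
import socket

def can_bind(port: int) -> bool:
    """Try to bind a TCP socket to `port` on all interfaces; report failures."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("", port))
        return True
    except OSError as exc:
        print(f"bind() failed on port {port}: {exc}")
        return False
    finally:
        s.close()

# Probe the start of the range named by btl_tcp_port_min_v4.
for port in range(49152, 49160):
    if can_bind(port):
        print(f"port {port} is bindable")
        break
```

Run as the same unprivileged user that SLURM starts the MPI processes as, on each node, and compare the results across the cluster.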
Thanks in advance,
Jerome
Jerome BENOIT wrote:
Hi !
Dirk Eddelbuettel wrote:
On 3 April 2009 at 03:33, Jerome BENOIT wrote:
| The above submission works the same on my clusters.
| But in fact, my issue involves interconnection between the nodes of the cluster:
| the above examples involve no connection between nodes.
|
| My cluster is a cluster of quadcore computers:
| if in the sbatch script
|
| #SBATCH --nodes=7
| #SBATCH --ntasks=15
|
| is replaced by
|
| #SBATCH --nodes=1
| #SBATCH --ntasks=4
|
| everything is fine, as no interconnection is involved.
|
| Can you test the interconnection part of the story?
Again, think about it in terms of layers. You have a problem with slurm on
top of Open MPI.
So before blaming Open MPI, I would try something like this:
~$ orterun -np 2 -H abc,xyz /tmp/jerome_hw
Hello world! I am 1 of 2 and my name is `abc'
Hello world! I am 0 of 2 and my name is `xyz'
~$
I got it: I am very new to Open MPI.
It is working with every node except one (`green'):
I have to blame my cluster.
I will try to fix it soon.
Thank you very much for your help,
Jerome
i.e. whether the simple MPI example can be launched successfully on two
nodes or not.
Dirk
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users