I cloned the latest version of Open MPI from GitHub on Grid5000.
128 nodes were reserved from the Nancy site. During execution of my MPI code I
got the error message below:
A process or daemon was unable to complete a TCP connection
to another process:
Local host: graphene-17
Remote host: graphene-91
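A common first check for this kind of failure is to pin both the runtime (oob/tcp)
and MPI (btl/tcp) traffic to a single interface that is known to be routable
between all nodes; this is only a sketch, and eth0 below is a placeholder for
whatever interface the graphene nodes actually use:

  $ mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
           -np 48 -machinefile mf ./a.out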
Hi,
now this bug also happens when I launch my mpirun command from the
compute node.
Cyril.
On 06/04/2017 at 05:38, r...@open-mpi.org wrote:
> I believe this has been fixed now - please let me know
>
>> On Mar 30, 2017, at 1:57 AM, Cyril Bordage wrote:
>>
>> Hello,
>>
>> I am using the git
Can you be a bit more specific?
- What version of Open MPI are you using?
- How did you configure Open MPI?
- How are you launching Open MPI applications?
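For reference, the first two answers can be read off the installation and the
source checkout; a small sketch (the source path is a placeholder):

  $ mpirun --version                                  # Open MPI version string
  $ ompi_info | grep -i configure                     # configure command line recorded at build time
  $ git -C /path/to/ompi-src rev-parse --short HEAD   # commit the build was made from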
> On Apr 13, 2017, at 9:08 AM, Cyril Bordage wrote:
>
> Hi,
>
> now this bug also happens when I launch my mpirun command from the
> compute node.
Also, can you please run
lstopo
on both your login and compute nodes?
Cheers,
Gilles
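One way to compare the two, assuming hwloc's lstopo is installed on both sides
and passwordless ssh works (node01 is a placeholder host name):

  $ lstopo --of console > login.topo
  $ ssh node01 lstopo --of console > compute.topo
  $ diff login.topo compute.topo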
- Original Message -
> Can you be a bit more specific?
>
> - What version of Open MPI are you using?
> - How did you configure Open MPI?
> - How are you launching Open MPI applications?
>
>
> > On A
I am using the 6886c12 commit.
I have no particular option for the configuration.
I launch my application in the same way as I presented in my first email;
here is the exact line: mpirun -np 48 -machinefile mf -bind-to core
-report-bindings ./a.out
lstopo does give the same output on both types of nodes.
OK thanks,
we've had some issues in the past when Open MPI assumed that the (login)
node running mpirun has the same topology as the other (compute) nodes.
I just wanted to rule out this scenario.
Cheers,
Gilles
- Original Message -
> I am using the 6886c12 commit.
> I have no particular option for the configuration.
There are several kinds of communications:
- ssh from mpirun to the compute nodes, and also between compute nodes
(assuming you use a machine file and no supported batch manager), to spawn
the orted daemons
- oob/tcp connections between the orted daemons
- btl/tcp connections between the MPI tasks
You can restrict the port ranges used for these connections.
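For example (the port numbers are arbitrary, and the MCA parameter names used
here, oob_tcp_dynamic_ipv4_ports, btl_tcp_port_min_v4 and btl_tcp_port_range_v4,
should be checked against your build with ompi_info --all):

  $ mpirun --mca oob_tcp_dynamic_ipv4_ports 46000-46099 \
           --mca btl_tcp_port_min_v4 47000 --mca btl_tcp_port_range_v4 100 \
           -np 48 -machinefile mf -bind-to core -report-bindings ./a.out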
We are asking all these questions because we cannot replicate your problem - so
we are trying to help you figure out what is different or missing from your
machine. When I run your cmd line on my system, I get:
[rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]:
[BB/../../../
When I run this command from a compute node I also get that. But not
when I run it from a login node (with the same machine file).
Cyril.
On 13/04/2017 at 16:22, r...@open-mpi.org wrote:
> We are asking all these questions because we cannot replicate your problem -
> so we are trying to help you figure out what is different or missing from your machine.
Try adding "-mca rmaps_base_verbose 5” and see what that output tells us - I
assume you have a debug build configured, yes (i.e., added --enable-debug to
configure line)?
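Combined with the command line from earlier in the thread, that would be
something like:

  $ mpirun --mca rmaps_base_verbose 5 -np 48 -machinefile mf \
           -bind-to core -report-bindings ./a.out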
> On Apr 13, 2017, at 7:28 AM, Cyril Bordage wrote:
>
> When I run this command from a compute node I also get that. But not
> when I run it from a login node (with the same machine file).
Is your compute node included in your machine file?
If yes, what if you invoke mpirun from a compute node not listed in your
machine file?
It can also be helpful to post your machinefile.
Cheers,
Gilles
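For reference, a machine file in the plain Open MPI hostfile syntax lists one
host per line with an explicit slot count; the host names here are placeholders:

  node01 slots=24
  node02 slots=24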
On Thursday, April 13, 2017, Cyril Bordage wrote:
> When I run this command from a compute node I also get that.
Here is the output:
##
[devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
[devel11:80858] mca:rmaps:select: checking available component mindist
[devel11:80858] mca:rmaps:select: Querying component
My machine file is: miriel025*24 miriel026*24
On 13/04/2017 at 16:46, Cyril Bordage wrote:
> Here is the output:
> ##
> [devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
> [devel11:80858] mca:r
Okay, so your login node was able to figure out all the bindings. I don't see
any debug output from your compute nodes, which is suspicious.
Try adding --leave-session-attached to the cmd line and let’s see if we can
capture the compute node daemon’s output
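Put together with the previous suggestion, that is roughly:

  $ mpirun --leave-session-attached --mca rmaps_base_verbose 5 \
           -np 48 -machinefile mf -bind-to core -report-bindings ./a.out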
> On Apr 13, 2017, at 7:48 AM, Cyril Bordage wrote:
[devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
[devel11:17550] mca:rmaps:select: checking available component mindist
[devel11:17550] mca:rmaps:select: Querying component [mindist]
[devel11:17550] mca:rmaps:select: checking available component ppr
[devel11:17550] mca:rm
Okay, so as far as OMPI is concerned, it correctly bound everyone! So how are
you generating this output claiming it isn’t bound?
> On Apr 13, 2017, at 7:57 AM, Cyril Bordage wrote:
>
> devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
> [devel11:17550] mca:rmaps:selec
'-report-bindings' does that.
I used this option because the ranks did not seem to be bound (if I use
a rank file the performance is far better).
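For reference, a rank file of the kind mentioned here pins every rank to an
explicit host, socket and core, and is passed to mpirun with -rf; a minimal
sketch with placeholder host names (check the rankfile section of the mpirun
man page for your build):

  rank 0=node01 slot=0:0
  rank 1=node01 slot=0:1
  rank 2=node02 slot=0:0

  $ mpirun -np 3 -rf myrankfile ./a.out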
On 13/04/2017 at 17:24, r...@open-mpi.org wrote:
> Okay, so as far as OMPI is concerned, it correctly bound everyone! So how are
> you generating this output claiming it isn't bound?
All right, let’s replace rmaps_base_verbose with odls_base_verbose and see what
that says
> On Apr 13, 2017, at 8:34 AM, Cyril Bordage wrote:
>
> '-report-bindings' does that.
> I used this option because the ranks did not seem to be bound (if I use
> a rank file the performance is far better)
Ralph,
I can reproduce the issue with just two nodes and the latest master.
All commands are run on n1, which has the same topology (2 sockets * 8
cores each) as n2.
1) everything works
$ mpirun -np 16 -bind-to core --report-bindings true
[n1:29794] MCW rank 0 bound to socket 0[core 0[hw