[OMPI devel] Unable to complete a TCP connection

2017-04-13 Thread Emin Nuriyev
I cloned the latest version of Open MPI from GitHub on Grid'5000.

128 nodes were reserved from the Nancy site. During execution of my MPI code I
got the error message below:

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:graphene-17
  Remote host:   graphene-91
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

I deployed my own OS image, and the firewall is fine. Since the same OS image
was deployed on all reserved nodes, the fact that Open MPI could connect to some
of them and execute code means the firewall accepted the connections.
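
One way I can double-check raw TCP reachability between two nodes, independent
of Open MPI, is with netcat (the port 5555 below is just an arbitrary
unprivileged port; traditional netcat needs "nc -l -p 5555" instead):

  # on graphene-91: listen on a test port
  nc -l 5555
  # on graphene-17: try to connect to it
  nc -zv graphene-91 5555

If this also fails with "No route to host", the problem is in the network or
firewall layer rather than in Open MPI.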

There is no problem connecting to graphene-91 with ssh, but the command line
below does not work:

mpirun -host  graphene-91 -n 1 exec_code

I get the same message, "Unable to complete a TCP connection".


Sometimes I got this error:

WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: graphene-27
--
[graphene-26][[56971,1],52][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect]
connect() to 172.18.64.25 failed: No route to host (113)
[graphene-29][[56971,1],60][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect]
connect() to 172.18.64.27 failed: No route to host (113)
[graphene-14.nancy.grid5000.fr:02890] 15 more processes have sent help
message help-mpi-btl-openib.txt / no active ports found
[graphene-14.nancy.grid5000.fr:02890] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages

When I change the command line to select eth0 with an MCA parameter, I get
another error. Maybe I get this kind of error because this is not a stable
version?
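
For reference, the eth0 selection I tried looks roughly like this (exec_code is
just my binary; whether the extra "--mca btl ^openib", which skips the
OpenFabrics BTL, is also needed, I am not sure):

  mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 \
         --mca btl ^openib \
         -host graphene-91 -n 1 exec_code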

Yours faithfully,
Emin Nuriyev
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
Hi,

this bug now also happens when I launch my mpirun command from the
compute node.


Cyril.

On 06/04/2017 at 05:38, r...@open-mpi.org wrote:
> I believe this has been fixed now - please let me know
> 
>> On Mar 30, 2017, at 1:57 AM, Cyril Bordage  wrote:
>>
>> Hello,
>>
>> I am using the git version of MPI with "-bind-to core -report-bindings"
>> and I get that for all processes:
>> [miriel010:160662] MCW rank 0 not bound
>>
>>
>> When I use an old version I get:
>> [miriel010:44921] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
>> [B/././././././././././.][./././././././././././.]
>>
>> From git bisect the culprit seems to be: 48fc339
>>
>> This bug happends only when I launch my mpirun command from a login node
>> and not
>> from a compute node.
>>
>>
>> Cyril.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Jeff Squyres (jsquyres)
Can you be a bit more specific?

- What version of Open MPI are you using?
- How did you configure Open MPI?
- How are you launching Open MPI applications?


> On Apr 13, 2017, at 9:08 AM, Cyril Bordage  wrote:
> 
> Hi,
> 
> now this bug happens also when I launch my mpirun command from the
> compute node.
> 
> 
> Cyril.
> 
> Le 06/04/2017 à 05:38, r...@open-mpi.org a écrit :
>> I believe this has been fixed now - please let me know
>> 
>>> On Mar 30, 2017, at 1:57 AM, Cyril Bordage  wrote:
>>> 
>>> Hello,
>>> 
>>> I am using the git version of MPI with "-bind-to core -report-bindings"
>>> and I get that for all processes:
>>> [miriel010:160662] MCW rank 0 not bound
>>> 
>>> 
>>> When I use an old version I get:
>>> [miriel010:44921] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
>>> [B/././././././././././.][./././././././././././.]
>>> 
>>> From git bisect the culprit seems to be: 48fc339
>>> 
>>> This bug happends only when I launch my mpirun command from a login node
>>> and not
>>> from a compute node.
>>> 
>>> 
>>> Cyril.


-- 
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread gilles
Also, can you please run
lstopo
on both your login and compute nodes?
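
For instance, something along these lines (the node name is just an example,
and it assumes lstopo is in the default PATH over ssh):

  lstopo --of console > topo-login.txt
  ssh miriel010 lstopo --of console > topo-compute.txt
  diff topo-login.txt topo-compute.txt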

Cheers,

Gilles


- Original Message -
> Can you be a bit more specific?
> 
> - What version of Open MPI are you using?
> - How did you configure Open MPI?
> - How are you launching Open MPI applications?
> 
> 
> > On Apr 13, 2017, at 9:08 AM, Cyril Bordage  
wrote:
> > 
> > Hi,
> > 
> > now this bug happens also when I launch my mpirun command from the
> > compute node.
> > 
> > 
> > Cyril.
> > 
> > Le 06/04/2017 à 05:38, r...@open-mpi.org a écrit :
> >> I believe this has been fixed now - please let me know
> >> 
> >>> On Mar 30, 2017, at 1:57 AM, Cyril Bordage  wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> I am using the git version of MPI with "-bind-to core -report-
bindings"
> >>> and I get that for all processes:
> >>> [miriel010:160662] MCW rank 0 not bound
> >>> 
> >>> 
> >>> When I use an old version I get:
> >>> [miriel010:44921] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> >>> [B/././././././././././.][./././././././././././.]
> >>> 
> >>> From git bisect the culprit seems to be: 48fc339
> >>> 
> >>> This bug happends only when I launch my mpirun command from a 
login node
> >>> and not
> >>> from a compute node.
> >>> 
> >>> 
> >>> Cyril.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
I am using commit 6886c12.
I have no particular option for the configuration.
I launch my application in the same way as I presented in my first email;
here is the exact line: mpirun -np 48 -machinefile mf -bind-to core
-report-bindings ./a.out

lstopo gives the same output on both types of nodes. What is the
purpose of that?

Thanks.


Cyril.

On 13/04/2017 at 15:24, gil...@rist.or.jp wrote:
> Also, can you please run
> lstopo
> on both your login and compute nodes ?
> 
> Cheers,
> 
> Gilles
> 
> 
> - Original Message -
>> Can you be a bit more specific?
>>
>> - What version of Open MPI are you using?
>> - How did you configure Open MPI?
>> - How are you launching Open MPI applications?
>>
>>
>>> On Apr 13, 2017, at 9:08 AM, Cyril Bordage  
> wrote:
>>>
>>> Hi,
>>>
>>> now this bug happens also when I launch my mpirun command from the
>>> compute node.
>>>
>>>
>>> Cyril.
>>>
>>> Le 06/04/2017 à 05:38, r...@open-mpi.org a écrit :
 I believe this has been fixed now - please let me know

> On Mar 30, 2017, at 1:57 AM, Cyril Bordage > wrote:
>
> Hello,
>
> I am using the git version of MPI with "-bind-to core -report-
> bindings"
> and I get that for all processes:
> [miriel010:160662] MCW rank 0 not bound
>
>
> When I use an old version I get:
> [miriel010:44921] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> [B/././././././././././.][./././././././././././.]
>
> From git bisect the culprit seems to be: 48fc339
>
> This bug happends only when I launch my mpirun command from a 
> login node
> and not
> from a compute node.
>
>
> Cyril.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread gilles
OK thanks,

we've had some issues in the past when Open MPI assumed that the (login)
node running mpirun has the same topology as the other (compute) nodes.
I just wanted to rule out that scenario.
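
(If that had been the cause, one possible workaround is to tell mpirun that
nodes may differ in topology; I have not checked whether current master still
accepts this option:

  mpirun --hetero-nodes -np 48 -machinefile mf -bind-to core -report-bindings ./a.out

but since lstopo reports identical topologies, that is probably not the issue
here.)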

Cheers,

Gilles

- Original Message -
> I am using the 6886c12 commit.
> I have no particular option for the configuration.
> I launch my application in the same way as I presented in my firt 
email,
> there is the exact line: mpirun -np 48 -machinefile mf -bind-to core
> -report-bindings ./a.out
> 
> lstopo does give the same output on both types on nodes. What is the
> purpose of that?
> 
> Thanks.
> 
> 
> Cyril.
> 
> Le 13/04/2017 à 15:24, gil...@rist.or.jp a écrit :
> > Also, can you please run
> > lstopo
> > on both your login and compute nodes ?
> > 
> > Cheers,
> > 
> > Gilles
> > 
> > 
> > - Original Message -
> >> Can you be a bit more specific?
> >>
> >> - What version of Open MPI are you using?
> >> - How did you configure Open MPI?
> >> - How are you launching Open MPI applications?
> >>
> >>
> >>> On Apr 13, 2017, at 9:08 AM, Cyril Bordage  
> > wrote:
> >>>
> >>> Hi,
> >>>
> >>> now this bug happens also when I launch my mpirun command from the
> >>> compute node.
> >>>
> >>>
> >>> Cyril.
> >>>
> >>> Le 06/04/2017 à 05:38, r...@open-mpi.org a écrit :
>  I believe this has been fixed now - please let me know
> 
> > On Mar 30, 2017, at 1:57 AM, Cyril Bordage  >> wrote:
> >
> > Hello,
> >
> > I am using the git version of MPI with "-bind-to core -report-
> > bindings"
> > and I get that for all processes:
> > [miriel010:160662] MCW rank 0 not bound
> >
> >
> > When I use an old version I get:
> > [miriel010:44921] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> > [B/././././././././././.][./././././././././././.]
> >
> > From git bisect the culprit seems to be: 48fc339
> >
> > This bug happends only when I launch my mpirun command from a 
> > login node
> > and not
> > from a compute node.
> >
> >
> > Cyril.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Unable to complete a TCP connection

2017-04-13 Thread Gilles Gouaillardet
There are several kinds of communications:
- ssh from mpirun to the compute nodes, and also between compute nodes
(assuming you use a machine file and no supported batch manager), to spawn
the orted daemons
- oob/tcp connections between the orted daemons
- btl/tcp connections between MPI tasks

You can restrict the port ranges used by oob/tcp and btl/tcp and open them
in the firewall if you really want one. (I strongly suggest you try without a
firewall first.)

Now looking at the error message "No route to host":
that could mean there really is no route, so you should include/exclude some
subnets/interfaces, e.g.
mpirun --mca btl_tcp_if_include ... --mca oob_tcp_if_include ... ..
Or there might be a route, but the firewall reports otherwise.
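
For example, to pin everything to the Ethernet interface and a known port
range (the interface name and port numbers below are only placeholders, and
the parameter names should be double checked with "ompi_info --param btl tcp
--level 9" and "ompi_info --param oob tcp --level 9" for your build):

  mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
         --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 \
         --mca oob_tcp_dynamic_ipv4_ports 11000-11100 \
         -host graphene-91 -n 1 exec_code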

Cheers,

Gilles

On Thursday, April 13, 2017, Emin Nuriyev wrote:

> I cloned from github latest version of Open MPI on grid5000.
>
> 128 nodes was reserved from nancy site. During execution of my mpi code I
> got error message below:
>
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:graphene-17
>   Remote host:   graphene-91
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
>
> I deployed my OS image. Everything is OK with firewall. Consider that same
> OS image was deployed on all reserved nodes, if Open MPI could connect some
> of them and execute code it means firewall accepted input.
>
> Thre is no problem to connect to graphene-91 with ssh.  But below
> comanline does not work
>
> mpirun -host  graphene-91 -n 1 exec_code
>
> I get same message "Unable to complete a TCP connection"
>
>
> Sometimes I got this error:
>
> WARNING: There is at least non-excluded one OpenFabrics device found,
> but there are no active ports detected (or Open MPI was unable to use
> them).  This is most certainly not what you wanted.  Check your
> cables, subnet manager configuration, etc.  The openib BTL will be
> ignored for this job.
>
>   Local host: graphene-27
> --
> [graphene-26][[56971,1],52][btl_tcp_endpoint.c:796:mca_
> btl_tcp_endpoint_complete_connect] connect() to 172.18.64.25 failed: No
> route to host (113)
> [graphene-29][[56971,1],60][btl_tcp_endpoint.c:796:mca_
> btl_tcp_endpoint_complete_connect] connect() to 172.18.64.27 failed: No
> route to host (113)
> [graphene-14.nancy.grid5000.fr:02890] 15 more processes have sent help
> message help-mpi-btl-openib.txt / no active ports found
> [graphene-14.nancy.grid5000.fr:02890] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
>
> When I change command line using mca parameter to select eth0, there is
> another error. This is not stable version maybe therefore, I get such kind
> of error ?
>
> Your faithfully,
> Emin Nuriyev
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread r...@open-mpi.org
We are asking all these questions because we cannot replicate your problem - so 
we are trying to help you figure out what is different or missing from your 
machine. When I run your cmd line on my system, I get:

[rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 25 bound to socket 1[core 12[hwt 0-1]]: 
[../../../../../../../../../../../..][BB/../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 26 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../../../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 27 bound to socket 1[core 13[hwt 0-1]]: 
[../../../../../../../../../../../..][../BB/../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 28 bound to socket 0[core 2[hwt 0-1]]: 
[../../BB/../../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 29 bound to socket 1[core 14[hwt 0-1]]: 
[../../../../../../../../../../../..][../../BB/../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 30 bound to socket 0[core 3[hwt 0-1]]: 
[../../../BB/../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 31 bound to socket 1[core 15[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../BB/../../../../../../../..]
[rhc002.cluster:55965] MCW rank 32 bound to socket 0[core 4[hwt 0-1]]: 
[../../../../BB/../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 33 bound to socket 1[core 16[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../BB/../../../../../../..]
[rhc002.cluster:55965] MCW rank 34 bound to socket 0[core 5[hwt 0-1]]: 
[../../../../../BB/../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 35 bound to socket 1[core 17[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../BB/../../../../../..]
[rhc002.cluster:55965] MCW rank 36 bound to socket 0[core 6[hwt 0-1]]: 
[../../../../../../BB/../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 37 bound to socket 1[core 18[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../../BB/../../../../..]
[rhc002.cluster:55965] MCW rank 38 bound to socket 0[core 7[hwt 0-1]]: 
[../../../../../../../BB/../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 39 bound to socket 1[core 19[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../../../BB/../../../..]
[rhc002.cluster:55965] MCW rank 40 bound to socket 0[core 8[hwt 0-1]]: 
[../../../../../../../../BB/../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 41 bound to socket 1[core 20[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../../../../BB/../../..]
[rhc002.cluster:55965] MCW rank 42 bound to socket 0[core 9[hwt 0-1]]: 
[../../../../../../../../../BB/../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 43 bound to socket 1[core 21[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../../../../../BB/../..]
[rhc002.cluster:55965] MCW rank 44 bound to socket 0[core 10[hwt 0-1]]: 
[../../../../../../../../../../BB/..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 45 bound to socket 1[core 22[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../../../../../../BB/..]
[rhc002.cluster:55965] MCW rank 46 bound to socket 0[core 11[hwt 0-1]]: 
[../../../../../../../../../../../BB][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 47 bound to socket 1[core 23[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../../../../../../../../../BB]
[rhc001:197743] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 1 bound to socket 1[core 12[hwt 0-1]]: 
[../../../../../../../../../../../..][BB/../../../../../../../../../../..]
[rhc001:197743] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../../../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 3 bound to socket 1[core 13[hwt 0-1]]: 
[../../../../../../../../../../../..][../BB/../../../../../../../../../..]
[rhc001:197743] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: 
[../../BB/../../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 5 bound to socket 1[core 14[hwt 0-1]]: 
[../../../../../../../../../../../..][../../BB/../../../../../../../../..]
[rhc001:197743] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: 
[../../../BB/../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 7 bound to socket 1[core 15[hwt 0-1]]: 
[../../../../../../../../../../../..][../../../BB/../../../../../../../..]
[rhc001:197743] MCW rank 8 bound to socket 0[core 4[hwt 0-1]]: 
[../../../../BB/../../../../../../..][../../../../../../../../.

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
When I run this command from the compute node I also get that output, but not
when I run it from a login node (with the same machine file).


Cyril.

On 13/04/2017 at 16:22, r...@open-mpi.org wrote:
> We are asking all these questions because we cannot replicate your problem - 
> so we are trying to help you figure out what is different or missing from 
> your machine. When I run your cmd line on my system, I get:
> 
> [rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: 
> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 25 bound to socket 1[core 12[hwt 0-1]]: 
> [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 26 bound to socket 0[core 1[hwt 0-1]]: 
> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 27 bound to socket 1[core 13[hwt 0-1]]: 
> [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 28 bound to socket 0[core 2[hwt 0-1]]: 
> [../../BB/../../../../../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 29 bound to socket 1[core 14[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../BB/../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 30 bound to socket 0[core 3[hwt 0-1]]: 
> [../../../BB/../../../../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 31 bound to socket 1[core 15[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../BB/../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 32 bound to socket 0[core 4[hwt 0-1]]: 
> [../../../../BB/../../../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 33 bound to socket 1[core 16[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../BB/../../../../../../..]
> [rhc002.cluster:55965] MCW rank 34 bound to socket 0[core 5[hwt 0-1]]: 
> [../../../../../BB/../../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 35 bound to socket 1[core 17[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../BB/../../../../../..]
> [rhc002.cluster:55965] MCW rank 36 bound to socket 0[core 6[hwt 0-1]]: 
> [../../../../../../BB/../../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 37 bound to socket 1[core 18[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../../BB/../../../../..]
> [rhc002.cluster:55965] MCW rank 38 bound to socket 0[core 7[hwt 0-1]]: 
> [../../../../../../../BB/../../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 39 bound to socket 1[core 19[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../../../BB/../../../..]
> [rhc002.cluster:55965] MCW rank 40 bound to socket 0[core 8[hwt 0-1]]: 
> [../../../../../../../../BB/../../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 41 bound to socket 1[core 20[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../../../../BB/../../..]
> [rhc002.cluster:55965] MCW rank 42 bound to socket 0[core 9[hwt 0-1]]: 
> [../../../../../../../../../BB/../..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 43 bound to socket 1[core 21[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../../../../../BB/../..]
> [rhc002.cluster:55965] MCW rank 44 bound to socket 0[core 10[hwt 0-1]]: 
> [../../../../../../../../../../BB/..][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 45 bound to socket 1[core 22[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../../../../../../BB/..]
> [rhc002.cluster:55965] MCW rank 46 bound to socket 0[core 11[hwt 0-1]]: 
> [../../../../../../../../../../../BB][../../../../../../../../../../../..]
> [rhc002.cluster:55965] MCW rank 47 bound to socket 1[core 23[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../../../../../../../../../../BB]
> [rhc001:197743] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
> [rhc001:197743] MCW rank 1 bound to socket 1[core 12[hwt 0-1]]: 
> [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
> [rhc001:197743] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: 
> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
> [rhc001:197743] MCW rank 3 bound to socket 1[core 13[hwt 0-1]]: 
> [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
> [rhc001:197743] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: 
> [../../BB/../../../../../../../../..][../../../../../../../../../../../..]
> [rhc001:197743] MCW rank 5 bound to socket 1[core 14[hwt 0-1]]: 
> [../../../../../../../../../../../..][../../BB/../../../../../../../../..]
> [rhc001:197743] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: 
> [../../../BB/..

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread r...@open-mpi.org
Try adding "-mca rmaps_base_verbose 5" and see what that output tells us - I
assume you have a debug build configured, yes (i.e., added --enable-debug to
the configure line)?
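
In other words, re-run your exact command with that parameter added, e.g.:

  mpirun -np 48 -machinefile mf -bind-to core -report-bindings \
         -mca rmaps_base_verbose 5 ./a.out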


> On Apr 13, 2017, at 7:28 AM, Cyril Bordage  wrote:
> 
> When I run this command from the compute node I have also that. But not
> when I run it from a login node (with the same machine file).
> 
> 
> Cyril.
> 
> Le 13/04/2017 à 16:22, r...@open-mpi.org a écrit :
>> We are asking all these questions because we cannot replicate your problem - 
>> so we are trying to help you figure out what is different or missing from 
>> your machine. When I run your cmd line on my system, I get:
>> 
>> [rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: 
>> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 25 bound to socket 1[core 12[hwt 0-1]]: 
>> [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 26 bound to socket 0[core 1[hwt 0-1]]: 
>> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 27 bound to socket 1[core 13[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 28 bound to socket 0[core 2[hwt 0-1]]: 
>> [../../BB/../../../../../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 29 bound to socket 1[core 14[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../BB/../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 30 bound to socket 0[core 3[hwt 0-1]]: 
>> [../../../BB/../../../../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 31 bound to socket 1[core 15[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../BB/../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 32 bound to socket 0[core 4[hwt 0-1]]: 
>> [../../../../BB/../../../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 33 bound to socket 1[core 16[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../BB/../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 34 bound to socket 0[core 5[hwt 0-1]]: 
>> [../../../../../BB/../../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 35 bound to socket 1[core 17[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../BB/../../../../../..]
>> [rhc002.cluster:55965] MCW rank 36 bound to socket 0[core 6[hwt 0-1]]: 
>> [../../../../../../BB/../../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 37 bound to socket 1[core 18[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../../BB/../../../../..]
>> [rhc002.cluster:55965] MCW rank 38 bound to socket 0[core 7[hwt 0-1]]: 
>> [../../../../../../../BB/../../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 39 bound to socket 1[core 19[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../../../BB/../../../..]
>> [rhc002.cluster:55965] MCW rank 40 bound to socket 0[core 8[hwt 0-1]]: 
>> [../../../../../../../../BB/../../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 41 bound to socket 1[core 20[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../../../../BB/../../..]
>> [rhc002.cluster:55965] MCW rank 42 bound to socket 0[core 9[hwt 0-1]]: 
>> [../../../../../../../../../BB/../..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 43 bound to socket 1[core 21[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../../../../../BB/../..]
>> [rhc002.cluster:55965] MCW rank 44 bound to socket 0[core 10[hwt 0-1]]: 
>> [../../../../../../../../../../BB/..][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 45 bound to socket 1[core 22[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../../../../../../BB/..]
>> [rhc002.cluster:55965] MCW rank 46 bound to socket 0[core 11[hwt 0-1]]: 
>> [../../../../../../../../../../../BB][../../../../../../../../../../../..]
>> [rhc002.cluster:55965] MCW rank 47 bound to socket 1[core 23[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../../../../../../../../../../../BB]
>> [rhc001:197743] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
>> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
>> [rhc001:197743] MCW rank 1 bound to socket 1[core 12[hwt 0-1]]: 
>> [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
>> [rhc001:197743] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: 
>> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
>> [rhc001:197743] MCW rank 3 bound to socket 1[core 13[hwt 0-1]]: 
>> [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
>> [rhc001:197743] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]:

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Gilles Gouaillardet
Is your compute node included in your machine file?
If yes, what if you invoke mpirun from a compute node not listed in your
machine file?
It can also be helpful to post your machine file.

Cheers,

Gilles

On Thursday, April 13, 2017, Cyril Bordage  wrote:

> When I run this command from the compute node I have also that. But not
> when I run it from a login node (with the same machine file).
>
>
> Cyril.
>
> Le 13/04/2017 à 16:22, r...@open-mpi.org  a écrit :
> > We are asking all these questions because we cannot replicate your
> problem - so we are trying to help you figure out what is different or
> missing from your machine. When I run your cmd line on my system, I get:
> >
> > [rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 25 bound to socket 1[core 12[hwt 0-1]]:
> [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 26 bound to socket 0[core 1[hwt 0-1]]:
> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 27 bound to socket 1[core 13[hwt 0-1]]:
> [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 28 bound to socket 0[core 2[hwt 0-1]]:
> [../../BB/../../../../../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 29 bound to socket 1[core 14[hwt 0-1]]:
> [../../../../../../../../../../../..][../../BB/../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 30 bound to socket 0[core 3[hwt 0-1]]:
> [../../../BB/../../../../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 31 bound to socket 1[core 15[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../BB/../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 32 bound to socket 0[core 4[hwt 0-1]]:
> [../../../../BB/../../../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 33 bound to socket 1[core 16[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../BB/../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 34 bound to socket 0[core 5[hwt 0-1]]:
> [../../../../../BB/../../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 35 bound to socket 1[core 17[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../BB/../../../../../..]
> > [rhc002.cluster:55965] MCW rank 36 bound to socket 0[core 6[hwt 0-1]]:
> [../../../../../../BB/../../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 37 bound to socket 1[core 18[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../../BB/../../../../..]
> > [rhc002.cluster:55965] MCW rank 38 bound to socket 0[core 7[hwt 0-1]]:
> [../../../../../../../BB/../../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 39 bound to socket 1[core 19[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../../../BB/../../../..]
> > [rhc002.cluster:55965] MCW rank 40 bound to socket 0[core 8[hwt 0-1]]:
> [../../../../../../../../BB/../../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 41 bound to socket 1[core 20[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../../../../BB/../../..]
> > [rhc002.cluster:55965] MCW rank 42 bound to socket 0[core 9[hwt 0-1]]:
> [../../../../../../../../../BB/../..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 43 bound to socket 1[core 21[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../../../../../BB/../..]
> > [rhc002.cluster:55965] MCW rank 44 bound to socket 0[core 10[hwt 0-1]]:
> [../../../../../../../../../../BB/..][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 45 bound to socket 1[core 22[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../../../../../../BB/..]
> > [rhc002.cluster:55965] MCW rank 46 bound to socket 0[core 11[hwt 0-1]]:
> [../../../../../../../../../../../BB][../../../../../../../../../../../..]
> > [rhc002.cluster:55965] MCW rank 47 bound to socket 1[core 23[hwt 0-1]]:
> [../../../../../../../../../../../..][../../../../../../../../../../../BB]
> > [rhc001:197743] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
> > [rhc001:197743] MCW rank 1 bound to socket 1[core 12[hwt 0-1]]:
> [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
> > [rhc001:197743] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]:
> [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
> > [rhc001:197743] MCW rank 3 bound to socket 1[core 13[hwt 0-1]]:
> [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
> > [rhc001:197743] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]:
> [

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
Here is the output:
##
[devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
[devel11:80858] mca:rmaps:select: checking available component mindist
[devel11:80858] mca:rmaps:select: Querying component [mindist]
[devel11:80858] mca:rmaps:select: checking available component ppr
[devel11:80858] mca:rmaps:select: Querying component [ppr]
[devel11:80858] mca:rmaps:select: checking available component rank_file
[devel11:80858] mca:rmaps:select: Querying component [rank_file]
[devel11:80858] mca:rmaps:select: checking available component resilient
[devel11:80858] mca:rmaps:select: Querying component [resilient]
[devel11:80858] mca:rmaps:select: checking available component round_robin
[devel11:80858] mca:rmaps:select: Querying component [round_robin]
[devel11:80858] mca:rmaps:select: checking available component seq
[devel11:80858] mca:rmaps:select: Querying component [seq]
[devel11:80858] [[2965,0],0]: Final mapper priorities
[devel11:80858] Mapper: ppr Priority: 90
[devel11:80858] Mapper: seq Priority: 60
[devel11:80858] Mapper: resilient Priority: 40
[devel11:80858] Mapper: mindist Priority: 20
[devel11:80858] Mapper: round_robin Priority: 10
[devel11:80858] Mapper: rank_file Priority: 0
[devel11:80858] mca:rmaps: mapping job [2965,1]
[devel11:80858] mca:rmaps: setting mapping policies for job [2965,1]
nprocs 48
[devel11:80858] mca:rmaps[169] mapping not set by user - using bynuma
[devel11:80858] mca:rmaps:ppr: job [2965,1] not using ppr mapper PPR
NULL policy PPR NOTSET
[devel11:80858] [[2965,0],0] rmaps:seq called on job [2965,1]
[devel11:80858] mca:rmaps:seq: job [2965,1] not using seq mapper
[devel11:80858] mca:rmaps:resilient: cannot perform initial map of job
[2965,1] - no fault groups
[devel11:80858] mca:rmaps:mindist: job [2965,1] not using mindist mapper
[devel11:80858] mca:rmaps:rr: mapping job [2965,1]
[devel11:80858] [[2965,0],0] Starting with 2 nodes in list
[devel11:80858] [[2965,0],0] Filtering thru apps
[devel11:80858] [[2965,0],0] Retained 2 nodes in list
[devel11:80858] [[2965,0],0] node miriel025 has 24 slots available
[devel11:80858] [[2965,0],0] node miriel026 has 24 slots available
[devel11:80858] AVAILABLE NODES FOR MAPPING:
[devel11:80858] node: miriel025 daemon: 1
[devel11:80858] node: miriel026 daemon: 2
[devel11:80858] [[2965,0],0] Starting bookmark at node miriel025
[devel11:80858] [[2965,0],0] Starting at node miriel025
[devel11:80858] mca:rmaps:rr: mapping no-span by NUMANode for job
[2965,1] slots 48 num_procs 48
[devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel025
[devel11:80858] mca:rmaps:rr: calculated nprocs 24
[devel11:80858] mca:rmaps:rr: assigning nprocs 24
[devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel026
[devel11:80858] mca:rmaps:rr: calculated nprocs 24
[devel11:80858] mca:rmaps:rr: assigning nprocs 24
[devel11:80858] mca:rmaps:base: computing vpids by slot for job [2965,1]
[devel11:80858] mca:rmaps:base: assigning rank 0 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 1 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 2 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 3 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 4 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 5 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 6 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 7 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 8 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 9 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 10 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 11 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 12 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 13 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 14 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 15 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 16 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 17 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 18 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 19 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 20 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 21 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 22 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 23 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 24 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 25 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 26 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 27 to node miriel026
[devel11:80858] mca:rmaps:base: assigning ra

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
My machine file is: miriel025*24 miriel026*24

On 13/04/2017 at 16:46, Cyril Bordage wrote:
> There is the output:
> ##
> [devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
> [devel11:80858] mca:rmaps:select: checking available component mindist
> [devel11:80858] mca:rmaps:select: Querying component [mindist]
> [devel11:80858] mca:rmaps:select: checking available component ppr
> [devel11:80858] mca:rmaps:select: Querying component [ppr]
> [devel11:80858] mca:rmaps:select: checking available component rank_file
> [devel11:80858] mca:rmaps:select: Querying component [rank_file]
> [devel11:80858] mca:rmaps:select: checking available component resilient
> [devel11:80858] mca:rmaps:select: Querying component [resilient]
> [devel11:80858] mca:rmaps:select: checking available component round_robin
> [devel11:80858] mca:rmaps:select: Querying component [round_robin]
> [devel11:80858] mca:rmaps:select: checking available component seq
> [devel11:80858] mca:rmaps:select: Querying component [seq]
> [devel11:80858] [[2965,0],0]: Final mapper priorities
> [devel11:80858] Mapper: ppr Priority: 90
> [devel11:80858] Mapper: seq Priority: 60
> [devel11:80858] Mapper: resilient Priority: 40
> [devel11:80858] Mapper: mindist Priority: 20
> [devel11:80858] Mapper: round_robin Priority: 10
> [devel11:80858] Mapper: rank_file Priority: 0
> [devel11:80858] mca:rmaps: mapping job [2965,1]
> [devel11:80858] mca:rmaps: setting mapping policies for job [2965,1]
> nprocs 48
> [devel11:80858] mca:rmaps[169] mapping not set by user - using bynuma
> [devel11:80858] mca:rmaps:ppr: job [2965,1] not using ppr mapper PPR
> NULL policy PPR NOTSET
> [devel11:80858] [[2965,0],0] rmaps:seq called on job [2965,1]
> [devel11:80858] mca:rmaps:seq: job [2965,1] not using seq mapper
> [devel11:80858] mca:rmaps:resilient: cannot perform initial map of job
> [2965,1] - no fault groups
> [devel11:80858] mca:rmaps:mindist: job [2965,1] not using mindist mapper
> [devel11:80858] mca:rmaps:rr: mapping job [2965,1]
> [devel11:80858] [[2965,0],0] Starting with 2 nodes in list
> [devel11:80858] [[2965,0],0] Filtering thru apps
> [devel11:80858] [[2965,0],0] Retained 2 nodes in list
> [devel11:80858] [[2965,0],0] node miriel025 has 24 slots available
> [devel11:80858] [[2965,0],0] node miriel026 has 24 slots available
> [devel11:80858] AVAILABLE NODES FOR MAPPING:
> [devel11:80858] node: miriel025 daemon: 1
> [devel11:80858] node: miriel026 daemon: 2
> [devel11:80858] [[2965,0],0] Starting bookmark at node miriel025
> [devel11:80858] [[2965,0],0] Starting at node miriel025
> [devel11:80858] mca:rmaps:rr: mapping no-span by NUMANode for job
> [2965,1] slots 48 num_procs 48
> [devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel025
> [devel11:80858] mca:rmaps:rr: calculated nprocs 24
> [devel11:80858] mca:rmaps:rr: assigning nprocs 24
> [devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel026
> [devel11:80858] mca:rmaps:rr: calculated nprocs 24
> [devel11:80858] mca:rmaps:rr: assigning nprocs 24
> [devel11:80858] mca:rmaps:base: computing vpids by slot for job [2965,1]
> [devel11:80858] mca:rmaps:base: assigning rank 0 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 1 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 2 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 3 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 4 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 5 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 6 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 7 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 8 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 9 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 10 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 11 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 12 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 13 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 14 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 15 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 16 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 17 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 18 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 19 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 20 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 21 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 22 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 23 to node miriel025
> [devel11:80858] mca:rmaps:base: assigning rank 24 to node miriel02

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread r...@open-mpi.org
Okay, so your login node was able to figure out all the bindings. I don't see
any debug output from your compute nodes, which is suspicious.

Try adding --leave-session-attached to the cmd line and let’s see if we can 
capture the compute node daemon’s output
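
That is, something like:

  mpirun -np 48 -machinefile mf -bind-to core -report-bindings \
         -mca rmaps_base_verbose 5 --leave-session-attached ./a.out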

> On Apr 13, 2017, at 7:48 AM, Cyril Bordage  wrote:
> 
> My machine file is: miriel025*24 miriel026*24
> 
> Le 13/04/2017 à 16:46, Cyril Bordage a écrit :
>> There is the output:
>> ##
>> [devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
>> [devel11:80858] mca:rmaps:select: checking available component mindist
>> [devel11:80858] mca:rmaps:select: Querying component [mindist]
>> [devel11:80858] mca:rmaps:select: checking available component ppr
>> [devel11:80858] mca:rmaps:select: Querying component [ppr]
>> [devel11:80858] mca:rmaps:select: checking available component rank_file
>> [devel11:80858] mca:rmaps:select: Querying component [rank_file]
>> [devel11:80858] mca:rmaps:select: checking available component resilient
>> [devel11:80858] mca:rmaps:select: Querying component [resilient]
>> [devel11:80858] mca:rmaps:select: checking available component round_robin
>> [devel11:80858] mca:rmaps:select: Querying component [round_robin]
>> [devel11:80858] mca:rmaps:select: checking available component seq
>> [devel11:80858] mca:rmaps:select: Querying component [seq]
>> [devel11:80858] [[2965,0],0]: Final mapper priorities
>> [devel11:80858] Mapper: ppr Priority: 90
>> [devel11:80858] Mapper: seq Priority: 60
>> [devel11:80858] Mapper: resilient Priority: 40
>> [devel11:80858] Mapper: mindist Priority: 20
>> [devel11:80858] Mapper: round_robin Priority: 10
>> [devel11:80858] Mapper: rank_file Priority: 0
>> [devel11:80858] mca:rmaps: mapping job [2965,1]
>> [devel11:80858] mca:rmaps: setting mapping policies for job [2965,1]
>> nprocs 48
>> [devel11:80858] mca:rmaps[169] mapping not set by user - using bynuma
>> [devel11:80858] mca:rmaps:ppr: job [2965,1] not using ppr mapper PPR
>> NULL policy PPR NOTSET
>> [devel11:80858] [[2965,0],0] rmaps:seq called on job [2965,1]
>> [devel11:80858] mca:rmaps:seq: job [2965,1] not using seq mapper
>> [devel11:80858] mca:rmaps:resilient: cannot perform initial map of job
>> [2965,1] - no fault groups
>> [devel11:80858] mca:rmaps:mindist: job [2965,1] not using mindist mapper
>> [devel11:80858] mca:rmaps:rr: mapping job [2965,1]
>> [devel11:80858] [[2965,0],0] Starting with 2 nodes in list
>> [devel11:80858] [[2965,0],0] Filtering thru apps
>> [devel11:80858] [[2965,0],0] Retained 2 nodes in list
>> [devel11:80858] [[2965,0],0] node miriel025 has 24 slots available
>> [devel11:80858] [[2965,0],0] node miriel026 has 24 slots available
>> [devel11:80858] AVAILABLE NODES FOR MAPPING:
>> [devel11:80858] node: miriel025 daemon: 1
>> [devel11:80858] node: miriel026 daemon: 2
>> [devel11:80858] [[2965,0],0] Starting bookmark at node miriel025
>> [devel11:80858] [[2965,0],0] Starting at node miriel025
>> [devel11:80858] mca:rmaps:rr: mapping no-span by NUMANode for job
>> [2965,1] slots 48 num_procs 48
>> [devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel025
>> [devel11:80858] mca:rmaps:rr: calculated nprocs 24
>> [devel11:80858] mca:rmaps:rr: assigning nprocs 24
>> [devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel026
>> [devel11:80858] mca:rmaps:rr: calculated nprocs 24
>> [devel11:80858] mca:rmaps:rr: assigning nprocs 24
>> [devel11:80858] mca:rmaps:base: computing vpids by slot for job [2965,1]
>> [devel11:80858] mca:rmaps:base: assigning rank 0 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 1 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 2 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 3 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 4 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 5 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 6 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 7 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 8 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 9 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 10 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 11 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 12 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 13 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 14 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 15 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 16 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 17 to node miriel025
>> [devel11:80858] mca:rmaps:base: assigning rank 18 to node miriel025
>> [devel11:80858] mca:

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
[devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
[devel11:17550] mca:rmaps:select: checking available component mindist
[devel11:17550] mca:rmaps:select: Querying component [mindist]
[devel11:17550] mca:rmaps:select: checking available component ppr
[devel11:17550] mca:rmaps:select: Querying component [ppr]
[devel11:17550] mca:rmaps:select: checking available component rank_file
[devel11:17550] mca:rmaps:select: Querying component [rank_file]
[devel11:17550] mca:rmaps:select: checking available component resilient
[devel11:17550] mca:rmaps:select: Querying component [resilient]
[devel11:17550] mca:rmaps:select: checking available component round_robin
[devel11:17550] mca:rmaps:select: Querying component [round_robin]
[devel11:17550] mca:rmaps:select: checking available component seq
[devel11:17550] mca:rmaps:select: Querying component [seq]
[devel11:17550] [[29888,0],0]: Final mapper priorities
[devel11:17550] Mapper: ppr Priority: 90
[devel11:17550] Mapper: seq Priority: 60
[devel11:17550] Mapper: resilient Priority: 40
[devel11:17550] Mapper: mindist Priority: 20
[devel11:17550] Mapper: round_robin Priority: 10
[devel11:17550] Mapper: rank_file Priority: 0
[miriel025:62329] [[29888,0],1] rmaps:base set policy with NULL device
NONNULL
[miriel025:62329] mca:rmaps:select: checking available component mindist
[miriel025:62329] mca:rmaps:select: Querying component [mindist]
[miriel025:62329] mca:rmaps:select: checking available component ppr
[miriel025:62329] mca:rmaps:select: Querying component [ppr]
[miriel025:62329] mca:rmaps:select: checking available component rank_file
[miriel025:62329] mca:rmaps:select: Querying component [rank_file]
[miriel026:165125] [[29888,0],2] rmaps:base set policy with NULL device
NONNULL
[miriel026:165125] mca:rmaps:select: checking available component mindist
[miriel026:165125] mca:rmaps:select: Querying component [mindist]
[miriel026:165125] mca:rmaps:select: checking available component ppr
[miriel026:165125] mca:rmaps:select: Querying component [ppr]
[miriel026:165125] mca:rmaps:select: checking available component rank_file
[miriel026:165125] mca:rmaps:select: Querying component [rank_file]
[miriel026:165125] mca:rmaps:select: checking available component resilient
[miriel026:165125] mca:rmaps:select: Querying component [resilient]
[miriel026:165125] mca:rmaps:select: checking available component
round_robin
[miriel026:165125] mca:rmaps:select: Querying component [round_robin]
[miriel026:165125] mca:rmaps:select: checking available component seq
[miriel026:165125] mca:rmaps:select: Querying component [seq]
[miriel026:165125] [[29888,0],2]: Final mapper priorities
[miriel026:165125]  Mapper: ppr Priority: 90
[[devel11:17550] mca:rmaps: mapping job [29888,1]
[devel11:17550] mca:rmaps: setting mapping policies for job [29888,1]
nprocs 48
[devel11:17550] mca:rmaps[169] mapping not set by user - using bynuma
[devel11:17550] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR
NULL policy PPR NOTSET
[devel11:17550] [[29888,0],0] rmaps:seq called on job [29888,1]
[devel11:17550] mca:rmaps:seq: job [29888,1] not using seq mapper
[devel11:17550] mca:rmaps:resilient: cannot perform initial map of job
[29888,1] - no fault groups
[devel11:17550] mca:rmaps:mindist: job [29888,1] not using mindist mapper
[devel11:17550] mca:rmaps:rr: mapping job [29888,1]
[devel11:17550] [[29888,0],0] Starting with 2 nodes in list
[devel11:17550] [[29888,0],0] Filtering thru apps
[miriel025:62329] mca:rmaps:select: checking available component resilient
[miriel025:62329] mca:rmaps:select: Querying component [resilient]
[miriel025:62329] mca:rmaps:select: checking available component round_robin
[miriel025:62329] mca:rmaps:select: Querying component [round_robin]
[miriel025:62329] mca:rmaps:select: checking available component seq
[miriel025:62329] mca:rmaps:select: Querying component [seq]
[miriel025:62329] [[29888,0],1]: Final mapper priorities
[miriel025:62329]   Mapper: ppr Priority: 90
[miriel025:62329]   Mapper: seq Priority: 60
[miriel025:62329]   Mapper: resilient Priority: 40
[miriel025:62329]   Mapper: mindist Priority: 20
[miriel025:62329]   Mapper: round_robin Priority: 10
[miriel025:62329]   Mapper: rank_file Priority: 0
[miriel025:62329] mca:rmaps: mapping job [29888,1]
[miriel025:62329] mca:rmaps: setting mapping policies for job [29888,1]
nprocs 48
[miriel025:62329] mca:rmaps[169] mapping not set by user - using bynuma
[miriel025:62329] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR
NULL policy PPR NOTSET
[miriel025:62329] [[29888,0],1] rmaps:seq called on job [29888,1]
[miriel025:62329] mca:rmaps:seq: job [29888,1] not using seq mapper
[miriel025:62329] mca:rmaps:resilient: cannot perform initial map of job
[29888,1] - no fault groups
[miriel025:62329] mca:rmaps:mindist: job [29888,1] not using mindist mapper
[miriel025:62329] mca:rmaps:rr: mapping job [29888,1]
[miriel025:62329

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread r...@open-mpi.org
Okay, so as far as OMPI is concerned, it correctly bound everyone! So how are 
you generating this output claiming it isn’t bound?

> On Apr 13, 2017, at 7:57 AM, Cyril Bordage  wrote:
> 
> devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
> [devel11:17550] mca:rmaps:select: checking available component mindist
> [devel11:17550] mca:rmaps:select: Querying component [mindist]
> [devel11:17550] mca:rmaps:select: checking available component ppr
> [devel11:17550] mca:rmaps:select: Querying component [ppr]
> [devel11:17550] mca:rmaps:select: checking available component rank_file
> [devel11:17550] mca:rmaps:select: Querying component [rank_file]
> [devel11:17550] mca:rmaps:select: checking available component resilient
> [devel11:17550] mca:rmaps:select: Querying component [resilient]
> [devel11:17550] mca:rmaps:select: checking available component round_robin
> [devel11:17550] mca:rmaps:select: Querying component [round_robin]
> [devel11:17550] mca:rmaps:select: checking available component seq
> [devel11:17550] mca:rmaps:select: Querying component [seq]
> [devel11:17550] [[29888,0],0]: Final mapper priorities
> [devel11:17550] Mapper: ppr Priority: 90
> [devel11:17550] Mapper: seq Priority: 60
> [devel11:17550] Mapper: resilient Priority: 40
> [devel11:17550] Mapper: mindist Priority: 20
> [devel11:17550] Mapper: round_robin Priority: 10
> [devel11:17550] Mapper: rank_file Priority: 0
> [miriel025:62329] [[29888,0],1] rmaps:base set policy with NULL device
> NONNULL
> [miriel025:62329] mca:rmaps:select: checking available component mindist
> [miriel025:62329] mca:rmaps:select: Querying component [mindist]
> [miriel025:62329] mca:rmaps:select: checking available component ppr
> [miriel025:62329] mca:rmaps:select: Querying component [ppr]
> [miriel025:62329] mca:rmaps:select: checking available component rank_file
> [miriel025:62329] mca:rmaps:select: Querying component [rank_file]
> [miriel026:165125] [[29888,0],2] rmaps:base set policy with NULL device
> NONNULL
> [miriel026:165125] mca:rmaps:select: checking available component mindist
> [miriel026:165125] mca:rmaps:select: Querying component [mindist]
> [miriel026:165125] mca:rmaps:select: checking available component ppr
> [miriel026:165125] mca:rmaps:select: Querying component [ppr]
> [miriel026:165125] mca:rmaps:select: checking available component rank_file
> [miriel026:165125] mca:rmaps:select: Querying component [rank_file]
> [miriel026:165125] mca:rmaps:select: checking available component resilient
> [miriel026:165125] mca:rmaps:select: Querying component [resilient]
> [miriel026:165125] mca:rmaps:select: checking available component
> round_robin
> [miriel026:165125] mca:rmaps:select: Querying component [round_robin]
> [miriel026:165125] mca:rmaps:select: checking available component seq
> [miriel026:165125] mca:rmaps:select: Querying component [seq]
> [miriel026:165125] [[29888,0],2]: Final mapper priorities
> [miriel026:165125]  Mapper: ppr Priority: 90
> [[devel11:17550] mca:rmaps: mapping job [29888,1]
> [devel11:17550] mca:rmaps: setting mapping policies for job [29888,1]
> nprocs 48
> [devel11:17550] mca:rmaps[169] mapping not set by user - using bynuma
> [devel11:17550] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR
> NULL policy PPR NOTSET
> [devel11:17550] [[29888,0],0] rmaps:seq called on job [29888,1]
> [devel11:17550] mca:rmaps:seq: job [29888,1] not using seq mapper
> [devel11:17550] mca:rmaps:resilient: cannot perform initial map of job
> [29888,1] - no fault groups
> [devel11:17550] mca:rmaps:mindist: job [29888,1] not using mindist mapper
> [devel11:17550] mca:rmaps:rr: mapping job [29888,1]
> [devel11:17550] [[29888,0],0] Starting with 2 nodes in list
> [devel11:17550] [[29888,0],0] Filtering thru apps
> [miriel025:62329] mca:rmaps:select: checking available component resilient
> [miriel025:62329] mca:rmaps:select: Querying component [resilient]
> [miriel025:62329] mca:rmaps:select: checking available component round_robin
> [miriel025:62329] mca:rmaps:select: Querying component [round_robin]
> [miriel025:62329] mca:rmaps:select: checking available component seq
> [miriel025:62329] mca:rmaps:select: Querying component [seq]
> [miriel025:62329] [[29888,0],1]: Final mapper priorities
> [miriel025:62329]   Mapper: ppr Priority: 90
> [miriel025:62329]   Mapper: seq Priority: 60
> [miriel025:62329]   Mapper: resilient Priority: 40
> [miriel025:62329]   Mapper: mindist Priority: 20
> [miriel025:62329]   Mapper: round_robin Priority: 10
> [miriel025:62329]   Mapper: rank_file Priority: 0
> [miriel025:62329] mca:rmaps: mapping job [29888,1]
> [miriel025:62329] mca:rmaps: setting mapping policies for job [29888,1]
> nprocs 48
> [miriel025:62329] mca:rmaps[169] mapping not set by user - using bynuma
> [miriel025:62329] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR
> NULL policy PPR NOTSET
> [miriel025:62329] [[29888,0],1] rma

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Cyril Bordage
'-report-bindings' does that.
I used this option because the ranks did not seem to be bound (if I use
a rank file the performance is far better).
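
For reference, the kind of rank file I mean is only a few lines like the
following (the hostnames and slot numbers here are just placeholders, not my
actual file):

  rank 0=miriel025 slot=0:0
  rank 1=miriel025 slot=1:0
  rank 2=miriel025 slot=0:1
  rank 3=miriel025 slot=1:1

used with something like:

  mpirun -np 4 --rankfile myrankfile --report-bindings ./my_app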

On 13/04/2017 at 17:24, r...@open-mpi.org wrote:
> Okay, so as far as OMPI is concerned, it correctly bound everyone! So how are 
> you generating this output claiming it isn’t bound?
> 
>> On Apr 13, 2017, at 7:57 AM, Cyril Bordage  wrote:
>>
>> devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
>> [devel11:17550] mca:rmaps:select: checking available component mindist
>> [devel11:17550] mca:rmaps:select: Querying component [mindist]
>> [devel11:17550] mca:rmaps:select: checking available component ppr
>> [devel11:17550] mca:rmaps:select: Querying component [ppr]
>> [devel11:17550] mca:rmaps:select: checking available component rank_file
>> [devel11:17550] mca:rmaps:select: Querying component [rank_file]
>> [devel11:17550] mca:rmaps:select: checking available component resilient
>> [devel11:17550] mca:rmaps:select: Querying component [resilient]
>> [devel11:17550] mca:rmaps:select: checking available component round_robin
>> [devel11:17550] mca:rmaps:select: Querying component [round_robin]
>> [devel11:17550] mca:rmaps:select: checking available component seq
>> [devel11:17550] mca:rmaps:select: Querying component [seq]
>> [devel11:17550] [[29888,0],0]: Final mapper priorities
>> [devel11:17550] Mapper: ppr Priority: 90
>> [devel11:17550] Mapper: seq Priority: 60
>> [devel11:17550] Mapper: resilient Priority: 40
>> [devel11:17550] Mapper: mindist Priority: 20
>> [devel11:17550] Mapper: round_robin Priority: 10
>> [devel11:17550] Mapper: rank_file Priority: 0
>> [miriel025:62329] [[29888,0],1] rmaps:base set policy with NULL device
>> NONNULL
>> [miriel025:62329] mca:rmaps:select: checking available component mindist
>> [miriel025:62329] mca:rmaps:select: Querying component [mindist]
>> [miriel025:62329] mca:rmaps:select: checking available component ppr
>> [miriel025:62329] mca:rmaps:select: Querying component [ppr]
>> [miriel025:62329] mca:rmaps:select: checking available component rank_file
>> [miriel025:62329] mca:rmaps:select: Querying component [rank_file]
>> [miriel026:165125] [[29888,0],2] rmaps:base set policy with NULL device
>> NONNULL
>> [miriel026:165125] mca:rmaps:select: checking available component mindist
>> [miriel026:165125] mca:rmaps:select: Querying component [mindist]
>> [miriel026:165125] mca:rmaps:select: checking available component ppr
>> [miriel026:165125] mca:rmaps:select: Querying component [ppr]
>> [miriel026:165125] mca:rmaps:select: checking available component rank_file
>> [miriel026:165125] mca:rmaps:select: Querying component [rank_file]
>> [miriel026:165125] mca:rmaps:select: checking available component resilient
>> [miriel026:165125] mca:rmaps:select: Querying component [resilient]
>> [miriel026:165125] mca:rmaps:select: checking available component
>> round_robin
>> [miriel026:165125] mca:rmaps:select: Querying component [round_robin]
>> [miriel026:165125] mca:rmaps:select: checking available component seq
>> [miriel026:165125] mca:rmaps:select: Querying component [seq]
>> [miriel026:165125] [[29888,0],2]: Final mapper priorities
>> [miriel026:165125]  Mapper: ppr Priority: 90
>> [[devel11:17550] mca:rmaps: mapping job [29888,1]
>> [devel11:17550] mca:rmaps: setting mapping policies for job [29888,1]
>> nprocs 48
>> [devel11:17550] mca:rmaps[169] mapping not set by user - using bynuma
>> [devel11:17550] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR
>> NULL policy PPR NOTSET
>> [devel11:17550] [[29888,0],0] rmaps:seq called on job [29888,1]
>> [devel11:17550] mca:rmaps:seq: job [29888,1] not using seq mapper
>> [devel11:17550] mca:rmaps:resilient: cannot perform initial map of job
>> [29888,1] - no fault groups
>> [devel11:17550] mca:rmaps:mindist: job [29888,1] not using mindist mapper
>> [devel11:17550] mca:rmaps:rr: mapping job [29888,1]
>> [devel11:17550] [[29888,0],0] Starting with 2 nodes in list
>> [devel11:17550] [[29888,0],0] Filtering thru apps
>> [miriel025:62329] mca:rmaps:select: checking available component resilient
>> [miriel025:62329] mca:rmaps:select: Querying component [resilient]
>> [miriel025:62329] mca:rmaps:select: checking available component round_robin
>> [miriel025:62329] mca:rmaps:select: Querying component [round_robin]
>> [miriel025:62329] mca:rmaps:select: checking available component seq
>> [miriel025:62329] mca:rmaps:select: Querying component [seq]
>> [miriel025:62329] [[29888,0],1]: Final mapper priorities
>> [miriel025:62329]   Mapper: ppr Priority: 90
>> [miriel025:62329]   Mapper: seq Priority: 60
>> [miriel025:62329]   Mapper: resilient Priority: 40
>> [miriel025:62329]   Mapper: mindist Priority: 20
>> [miriel025:62329]   Mapper: round_robin Priority: 10
>> [miriel025:62329]   Mapper: rank_file Priority: 0
>> [miriel025:62329] mca:rmaps: mapping job [29888,1]
>> [miriel025:6232

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread r...@open-mpi.org
All right, let’s replace rmaps_base_verbose with odls_base_verbose and see what
that says.
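
Something along these lines should do it (the verbosity level and the app name
are just examples):

  mpirun --mca odls_base_verbose 100 -np 48 -bind-to core --report-bindings ./my_app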

> On Apr 13, 2017, at 8:34 AM, Cyril Bordage  wrote:
> 
> '-report-bindings' does that.
> I used this option because the ranks did not seem to be bound (if I use
> a rank file the performance is far better).
> 
> On 13/04/2017 at 17:24, r...@open-mpi.org wrote:
>> Okay, so as far as OMPI is concerned, it correctly bound everyone! So how 
>> are you generating this output claiming it isn’t bound?
>> 
>>> On Apr 13, 2017, at 7:57 AM, Cyril Bordage  wrote:
>>> 
>>> devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
>>> [devel11:17550] mca:rmaps:select: checking available component mindist
>>> [devel11:17550] mca:rmaps:select: Querying component [mindist]
>>> [devel11:17550] mca:rmaps:select: checking available component ppr
>>> [devel11:17550] mca:rmaps:select: Querying component [ppr]
>>> [devel11:17550] mca:rmaps:select: checking available component rank_file
>>> [devel11:17550] mca:rmaps:select: Querying component [rank_file]
>>> [devel11:17550] mca:rmaps:select: checking available component resilient
>>> [devel11:17550] mca:rmaps:select: Querying component [resilient]
>>> [devel11:17550] mca:rmaps:select: checking available component round_robin
>>> [devel11:17550] mca:rmaps:select: Querying component [round_robin]
>>> [devel11:17550] mca:rmaps:select: checking available component seq
>>> [devel11:17550] mca:rmaps:select: Querying component [seq]
>>> [devel11:17550] [[29888,0],0]: Final mapper priorities
>>> [devel11:17550] Mapper: ppr Priority: 90
>>> [devel11:17550] Mapper: seq Priority: 60
>>> [devel11:17550] Mapper: resilient Priority: 40
>>> [devel11:17550] Mapper: mindist Priority: 20
>>> [devel11:17550] Mapper: round_robin Priority: 10
>>> [devel11:17550] Mapper: rank_file Priority: 0
>>> [miriel025:62329] [[29888,0],1] rmaps:base set policy with NULL device
>>> NONNULL
>>> [miriel025:62329] mca:rmaps:select: checking available component mindist
>>> [miriel025:62329] mca:rmaps:select: Querying component [mindist]
>>> [miriel025:62329] mca:rmaps:select: checking available component ppr
>>> [miriel025:62329] mca:rmaps:select: Querying component [ppr]
>>> [miriel025:62329] mca:rmaps:select: checking available component rank_file
>>> [miriel025:62329] mca:rmaps:select: Querying component [rank_file]
>>> [miriel026:165125] [[29888,0],2] rmaps:base set policy with NULL device
>>> NONNULL
>>> [miriel026:165125] mca:rmaps:select: checking available component mindist
>>> [miriel026:165125] mca:rmaps:select: Querying component [mindist]
>>> [miriel026:165125] mca:rmaps:select: checking available component ppr
>>> [miriel026:165125] mca:rmaps:select: Querying component [ppr]
>>> [miriel026:165125] mca:rmaps:select: checking available component rank_file
>>> [miriel026:165125] mca:rmaps:select: Querying component [rank_file]
>>> [miriel026:165125] mca:rmaps:select: checking available component resilient
>>> [miriel026:165125] mca:rmaps:select: Querying component [resilient]
>>> [miriel026:165125] mca:rmaps:select: checking available component
>>> round_robin
>>> [miriel026:165125] mca:rmaps:select: Querying component [round_robin]
>>> [miriel026:165125] mca:rmaps:select: checking available component seq
>>> [miriel026:165125] mca:rmaps:select: Querying component [seq]
>>> [miriel026:165125] [[29888,0],2]: Final mapper priorities
>>> [miriel026:165125]  Mapper: ppr Priority: 90
>>> [[devel11:17550] mca:rmaps: mapping job [29888,1]
>>> [devel11:17550] mca:rmaps: setting mapping policies for job [29888,1]
>>> nprocs 48
>>> [devel11:17550] mca:rmaps[169] mapping not set by user - using bynuma
>>> [devel11:17550] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR
>>> NULL policy PPR NOTSET
>>> [devel11:17550] [[29888,0],0] rmaps:seq called on job [29888,1]
>>> [devel11:17550] mca:rmaps:seq: job [29888,1] not using seq mapper
>>> [devel11:17550] mca:rmaps:resilient: cannot perform initial map of job
>>> [29888,1] - no fault groups
>>> [devel11:17550] mca:rmaps:mindist: job [29888,1] not using mindist mapper
>>> [devel11:17550] mca:rmaps:rr: mapping job [29888,1]
>>> [devel11:17550] [[29888,0],0] Starting with 2 nodes in list
>>> [devel11:17550] [[29888,0],0] Filtering thru apps
>>> [miriel025:62329] mca:rmaps:select: checking available component resilient
>>> [miriel025:62329] mca:rmaps:select: Querying component [resilient]
>>> [miriel025:62329] mca:rmaps:select: checking available component round_robin
>>> [miriel025:62329] mca:rmaps:select: Querying component [round_robin]
>>> [miriel025:62329] mca:rmaps:select: checking available component seq
>>> [miriel025:62329] mca:rmaps:select: Querying component [seq]
>>> [miriel025:62329] [[29888,0],1]: Final mapper priorities
>>> [miriel025:62329]   Mapper: ppr Priority: 90
>>> [miriel025:62329]   Mapper: seq Priority: 60
>>> [miriel025:62329]   Mapper: resilient Priority: 40
>>> [miriel

Re: [OMPI devel] Problem with bind-to

2017-04-13 Thread Gilles Gouaillardet

Ralph,


I can simply reproduce the issue with two nodes and the latest master.

All commands are run on n1, which has the same topology (2 sockets * 8
cores each) as n2.



1) everything works

$ mpirun -np 16 -bind-to core --report-bindings true
[n1:29794] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././.][./././././././.]
[n1:29794] MCW rank 1 bound to socket 1[core 8[hwt 0]]: 
[./././././././.][B/././././././.]
[n1:29794] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././.][./././././././.]
[n1:29794] MCW rank 3 bound to socket 1[core 9[hwt 0]]: 
[./././././././.][./B/./././././.]
[n1:29794] MCW rank 4 bound to socket 0[core 2[hwt 0]]: 
[././B/././././.][./././././././.]
[n1:29794] MCW rank 5 bound to socket 1[core 10[hwt 0]]: 
[./././././././.][././B/././././.]
[n1:29794] MCW rank 6 bound to socket 0[core 3[hwt 0]]: 
[./././B/./././.][./././././././.]
[n1:29794] MCW rank 7 bound to socket 1[core 11[hwt 0]]: 
[./././././././.][./././B/./././.]
[n1:29794] MCW rank 8 bound to socket 0[core 4[hwt 0]]: 
[././././B/././.][./././././././.]
[n1:29794] MCW rank 9 bound to socket 1[core 12[hwt 0]]: 
[./././././././.][././././B/././.]
[n1:29794] MCW rank 10 bound to socket 0[core 5[hwt 0]]: 
[./././././B/./.][./././././././.]
[n1:29794] MCW rank 11 bound to socket 1[core 13[hwt 0]]: 
[./././././././.][./././././B/./.]
[n1:29794] MCW rank 12 bound to socket 0[core 6[hwt 0]]: 
[././././././B/.][./././././././.]
[n1:29794] MCW rank 13 bound to socket 1[core 14[hwt 0]]: 
[./././././././.][././././././B/.]
[n1:29794] MCW rank 14 bound to socket 0[core 7[hwt 0]]: 
[./././././././B][./././././././.]
[n1:29794] MCW rank 15 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]


$ mpirun -np 16 -bind-to core --host n1:16 --report-bindings true
[n1:29850] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././.][./././././././.]
[n1:29850] MCW rank 1 bound to socket 1[core 8[hwt 0]]: 
[./././././././.][B/././././././.]
[n1:29850] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././.][./././././././.]
[n1:29850] MCW rank 3 bound to socket 1[core 9[hwt 0]]: 
[./././././././.][./B/./././././.]
[n1:29850] MCW rank 4 bound to socket 0[core 2[hwt 0]]: 
[././B/././././.][./././././././.]
[n1:29850] MCW rank 5 bound to socket 1[core 10[hwt 0]]: 
[./././././././.][././B/././././.]
[n1:29850] MCW rank 6 bound to socket 0[core 3[hwt 0]]: 
[./././B/./././.][./././././././.]
[n1:29850] MCW rank 7 bound to socket 1[core 11[hwt 0]]: 
[./././././././.][./././B/./././.]
[n1:29850] MCW rank 8 bound to socket 0[core 4[hwt 0]]: 
[././././B/././.][./././././././.]
[n1:29850] MCW rank 9 bound to socket 1[core 12[hwt 0]]: 
[./././././././.][././././B/././.]
[n1:29850] MCW rank 10 bound to socket 0[core 5[hwt 0]]: 
[./././././B/./.][./././././././.]
[n1:29850] MCW rank 11 bound to socket 1[core 13[hwt 0]]: 
[./././././././.][./././././B/./.]
[n1:29850] MCW rank 12 bound to socket 0[core 6[hwt 0]]: 
[././././././B/.][./././././././.]
[n1:29850] MCW rank 13 bound to socket 1[core 14[hwt 0]]: 
[./././././././.][././././././B/.]
[n1:29850] MCW rank 14 bound to socket 0[core 7[hwt 0]]: 
[./././././././B][./././././././.]
[n1:29850] MCW rank 15 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]


2) with another node

$ mpirun -np 16 -bind-to core --host n2:16 --report-bindings true

/* no output with a non-MPI app! */
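
As a side check (assuming hwloc is installed on n2), each process's actual
cpuset can be printed with a non-MPI tool:

  mpirun -np 16 -bind-to core --host n2:16 hwloc-bind --get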

$ mpirun -np 16 -bind-to core --host n2:16 --report-bindings ./hello_c
[n2:52851] MCW rank 0 not bound
[n2:52852] MCW rank 1 not bound
[n2:52853] MCW rank 2 not bound
[n2:52854] MCW rank 3 not bound
[n2:52855] MCW rank 4 not bound
[n2:52856] MCW rank 5 not bound
[n2:52857] MCW rank 6 not bound
[n2:52859] MCW rank 7 not bound
[n2:52861] MCW rank 8 not bound
[n2:52864] MCW rank 9 not bound
[n2:52866] MCW rank 10 not bound
[n2:52869] MCW rank 11 not bound
[n2:52877] MCW rank 15 not bound
[n2:52871] MCW rank 12 not bound
[n2:52873] MCW rank 13 not bound
[n2:52876] MCW rank 14 not bound
Hello, world, I am 0 of 16, (Open MPI v4.0.0a1, package: Open MPI 
gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, 
Unreleased developer copy, 165)
Hello, world, I am 1 of 16, (Open MPI v4.0.0a1, package: Open MPI 
gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, 
Unreleased developer copy, 165)
Hello, world, I am 2 of 16, (Open MPI v4.0.0a1, package: Open MPI 
gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, 
Unreleased developer copy, 165)
Hello, world, I am 3 of 16, (Open MPI v4.0.0a1, package: Open MPI 
gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, 
Unreleased developer copy, 165)
Hello, world, I am 4 of 16, (Open MPI v4.0.0a1, package: Open MPI 
gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, 
Unreleased developer copy, 165)
Hello, world, I am 5 of 16, (Open MPI v4.0.0a1, package: Open MPI 
gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-