[OMPI devel] Unable to complete a TCP connection
I cloned the latest version of Open MPI from GitHub on Grid'5000. 128 nodes were reserved from the Nancy site. During execution of my MPI code I got the error message below:

A process or daemon was unable to complete a TCP connection to another process:
Local host: graphene-17
Remote host: graphene-91
This is usually caused by a firewall on the remote host. Please check that any firewall (e.g., iptables) has been disabled and try again.

I deployed my own OS image, and everything is OK with the firewall. Since the same OS image was deployed on all reserved nodes, the fact that Open MPI could connect to some of them and execute code means the firewall accepted the input.

There is no problem connecting to graphene-91 with ssh, but the command line below does not work:

mpirun -host graphene-91 -n 1 exec_code

I get the same message, "Unable to complete a TCP connection".

Sometimes I got this error:

WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.
Local host: graphene-27
--
[graphene-26][[56971,1],52][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 172.18.64.25 failed: No route to host (113)
[graphene-29][[56971,1],60][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 172.18.64.27 failed: No route to host (113)
[graphene-14.nancy.grid5000.fr:02890] 15 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[graphene-14.nancy.grid5000.fr:02890] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

When I change the command line, using an MCA parameter to select eth0, there is another error. This is not a stable version; maybe that is why I get this kind of error?

Yours faithfully,
Emin Nuriyev
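A quick way to separate a network/firewall problem from an Open MPI problem is to test a raw TCP connection between the two hosts. A minimal sketch with netcat, assuming it is installed and that port 5000 happens to be free on the remote node (both assumptions; exact flags vary between netcat flavors, and Open MPI itself uses ephemeral ports unless told otherwise):

# on graphene-91: listen on an arbitrary TCP port
nc -l 5000          # some netcat variants need: nc -l -p 5000

# on graphene-17: attempt a zero-I/O connection to that port
nc -vz graphene-91 5000

If this also reports "No route to host", the problem is in routing or packet filtering, not in Open MPI.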
Re: [OMPI devel] Problem with bind-to
Hi,

now this bug also happens when I launch my mpirun command from a compute node.

Cyril.

On 06/04/2017 at 05:38, r...@open-mpi.org wrote:
> I believe this has been fixed now - please let me know
>
>> On Mar 30, 2017, at 1:57 AM, Cyril Bordage wrote:
>>
>> Hello,
>>
>> I am using the git version of Open MPI with "-bind-to core -report-bindings" and I get this for all processes:
>> [miriel010:160662] MCW rank 0 not bound
>>
>> When I use an old version I get:
>> [miriel010:44921] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.]
>>
>> From git bisect the culprit seems to be: 48fc339
>>
>> This bug happens only when I launch my mpirun command from a login node and not from a compute node.
>>
>> Cyril.
Re: [OMPI devel] Problem with bind-to
Can you be a bit more specific?

- What version of Open MPI are you using?
- How did you configure Open MPI?
- How are you launching Open MPI applications?

> On Apr 13, 2017, at 9:08 AM, Cyril Bordage wrote:
>
> Hi,
>
> now this bug also happens when I launch my mpirun command from a compute node.
>
> Cyril.
>
> On 06/04/2017 at 05:38, r...@open-mpi.org wrote:
>> I believe this has been fixed now - please let me know
>> [...]

--
Jeff Squyres
jsquy...@cisco.com
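For the first two questions, ompi_info prints both the version and the configure invocation near the top of its output; a quick way to capture them:

ompi_info | head -n 20
# the "Open MPI:" line gives the version, and the
# "Configure command line:" field shows how it was configured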
Re: [OMPI devel] Problem with bind-to
Also, can you please run lstopo on both your login and compute nodes?

Cheers,

Gilles

----- Original Message -----
> Can you be a bit more specific?
>
> - What version of Open MPI are you using?
> - How did you configure Open MPI?
> - How are you launching Open MPI applications?
>
>> [...]
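A way to compare the two topologies exactly, rather than by eye, is to dump each one to hwloc's XML format and diff the files; lstopo picks the output format from the file extension. A minimal sketch (file names are arbitrary):

lstopo login-node.xml      # run on the login node
lstopo compute-node.xml    # run on a compute node
diff login-node.xml compute-node.xml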
Re: [OMPI devel] Problem with bind-to
I am using the 6886c12 commit.
I have no particular option for the configuration.
I launch my application in the same way as I presented in my first email; here is the exact line:

mpirun -np 48 -machinefile mf -bind-to core -report-bindings ./a.out

lstopo gives the same output on both types of nodes. What is the purpose of that?

Thanks.

Cyril.

On 13/04/2017 at 15:24, gil...@rist.or.jp wrote:
> Also, can you please run lstopo on both your login and compute nodes?
>
> Cheers,
>
> Gilles
>
>> [...]
Re: [OMPI devel] Problem with bind-to
OK thanks,

we've had some issues in the past when Open MPI assumed that the (login) node running mpirun has the same topology as the other (compute) nodes. I just wanted to rule out this scenario.

Cheers,

Gilles

----- Original Message -----
> I am using the 6886c12 commit.
> I have no particular option for the configuration.
> I launch my application in the same way as I presented in my first email; here is the exact line:
> mpirun -np 48 -machinefile mf -bind-to core -report-bindings ./a.out
>
> lstopo gives the same output on both types of nodes. What is the purpose of that?
>
> Thanks.
>
> Cyril.
>
>> [...]
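A hedged aside: mpirun builds of roughly this era also had a --hetero-nodes option that tells ORTE not to assume every node matches the head node's topology and to fetch each daemon's topology instead; whether a given git snapshot still accepts it is best checked with mpirun --help. A sketch, assuming the option exists in the build:

mpirun --hetero-nodes -np 48 -machinefile mf -bind-to core -report-bindings ./a.out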
Re: [OMPI devel] Unable to complete a TCP connection
There are several kinds of communication:
- ssh from mpirun to the compute nodes, and also between compute nodes (assuming you use a machine file and no supported batch manager), to spawn the orted daemons
- oob/tcp connections between the orteds
- btl/tcp connections between MPI tasks

You can restrict the port ranges used by oob/tcp and btl/tcp and open them if you really want a firewall. (I strongly suggest you try without a firewall first.)

Now, looking at the error message "No route to host": that could mean there really is no route, so you should include/exclude some subnets/interfaces:

mpirun --mca btl_tcp_if_include ... --mca oob_tcp_if_include ... ...

Or there might be a route, but the firewall reports otherwise.

Cheers,

Gilles

On Thursday, April 13, 2017, Emin Nuriyev wrote:
> I cloned the latest version of Open MPI from GitHub on Grid'5000.
> [...]
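For reference, a sketch of what pinning the two port ranges can look like, so a firewall can open them explicitly. The ranges here are arbitrary examples, and the exact parameter names should be confirmed against the installed version with ompi_info:

ompi_info --param oob tcp --level 9    # lists this build's oob/tcp parameters
ompi_info --param btl tcp --level 9    # lists this build's btl/tcp parameters

mpirun --mca oob_tcp_dynamic_ipv4_ports 11000-11100 \
       --mca btl_tcp_port_min_v4 12000 \
       --mca btl_tcp_port_range_v4 100 \
       -host graphene-91 -n 1 exec_code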
Re: [OMPI devel] Problem with bind-to
We are asking all these questions because we cannot replicate your problem - so we are trying to help you figure out what is different or missing from your machine. When I run your cmd line on my system, I get:

[rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 25 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 26 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 27 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 28 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 29 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../..][../../BB/../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 30 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 31 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../..][../../../BB/../../../../../../../..]
[rhc002.cluster:55965] MCW rank 32 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 33 bound to socket 1[core 16[hwt 0-1]]: [../../../../../../../../../../../..][../../../../BB/../../../../../../..]
[rhc002.cluster:55965] MCW rank 34 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 35 bound to socket 1[core 17[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../BB/../../../../../..]
[rhc002.cluster:55965] MCW rank 36 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/../../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 37 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../BB/../../../../..]
[rhc002.cluster:55965] MCW rank 38 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/../../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 39 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../BB/../../../..]
[rhc002.cluster:55965] MCW rank 40 bound to socket 0[core 8[hwt 0-1]]: [../../../../../../../../BB/../../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 41 bound to socket 1[core 20[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../../BB/../../..]
[rhc002.cluster:55965] MCW rank 42 bound to socket 0[core 9[hwt 0-1]]: [../../../../../../../../../BB/../..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 43 bound to socket 1[core 21[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../../../BB/../..]
[rhc002.cluster:55965] MCW rank 44 bound to socket 0[core 10[hwt 0-1]]: [../../../../../../../../../../BB/..][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 45 bound to socket 1[core 22[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../../../../BB/..]
[rhc002.cluster:55965] MCW rank 46 bound to socket 0[core 11[hwt 0-1]]: [../../../../../../../../../../../BB][../../../../../../../../../../../..]
[rhc002.cluster:55965] MCW rank 47 bound to socket 1[core 23[hwt 0-1]]: [../../../../../../../../../../../..][../../../../../../../../../../../BB]
[rhc001:197743] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 1 bound to socket 1[core 12[hwt 0-1]]: [../../../../../../../../../../../..][BB/../../../../../../../../../../..]
[rhc001:197743] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 3 bound to socket 1[core 13[hwt 0-1]]: [../../../../../../../../../../../..][../BB/../../../../../../../../../..]
[rhc001:197743] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 5 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../..][../../BB/../../../../../../../../..]
[rhc001:197743] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../..][../../../../../../../../../../../..]
[rhc001:197743] MCW rank 7 bound to socket 1[core 15[hwt 0-1]]: [../../../../../../../../../../../..][../../../BB/../../../../../../../..]
[rhc001:197743] MCW rank 8 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../..][../../../../../../../../../../../..]
Re: [OMPI devel] Problem with bind-to
When I run this command from the compute node I also get that. But not when I run it from a login node (with the same machine file).

Cyril.

On 13/04/2017 at 16:22, r...@open-mpi.org wrote:
> We are asking all these questions because we cannot replicate your problem - so we are trying to help you figure out what is different or missing from your machine. When I run your cmd line on my system, I get:
>
> [rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
> [...]
Re: [OMPI devel] Problem with bind-to
Try adding "-mca rmaps_base_verbose 5” and see what that output tells us - I assume you have a debug build configured, yes (i.e., added --enable-debug to configure line)? > On Apr 13, 2017, at 7:28 AM, Cyril Bordage wrote: > > When I run this command from the compute node I have also that. But not > when I run it from a login node (with the same machine file). > > > Cyril. > > Le 13/04/2017 à 16:22, r...@open-mpi.org a écrit : >> We are asking all these questions because we cannot replicate your problem - >> so we are trying to help you figure out what is different or missing from >> your machine. When I run your cmd line on my system, I get: >> >> [rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: >> [BB/../../../../../../../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 25 bound to socket 1[core 12[hwt 0-1]]: >> [../../../../../../../../../../../..][BB/../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 26 bound to socket 0[core 1[hwt 0-1]]: >> [../BB/../../../../../../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 27 bound to socket 1[core 13[hwt 0-1]]: >> [../../../../../../../../../../../..][../BB/../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 28 bound to socket 0[core 2[hwt 0-1]]: >> [../../BB/../../../../../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 29 bound to socket 1[core 14[hwt 0-1]]: >> [../../../../../../../../../../../..][../../BB/../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 30 bound to socket 0[core 3[hwt 0-1]]: >> [../../../BB/../../../../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 31 bound to socket 1[core 15[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../BB/../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 32 bound to socket 0[core 4[hwt 0-1]]: >> [../../../../BB/../../../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 33 bound to socket 1[core 16[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../BB/../../../../../../..] >> [rhc002.cluster:55965] MCW rank 34 bound to socket 0[core 5[hwt 0-1]]: >> [../../../../../BB/../../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 35 bound to socket 1[core 17[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../BB/../../../../../..] >> [rhc002.cluster:55965] MCW rank 36 bound to socket 0[core 6[hwt 0-1]]: >> [../../../../../../BB/../../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 37 bound to socket 1[core 18[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../../BB/../../../../..] >> [rhc002.cluster:55965] MCW rank 38 bound to socket 0[core 7[hwt 0-1]]: >> [../../../../../../../BB/../../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 39 bound to socket 1[core 19[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../../../BB/../../../..] >> [rhc002.cluster:55965] MCW rank 40 bound to socket 0[core 8[hwt 0-1]]: >> [../../../../../../../../BB/../../..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 41 bound to socket 1[core 20[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../../../../BB/../../..] >> [rhc002.cluster:55965] MCW rank 42 bound to socket 0[core 9[hwt 0-1]]: >> [../../../../../../../../../BB/../..][../../../../../../../../../../../..] 
>> [rhc002.cluster:55965] MCW rank 43 bound to socket 1[core 21[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../../../../../BB/../..] >> [rhc002.cluster:55965] MCW rank 44 bound to socket 0[core 10[hwt 0-1]]: >> [../../../../../../../../../../BB/..][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 45 bound to socket 1[core 22[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../../../../../../BB/..] >> [rhc002.cluster:55965] MCW rank 46 bound to socket 0[core 11[hwt 0-1]]: >> [../../../../../../../../../../../BB][../../../../../../../../../../../..] >> [rhc002.cluster:55965] MCW rank 47 bound to socket 1[core 23[hwt 0-1]]: >> [../../../../../../../../../../../..][../../../../../../../../../../../BB] >> [rhc001:197743] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: >> [BB/../../../../../../../../../../..][../../../../../../../../../../../..] >> [rhc001:197743] MCW rank 1 bound to socket 1[core 12[hwt 0-1]]: >> [../../../../../../../../../../../..][BB/../../../../../../../../../../..] >> [rhc001:197743] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: >> [../BB/../../../../../../../../../..][../../../../../../../../../../../..] >> [rhc001:197743] MCW rank 3 bound to socket 1[core 13[hwt 0-1]]: >> [../../../../../../../../../../../..][../BB/../../../../../../../../../..] >> [rhc001:197743] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]:
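Spelling that suggestion out as commands: the rmaps verbosity only produces output if Open MPI was configured with debug support, so the sequence is roughly (install prefix is a placeholder):

./configure --prefix=$HOME/ompi-debug --enable-debug
make -j install

mpirun -np 48 -machinefile mf -bind-to core -report-bindings \
       -mca rmaps_base_verbose 5 ./a.out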
Re: [OMPI devel] Problem with bind-to
Is your compute node included in your machine file? If yes, what if you invoke mpirun from a compute node not listed in your machine file?

It can also be helpful to post your machine file.

Cheers,

Gilles

On Thursday, April 13, 2017, Cyril Bordage wrote:
> When I run this command from the compute node I also get that. But not when I run it from a login node (with the same machine file).
>
> Cyril.
>
> On 13/04/2017 at 16:22, r...@open-mpi.org wrote:
>> We are asking all these questions because we cannot replicate your problem - so we are trying to help you figure out what is different or missing from your machine. When I run your cmd line on my system, I get:
>>
>> [rhc002.cluster:55965] MCW rank 24 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../..][../../../../../../../../../../../..]
>> [...]
Re: [OMPI devel] Problem with bind-to
Here is the output:
##
[devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
[devel11:80858] mca:rmaps:select: checking available component mindist
[devel11:80858] mca:rmaps:select: Querying component [mindist]
[devel11:80858] mca:rmaps:select: checking available component ppr
[devel11:80858] mca:rmaps:select: Querying component [ppr]
[devel11:80858] mca:rmaps:select: checking available component rank_file
[devel11:80858] mca:rmaps:select: Querying component [rank_file]
[devel11:80858] mca:rmaps:select: checking available component resilient
[devel11:80858] mca:rmaps:select: Querying component [resilient]
[devel11:80858] mca:rmaps:select: checking available component round_robin
[devel11:80858] mca:rmaps:select: Querying component [round_robin]
[devel11:80858] mca:rmaps:select: checking available component seq
[devel11:80858] mca:rmaps:select: Querying component [seq]
[devel11:80858] [[2965,0],0]: Final mapper priorities
[devel11:80858]   Mapper: ppr Priority: 90
[devel11:80858]   Mapper: seq Priority: 60
[devel11:80858]   Mapper: resilient Priority: 40
[devel11:80858]   Mapper: mindist Priority: 20
[devel11:80858]   Mapper: round_robin Priority: 10
[devel11:80858]   Mapper: rank_file Priority: 0
[devel11:80858] mca:rmaps: mapping job [2965,1]
[devel11:80858] mca:rmaps: setting mapping policies for job [2965,1] nprocs 48
[devel11:80858] mca:rmaps[169] mapping not set by user - using bynuma
[devel11:80858] mca:rmaps:ppr: job [2965,1] not using ppr mapper PPR NULL policy PPR NOTSET
[devel11:80858] [[2965,0],0] rmaps:seq called on job [2965,1]
[devel11:80858] mca:rmaps:seq: job [2965,1] not using seq mapper
[devel11:80858] mca:rmaps:resilient: cannot perform initial map of job [2965,1] - no fault groups
[devel11:80858] mca:rmaps:mindist: job [2965,1] not using mindist mapper
[devel11:80858] mca:rmaps:rr: mapping job [2965,1]
[devel11:80858] [[2965,0],0] Starting with 2 nodes in list
[devel11:80858] [[2965,0],0] Filtering thru apps
[devel11:80858] [[2965,0],0] Retained 2 nodes in list
[devel11:80858] [[2965,0],0] node miriel025 has 24 slots available
[devel11:80858] [[2965,0],0] node miriel026 has 24 slots available
[devel11:80858] AVAILABLE NODES FOR MAPPING:
[devel11:80858]     node: miriel025 daemon: 1
[devel11:80858]     node: miriel026 daemon: 2
[devel11:80858] [[2965,0],0] Starting bookmark at node miriel025
[devel11:80858] [[2965,0],0] Starting at node miriel025
[devel11:80858] mca:rmaps:rr: mapping no-span by NUMANode for job [2965,1] slots 48 num_procs 48
[devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel025
[devel11:80858] mca:rmaps:rr: calculated nprocs 24
[devel11:80858] mca:rmaps:rr: assigning nprocs 24
[devel11:80858] mca:rmaps:rr: found 4 NUMANode objects on node miriel026
[devel11:80858] mca:rmaps:rr: calculated nprocs 24
[devel11:80858] mca:rmaps:rr: assigning nprocs 24
[devel11:80858] mca:rmaps:base: computing vpids by slot for job [2965,1]
[devel11:80858] mca:rmaps:base: assigning rank 0 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 1 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 2 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 3 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 4 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 5 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 6 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 7 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 8 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 9 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 10 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 11 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 12 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 13 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 14 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 15 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 16 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 17 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 18 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 19 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 20 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 21 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 22 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 23 to node miriel025
[devel11:80858] mca:rmaps:base: assigning rank 24 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 25 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 26 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 27 to node miriel026
[devel11:80858] mca:rmaps:base: assigning rank 28 to node miriel026
Re: [OMPI devel] Problem with bind-to
My machine file is:
miriel025*24
miriel026*24

On 13/04/2017 at 16:46, Cyril Bordage wrote:
> Here is the output:
> ##
> [devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
> [...]
Re: [OMPI devel] Problem with bind-to
Okay, so your login node was able to figure out all the bindings. I don’t see any debug output from your compute nodes, which is suspicious. Try adding --leave-session-attached to the cmd line and let’s see if we can capture the compute node daemons’ output.

> On Apr 13, 2017, at 7:48 AM, Cyril Bordage wrote:
>
> My machine file is:
> miriel025*24
> miriel026*24
>
> On 13/04/2017 at 16:46, Cyril Bordage wrote:
>> Here is the output:
>> ##
>> [devel11:80858] [[2965,0],0] rmaps:base set policy with NULL device NONNULL
>> [...]
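Putting the two debugging flags together, the command line being asked for is roughly:

mpirun --leave-session-attached -mca rmaps_base_verbose 5 \
       -np 48 -machinefile mf -bind-to core -report-bindings ./a.out

--leave-session-attached keeps the ssh sessions to the remote orted daemons open, so their stderr (including the verbose rmaps output) makes it back to mpirun.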
Re: [OMPI devel] Problem with bind-to
[devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
[devel11:17550] mca:rmaps:select: checking available component mindist
[devel11:17550] mca:rmaps:select: Querying component [mindist]
[devel11:17550] mca:rmaps:select: checking available component ppr
[devel11:17550] mca:rmaps:select: Querying component [ppr]
[devel11:17550] mca:rmaps:select: checking available component rank_file
[devel11:17550] mca:rmaps:select: Querying component [rank_file]
[devel11:17550] mca:rmaps:select: checking available component resilient
[devel11:17550] mca:rmaps:select: Querying component [resilient]
[devel11:17550] mca:rmaps:select: checking available component round_robin
[devel11:17550] mca:rmaps:select: Querying component [round_robin]
[devel11:17550] mca:rmaps:select: checking available component seq
[devel11:17550] mca:rmaps:select: Querying component [seq]
[devel11:17550] [[29888,0],0]: Final mapper priorities
[devel11:17550]   Mapper: ppr Priority: 90
[devel11:17550]   Mapper: seq Priority: 60
[devel11:17550]   Mapper: resilient Priority: 40
[devel11:17550]   Mapper: mindist Priority: 20
[devel11:17550]   Mapper: round_robin Priority: 10
[devel11:17550]   Mapper: rank_file Priority: 0
[miriel025:62329] [[29888,0],1] rmaps:base set policy with NULL device NONNULL
[miriel025:62329] mca:rmaps:select: checking available component mindist
[miriel025:62329] mca:rmaps:select: Querying component [mindist]
[miriel025:62329] mca:rmaps:select: checking available component ppr
[miriel025:62329] mca:rmaps:select: Querying component [ppr]
[miriel025:62329] mca:rmaps:select: checking available component rank_file
[miriel025:62329] mca:rmaps:select: Querying component [rank_file]
[miriel026:165125] [[29888,0],2] rmaps:base set policy with NULL device NONNULL
[miriel026:165125] mca:rmaps:select: checking available component mindist
[miriel026:165125] mca:rmaps:select: Querying component [mindist]
[miriel026:165125] mca:rmaps:select: checking available component ppr
[miriel026:165125] mca:rmaps:select: Querying component [ppr]
[miriel026:165125] mca:rmaps:select: checking available component rank_file
[miriel026:165125] mca:rmaps:select: Querying component [rank_file]
[miriel026:165125] mca:rmaps:select: checking available component resilient
[miriel026:165125] mca:rmaps:select: Querying component [resilient]
[miriel026:165125] mca:rmaps:select: checking available component round_robin
[miriel026:165125] mca:rmaps:select: Querying component [round_robin]
[miriel026:165125] mca:rmaps:select: checking available component seq
[miriel026:165125] mca:rmaps:select: Querying component [seq]
[miriel026:165125] [[29888,0],2]: Final mapper priorities
[miriel026:165125]   Mapper: ppr Priority: 90
[devel11:17550] mca:rmaps: mapping job [29888,1]
[devel11:17550] mca:rmaps: setting mapping policies for job [29888,1] nprocs 48
[devel11:17550] mca:rmaps[169] mapping not set by user - using bynuma
[devel11:17550] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR NULL policy PPR NOTSET
[devel11:17550] [[29888,0],0] rmaps:seq called on job [29888,1]
[devel11:17550] mca:rmaps:seq: job [29888,1] not using seq mapper
[devel11:17550] mca:rmaps:resilient: cannot perform initial map of job [29888,1] - no fault groups
[devel11:17550] mca:rmaps:mindist: job [29888,1] not using mindist mapper
[devel11:17550] mca:rmaps:rr: mapping job [29888,1]
[devel11:17550] [[29888,0],0] Starting with 2 nodes in list
[devel11:17550] [[29888,0],0] Filtering thru apps
[miriel025:62329] mca:rmaps:select: checking available component resilient
[miriel025:62329] mca:rmaps:select: Querying component [resilient]
[miriel025:62329] mca:rmaps:select: checking available component round_robin
[miriel025:62329] mca:rmaps:select: Querying component [round_robin]
[miriel025:62329] mca:rmaps:select: checking available component seq
[miriel025:62329] mca:rmaps:select: Querying component [seq]
[miriel025:62329] [[29888,0],1]: Final mapper priorities
[miriel025:62329]   Mapper: ppr Priority: 90
[miriel025:62329]   Mapper: seq Priority: 60
[miriel025:62329]   Mapper: resilient Priority: 40
[miriel025:62329]   Mapper: mindist Priority: 20
[miriel025:62329]   Mapper: round_robin Priority: 10
[miriel025:62329]   Mapper: rank_file Priority: 0
[miriel025:62329] mca:rmaps: mapping job [29888,1]
[miriel025:62329] mca:rmaps: setting mapping policies for job [29888,1] nprocs 48
[miriel025:62329] mca:rmaps[169] mapping not set by user - using bynuma
[miriel025:62329] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR NULL policy PPR NOTSET
[miriel025:62329] [[29888,0],1] rmaps:seq called on job [29888,1]
[miriel025:62329] mca:rmaps:seq: job [29888,1] not using seq mapper
[miriel025:62329] mca:rmaps:resilient: cannot perform initial map of job [29888,1] - no fault groups
[miriel025:62329] mca:rmaps:mindist: job [29888,1] not using mindist mapper
[miriel025:62329] mca:rmaps:rr: mapping job [29888,1]
[miriel025:62329
Re: [OMPI devel] Problem with bind-to
Okay, so as far as OMPI is concerned, it correctly bound everyone! So how are you generating this output claiming it isn’t bound?

> On Apr 13, 2017, at 7:57 AM, Cyril Bordage wrote:
>
> devel11:17550] [[29888,0],0] rmaps:base set policy with NULL device NONNULL
> [devel11:17550] mca:rmaps:select: checking available component mindist
> [devel11:17550] mca:rmaps:select: Querying component [mindist]
> [devel11:17550] mca:rmaps:select: checking available component ppr
> [devel11:17550] mca:rmaps:select: Querying component [ppr]
> [devel11:17550] mca:rmaps:select: checking available component rank_file
> [devel11:17550] mca:rmaps:select: Querying component [rank_file]
> [devel11:17550] mca:rmaps:select: checking available component resilient
> [devel11:17550] mca:rmaps:select: Querying component [resilient]
> [devel11:17550] mca:rmaps:select: checking available component round_robin
> [devel11:17550] mca:rmaps:select: Querying component [round_robin]
> [devel11:17550] mca:rmaps:select: checking available component seq
> [devel11:17550] mca:rmaps:select: Querying component [seq]
> [devel11:17550] [[29888,0],0]: Final mapper priorities
> [devel11:17550] Mapper: ppr Priority: 90
> [devel11:17550] Mapper: seq Priority: 60
> [devel11:17550] Mapper: resilient Priority: 40
> [devel11:17550] Mapper: mindist Priority: 20
> [devel11:17550] Mapper: round_robin Priority: 10
> [devel11:17550] Mapper: rank_file Priority: 0
> [miriel025:62329] [[29888,0],1] rmaps:base set policy with NULL device NONNULL
> [miriel025:62329] mca:rmaps:select: checking available component mindist
> [miriel025:62329] mca:rmaps:select: Querying component [mindist]
> [miriel025:62329] mca:rmaps:select: checking available component ppr
> [miriel025:62329] mca:rmaps:select: Querying component [ppr]
> [miriel025:62329] mca:rmaps:select: checking available component rank_file
> [miriel025:62329] mca:rmaps:select: Querying component [rank_file]
> [miriel026:165125] [[29888,0],2] rmaps:base set policy with NULL device NONNULL
> [miriel026:165125] mca:rmaps:select: checking available component mindist
> [miriel026:165125] mca:rmaps:select: Querying component [mindist]
> [miriel026:165125] mca:rmaps:select: checking available component ppr
> [miriel026:165125] mca:rmaps:select: Querying component [ppr]
> [miriel026:165125] mca:rmaps:select: checking available component rank_file
> [miriel026:165125] mca:rmaps:select: Querying component [rank_file]
> [miriel026:165125] mca:rmaps:select: checking available component resilient
> [miriel026:165125] mca:rmaps:select: Querying component [resilient]
> [miriel026:165125] mca:rmaps:select: checking available component round_robin
> [miriel026:165125] mca:rmaps:select: Querying component [round_robin]
> [miriel026:165125] mca:rmaps:select: checking available component seq
> [miriel026:165125] mca:rmaps:select: Querying component [seq]
> [miriel026:165125] [[29888,0],2]: Final mapper priorities
> [miriel026:165125] Mapper: ppr Priority: 90
> [[devel11:17550] mca:rmaps: mapping job [29888,1]
> [devel11:17550] mca:rmaps: setting mapping policies for job [29888,1] nprocs 48
> [devel11:17550] mca:rmaps[169] mapping not set by user - using bynuma
> [devel11:17550] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR NULL policy PPR NOTSET
> [devel11:17550] [[29888,0],0] rmaps:seq called on job [29888,1]
> [devel11:17550] mca:rmaps:seq: job [29888,1] not using seq mapper
> [devel11:17550] mca:rmaps:resilient: cannot perform initial map of job [29888,1] - no fault groups
> [devel11:17550] mca:rmaps:mindist: job [29888,1] not using mindist mapper
> [devel11:17550] mca:rmaps:rr: mapping job [29888,1]
> [devel11:17550] [[29888,0],0] Starting with 2 nodes in list
> [devel11:17550] [[29888,0],0] Filtering thru apps
> [miriel025:62329] mca:rmaps:select: checking available component resilient
> [miriel025:62329] mca:rmaps:select: Querying component [resilient]
> [miriel025:62329] mca:rmaps:select: checking available component round_robin
> [miriel025:62329] mca:rmaps:select: Querying component [round_robin]
> [miriel025:62329] mca:rmaps:select: checking available component seq
> [miriel025:62329] mca:rmaps:select: Querying component [seq]
> [miriel025:62329] [[29888,0],1]: Final mapper priorities
> [miriel025:62329] Mapper: ppr Priority: 90
> [miriel025:62329] Mapper: seq Priority: 60
> [miriel025:62329] Mapper: resilient Priority: 40
> [miriel025:62329] Mapper: mindist Priority: 20
> [miriel025:62329] Mapper: round_robin Priority: 10
> [miriel025:62329] Mapper: rank_file Priority: 0
> [miriel025:62329] mca:rmaps: mapping job [29888,1]
> [miriel025:62329] mca:rmaps: setting mapping policies for job [29888,1] nprocs 48
> [miriel025:62329] mca:rmaps[169] mapping not set by user - using bynuma
> [miriel025:62329] mca:rmaps:ppr: job [29888,1] not using ppr mapper PPR NULL policy PPR NOTSET
> [miriel025:62329] [[29888,0],1] rma
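(For context, a verbose mapper trace like the one above comes from raising the rmaps framework verbosity; a sketch of such an invocation, with an illustrative verbosity level and application name:

  mpirun --mca rmaps_base_verbose 5 -bind-to core -report-bindings -np 48 ./my_app

The -np 48 matches the "nprocs 48" visible in the trace.)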
Re: [OMPI devel] Problem with bind-to
'-report-bindings' does that.
I used this option because the ranks did not seem to be bound (if I use
a rank file the performance is far better).

On 13/04/2017 at 17:24, r...@open-mpi.org wrote:
> Okay, so as far as OMPI is concerned, it correctly bound everyone! So how are
> you generating this output claiming it isn’t bound?
>
>> On Apr 13, 2017, at 7:57 AM, Cyril Bordage wrote:
>>
>> [... same rmaps verbose log as quoted in full earlier in the thread ...]
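(For reference, the rank-file approach mentioned above uses Open MPI's rankfile syntax; a minimal sketch, with hypothetical hostnames and a 2-rank layout:

  $ cat myrankfile
  rank 0=miriel025 slot=0
  rank 1=miriel025 slot=1
  $ mpirun -np 2 -rf myrankfile -report-bindings ./my_app

Here slot takes a core index; socket:core pairs such as slot=1:0 are also accepted.)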
Re: [OMPI devel] Problem with bind-to
All right, let’s replace rmaps_base_verbose with odls_base_verbose and see what that says.

> On Apr 13, 2017, at 8:34 AM, Cyril Bordage wrote:
>
> '-report-bindings' does that.
> I used this option because the ranks did not seem to be bound (if I use
> a rank file the performance is far better).
>
> On 13/04/2017 at 17:24, r...@open-mpi.org wrote:
>> Okay, so as far as OMPI is concerned, it correctly bound everyone! So how
>> are you generating this output claiming it isn’t bound?
>>
>>> On Apr 13, 2017, at 7:57 AM, Cyril Bordage wrote:
>>>
>>> [... same rmaps verbose log as quoted in full earlier in the thread ...]
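(A sketch of such an invocation, with an illustrative verbosity level and application name:

  mpirun --mca odls_base_verbose 5 -bind-to core -report-bindings -np 48 ./my_app

odls is the ORTE daemon-level launch framework, so its verbose output shows what each daemon actually did when it forked and bound the application processes.)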
Re: [OMPI devel] Problem with bind-to
Ralph,

I can easily reproduce the issue with two nodes and the latest master.
All commands are run on n1, which has the same topology (2 sockets * 8 cores each) as n2.

1) everything works

$ mpirun -np 16 -bind-to core --report-bindings true
[n1:29794] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[n1:29794] MCW rank 1 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[n1:29794] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[n1:29794] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]
[n1:29794] MCW rank 4 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.]
[n1:29794] MCW rank 5 bound to socket 1[core 10[hwt 0]]: [./././././././.][././B/././././.]
[n1:29794] MCW rank 6 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.]
[n1:29794] MCW rank 7 bound to socket 1[core 11[hwt 0]]: [./././././././.][./././B/./././.]
[n1:29794] MCW rank 8 bound to socket 0[core 4[hwt 0]]: [././././B/././.][./././././././.]
[n1:29794] MCW rank 9 bound to socket 1[core 12[hwt 0]]: [./././././././.][././././B/././.]
[n1:29794] MCW rank 10 bound to socket 0[core 5[hwt 0]]: [./././././B/./.][./././././././.]
[n1:29794] MCW rank 11 bound to socket 1[core 13[hwt 0]]: [./././././././.][./././././B/./.]
[n1:29794] MCW rank 12 bound to socket 0[core 6[hwt 0]]: [././././././B/.][./././././././.]
[n1:29794] MCW rank 13 bound to socket 1[core 14[hwt 0]]: [./././././././.][././././././B/.]
[n1:29794] MCW rank 14 bound to socket 0[core 7[hwt 0]]: [./././././././B][./././././././.]
[n1:29794] MCW rank 15 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]

$ mpirun -np 16 -bind-to core --host n1:16 --report-bindings true
[n1:29850] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[n1:29850] MCW rank 1 bound to socket 1[core 8[hwt 0]]: [./././././././.][B/././././././.]
[n1:29850] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [./B/./././././.][./././././././.]
[n1:29850] MCW rank 3 bound to socket 1[core 9[hwt 0]]: [./././././././.][./B/./././././.]
[n1:29850] MCW rank 4 bound to socket 0[core 2[hwt 0]]: [././B/././././.][./././././././.]
[n1:29850] MCW rank 5 bound to socket 1[core 10[hwt 0]]: [./././././././.][././B/././././.]
[n1:29850] MCW rank 6 bound to socket 0[core 3[hwt 0]]: [./././B/./././.][./././././././.]
[n1:29850] MCW rank 7 bound to socket 1[core 11[hwt 0]]: [./././././././.][./././B/./././.]
[n1:29850] MCW rank 8 bound to socket 0[core 4[hwt 0]]: [././././B/././.][./././././././.]
[n1:29850] MCW rank 9 bound to socket 1[core 12[hwt 0]]: [./././././././.][././././B/././.]
[n1:29850] MCW rank 10 bound to socket 0[core 5[hwt 0]]: [./././././B/./.][./././././././.]
[n1:29850] MCW rank 11 bound to socket 1[core 13[hwt 0]]: [./././././././.][./././././B/./.]
[n1:29850] MCW rank 12 bound to socket 0[core 6[hwt 0]]: [././././././B/.][./././././././.]
[n1:29850] MCW rank 13 bound to socket 1[core 14[hwt 0]]: [./././././././.][././././././B/.]
[n1:29850] MCW rank 14 bound to socket 0[core 7[hwt 0]]: [./././././././B][./././././././.]
[n1:29850] MCW rank 15 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B]

2) with another node

$ mpirun -np 16 -bind-to core --host n2:16 --report-bindings true
/* no output with a non MPI app !*/

$ mpirun -np 16 -bind-to core --host n2:16 --report-bindings ./hello_c
[n2:52851] MCW rank 0 not bound
[n2:52852] MCW rank 1 not bound
[n2:52853] MCW rank 2 not bound
[n2:52854] MCW rank 3 not bound
[n2:52855] MCW rank 4 not bound
[n2:52856] MCW rank 5 not bound
[n2:52857] MCW rank 6 not bound
[n2:52859] MCW rank 7 not bound
[n2:52861] MCW rank 8 not bound
[n2:52864] MCW rank 9 not bound
[n2:52866] MCW rank 10 not bound
[n2:52869] MCW rank 11 not bound
[n2:52877] MCW rank 15 not bound
[n2:52871] MCW rank 12 not bound
[n2:52873] MCW rank 13 not bound
[n2:52876] MCW rank 14 not bound
Hello, world, I am 0 of 16, (Open MPI v4.0.0a1, package: Open MPI gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, Unreleased developer copy, 165)
Hello, world, I am 1 of 16, (Open MPI v4.0.0a1, package: Open MPI gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, Unreleased developer copy, 165)
Hello, world, I am 2 of 16, (Open MPI v4.0.0a1, package: Open MPI gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, Unreleased developer copy, 165)
Hello, world, I am 3 of 16, (Open MPI v4.0.0a1, package: Open MPI gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, Unreleased developer copy, 165)
Hello, world, I am 4 of 16, (Open MPI v4.0.0a1, package: Open MPI gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-g3202de8, Unreleased developer copy, 165)
Hello, world, I am 5 of 16, (Open MPI v4.0.0a1, package: Open MPI gilles@n Distribution, ident: 4.0.0a1, repo rev: v2.x-dev-4028-
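(The ./hello_c above is presumably the stock example from Open MPI's examples/ directory; a minimal stand-in for anyone reproducing this without the source tree, since all it needs to do is report rank and size:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);               /* initialize the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* my rank within the job */
      MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of ranks */
      printf("Hello, world, I am %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }

Build it with "mpicc hello_c.c -o hello_c" and run it with the mpirun lines above.)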