You are welcome to raise the question of default mapping behavior on master yet 
again, but please do so on a separate thread so we can make sense of it.

Note that I will not be making further modifications to that behavior, so if 
someone feels strongly that it should change, please go ahead and change it. 
I’m tired of chasing this tiger’s tail.


> On May 16, 2016, at 5:59 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Thanks Nathan,
> 
> 
> Sorry for the confusion; what I observed was a consequence of something else 
> ...
> 
> mpirun --host n0,n1 -np 4 a.out
> 
> /* n0 and n1 have 16 cores each */
> runs 4 instances of a.out on n0 (and nothing on n1).
> 
> If I run with -np 32, then 16 tasks run on each node.
> 
> 
> With v2.x, the --oversubscribe option is needed, and 2 tasks run on each node.
> 
> 
> Is this really the intended behavior on master?
> I mean, I am fine with detecting the number of slots automatically so that 
> --oversubscribe is not needed up to 32 tasks. My point is: shouldn't we have 
> a different mapping policy in this case? For example, allocate the tasks 
> round robin, or allocate <total number of slots> / <number of slots per node> 
> consecutive tasks per node?
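(As an aside, and assuming that master still accepts the usual mapping options, an explicit policy such as "mpirun --host n0,n1 -np 4 --map-by node a.out" should already place the tasks round robin by node; the open question above is only what the default mapping should be.)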
> 
> Cheers,
> 
> Gilles
> 
> On 5/17/2016 1:13 AM, Nathan Hjelm wrote:
>> add_procs is always called at least once. This is how we set up shared
>> memory communication. It will then be invoked on-demand for non-local
>> peers with the reachability argument set to NULL (because the bitmask
>> doesn't provide any benefit when adding only 1 peer).
>> 
>> -Nathan
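A minimal, self-contained sketch of the calling pattern described above, using toy stand-in types rather than the real Open MPI structures (opal_bitmap_t and the BTL module interface); the only point it illustrates is that an add_procs implementation must tolerate a NULL reachability bitmap when it is invoked on demand for a single peer.

    /* Toy stand-ins -- not the real types from opal/class/opal_bitmap.h or
     * opal/mca/btl/btl.h; this only illustrates the calling pattern. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct { unsigned long bits; } toy_bitmap_t;

    /* BTL-style add_procs: 'reachable' may legitimately be NULL when a single
     * peer is added on demand. */
    static void toy_add_procs(size_t nprocs, const int *peer_ranks,
                              toy_bitmap_t *reachable)
    {
        for (size_t i = 0; i < nprocs; ++i) {
            int can_reach = 1;                 /* a real BTL would probe the network here */
            if (can_reach && NULL != reachable) {
                reachable->bits |= 1UL << i;   /* mark the peer only if a bitmap was passed */
            }
            printf("add_procs: peer rank %d\n", peer_ranks[i]);
        }
    }

    int main(void)
    {
        int world[4] = { 0, 1, 2, 3 };
        toy_bitmap_t reach = { 0 };

        toy_add_procs(4, world, &reach);    /* init-time call: bitmap provided */
        printf("reachable mask after init: 0x%lx\n", reach.bits);

        int late_peer = 7;
        toy_add_procs(1, &late_peer, NULL); /* on-demand call for one peer: no bitmap */
        return 0;
    }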
>> 
>> On Tue, May 17, 2016 at 12:00:38AM +0900, Gilles Gouaillardet wrote:
>>>    Jeff,
>>>    this is not what I observed
>>>    (tcp btl, 2 to 4 nodes with one task per node, cutoff=0):
>>>    the add_procs of the tcp btl is invoked once with the 4 tasks.
>>>    I checked the sources and found that cutoff only controls whether the modex
>>>    is invoked once for all procs at init, or on demand.
>>>    Cheers,
>>>    Gilles
>>> 
>>>    On Monday, May 16, 2016, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> 
>>>      We changed the way BTL add_procs is invoked on master and v2.x for
>>>      scalability reasons.
>>> 
>>>      In short: add_procs is only invoked the first time you talk to a given
>>>      peer.  The cutoff switch is an override to that -- if the size of
>>>      COMM_WORLD is less than the cutoff, we revert to the old behavior of
>>>      calling add_procs for all procs.
>>> 
>>>      As for why one BTL would be chosen over another, be sure to look at not
>>>      only the priority of the component/module, but also the exclusivity
>>>      level.  In short, only BTLs with the same exclusivity level will be
>>>      considered (e.g., this is how we exclude TCP when using HPC-class
>>>      networks), and then the BTL modules with the highest priority will be
>>>      used for a given peer.
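A rough, self-contained sketch of the selection rule as stated above (drop candidates at a lower exclusivity level, then take the highest priority), using a hypothetical simplified struct in place of the real BTL module bookkeeping; the numbers below are illustrative only, not Open MPI's actual defaults.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-in for a BTL module's selection attributes. */
    typedef struct {
        const char *name;
        int exclusivity;   /* HPC-class networks advertise a higher level than TCP */
        int priority;
        int reaches_peer;  /* outcome of the add_procs reachability check */
    } toy_btl_t;

    /* Pick a BTL for one peer: highest exclusivity wins; priority breaks ties. */
    static const toy_btl_t *toy_select(const toy_btl_t *btls, size_t n)
    {
        const toy_btl_t *best = NULL;
        for (size_t i = 0; i < n; ++i) {
            if (!btls[i].reaches_peer) {
                continue;                      /* add_procs never claimed this peer */
            }
            if (NULL == best ||
                btls[i].exclusivity > best->exclusivity ||
                (btls[i].exclusivity == best->exclusivity &&
                 btls[i].priority > best->priority)) {
                best = &btls[i];
            }
        }
        return best;
    }

    int main(void)
    {
        toy_btl_t btls[] = {
            { "tcp",    10, 20, 1 },
            { "openib", 50, 30, 1 },   /* higher exclusivity, so tcp is never used */
            { "self",   90, 99, 0 },   /* does not reach a remote peer */
        };
        const toy_btl_t *chosen = toy_select(btls, sizeof btls / sizeof btls[0]);
        printf("chosen for this peer: %s\n", chosen ? chosen->name : "(none)");
        return 0;
    }

The practical consequence for a new BTL is visible in the sketch: if its add_procs never reports the peer as reachable, or it advertises a lower exclusivity level than another available BTL, it will never be chosen for that peer.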
>>> 
>>>      > On May 16, 2016, at 7:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>      >
>>>      > It seems I misunderstood some things ...
>>>      >
>>>      > add_procs is always invoked, regardless of the cutoff value.
>>>      > cutoff only controls whether process info is retrieved via the modex
>>>      > on demand or at init time.
>>>      >
>>>      > Please, someone correct me and/or elaborate if needed.
>>>      >
>>>      > Cheers,
>>>      >
>>>      > Gilles
>>>      >
>>>      > On Monday, May 16, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>      > I cannot reproduce this behavior.
>>>      >
>>>      > Note that mca_btl_tcp_add_procs() is invoked once per tcp btl module
>>>      > (i.e., once per physical NIC), so you might want to explicitly select one NIC:
>>>      >
>>>      > mpirun --mca btl_tcp_if_include xxx ...
>>>      >
>>>      > My printf output is the same regardless of the mpi_add_procs_cutoff value.
>>>      >
>>>      >
>>>      > Cheers,
>>>      >
>>>      >
>>>      > Gilles
>>>      > On 5/16/2016 12:22 AM, dpchoudh . wrote:
>>>      >> Sorry, I accidentally pressed 'Send' before I was done writing the last
>>>      >> mail. What I wanted to ask was: what is the mpi_add_procs_cutoff
>>>      >> parameter, and why does setting it seem to make a difference in the code
>>>      >> path but not in the end result of the program? How would it help me
>>>      >> debug my problem?
>>>      >>
>>>      >> Thank you
>>>      >> Durga
>>>      >>
>>>      >>
>>>      >> On Sun, May 15, 2016 at 11:17 AM, dpchoudh . <dpcho...@gmail.com> wrote:
>>>      >> Hello Gilles
>>>      >>
>>>      >> Setting -mca mpi_add_procs_cutoff 1024 indeed makes a difference to
>>>      the output, as follows:
>>>      >>
>>>      >> With -mca mpi_add_procs_cutoff 1024:
>>>      >> reachable =     0x1
>>>      >> (Note that add_procs was called once and the value of 'reachable' is correct.)
>>>      >>
>>>      >> Without -mca mpi_add_procs_cutoff 1024
>>>      >> reachable =     0x0
>>>      >> reachable = NULL
>>>      >> reachable = NULL
>>>      >> (Note that add_procs() was called three times and the value of 'reachable' seems wrong.)
>>>      >>
>>>      >> The program does run correctly in either case. The program listing is
>>>      >> as below (note that I have removed output from the program itself in the
>>>      >> above reporting).
>>>      >>
>>>      >> The code that prints 'reachable' is as follows:
>>>      >>
>>>      >> if (reachable == NULL)
>>>      >>     printf("reachable = NULL\n");
>>>      >> else
>>>      >> {
>>>      >>     int i;
>>>      >>     printf("reachable = ");
>>>      >>     /* print each word of the bitmap in hex; %llu would print decimal
>>>      >>        digits after the "0x" prefix */
>>>      >>     for (i = 0; i < reachable->array_size; i++)
>>>      >>         printf("\t0x%llx", (unsigned long long) reachable->bitmap[i]);
>>>      >>     printf("\n\n");
>>>      >> }
>>>      >> return OPAL_SUCCESS;
>>>      >>
>>>      >> And the code for the test program is as follows:
>>>      >>
>>>      >> #include <mpi.h>
>>>      >> #include <stdio.h>
>>>      >> #include <string.h>
>>>      >> #include <stdlib.h>
>>>      >>
>>>      >> int main(int argc, char *argv[])
>>>      >> {
>>>      >>     int world_size, world_rank, name_len;
>>>      >>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>>>      >>
>>>      >>     MPI_Init(&argc, &argv);
>>>      >>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>>>      >>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>>>      >>     MPI_Get_processor_name(hostname, &name_len);
>>>      >>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>>>      >>            hostname, world_rank, world_size);
>>>      >>     if (world_rank == 1)
>>>      >>     {
>>>      >>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>      >>         printf("%s received %s, rank %d\n", hostname, buf, world_rank);
>>>      >>     }
>>>      >>     else
>>>      >>     {
>>>      >>         strcpy(buf, "haha!");
>>>      >>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>>>      >>         printf("%s sent %s, rank %d\n", hostname, buf, world_rank);
>>>      >>     }
>>>      >>     MPI_Barrier(MPI_COMM_WORLD);
>>>      >>     MPI_Finalize();
>>>      >>     return 0;
>>>      >> }
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >> On Sun, May 15, 2016 at 10:49 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>      >> At first glance, that seems a bit odd...
>>>      >> Are you sure you are printing the reachable bitmap correctly?
>>>      >> I would suggest you add some instrumentation to understand what happens
>>>      >> (e.g., a printf before opal_bitmap_set_bit() and at the other places
>>>      >> that could prevent it from being called).
>>>      >>
>>>      >> One more thing ... master's default behavior is now
>>>      >> mpirun --mca mpi_add_procs_cutoff 0 ...
>>>      >> You might want to try
>>>      >> mpirun --mca mpi_add_procs_cutoff 1024 ...
>>>      >> and see if things make more sense.
>>>      >> If it helps, and IIRC, there is a parameter by which a btl can report
>>>      >> that it does not support cutoff.
>>>      >>
>>>      >>
>>>      >> Cheers,
>>>      >>
>>>      >> Gilles
>>>      >>
>>>      >> On Sunday, May 15, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>>>      >> Hello Gilles
>>>      >>
>>>      >> Thanks for jumping in to help again. Actually, I had already tried
>>>      some of your suggestions before asking for help.
>>>      >>
>>>      >> I have several interconnects that can run both the openib and tcp BTLs.
>>>      >> To simplify things, I explicitly selected TCP:
>>>      >>
>>>      >> mpirun -np 2 -hostfile ~/hostfile -mca pml ob1 -mca btl self,tcp ./mpitest
>>>      >>
>>>      >> where mpitest is a small program that does MPI_Send()/MPI_Recv() on a
>>>      >> small string and then does an MPI_Barrier(). The program does work as
>>>      >> expected.
>>>      >>
>>>      >> I put a printf on the last line of mca_btl_tcp_add_procs() to print the
>>>      >> value of 'reachable'. What I saw was that the value was always 0 when it
>>>      >> was invoked for Send()/Recv(), and the pointer itself was NULL when it
>>>      >> was invoked for Barrier().
>>>      >>
>>>      >> Next I looked at pml_ob1_add_procs(), where the call chain starts, and
>>>      >> found that it initializes an opal_bitmap_t 'reachable' and passes it down
>>>      >> the call chain, but the resulting value is not used later in the code
>>>      >> (the memory is simply freed).
>>>      >>
>>>      >> That, coupled with the fact that I am trying to imitate what the other
>>>      >> BTL implementations are doing and yet my BTL is not being picked up in
>>>      >> mca_bml_r2_endpoint_add_btl(), left me puzzled. Please note that the
>>>      >> interconnect I am developing for is on a different cluster than the one
>>>      >> where I ran the above TCP BTL test.
>>>      >>
>>>      >> Thanks again
>>>      >> Durga
>>>      >>
>>>      >>
>>>      >> On Sun, May 15, 2016 at 10:20 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>      >> Did you check the add_procs callbacks
>>>      >> (e.g., mca_btl_tcp_add_procs() for the tcp btl)?
>>>      >> This is where the reachable bitmap is set, and I guess this is what
>>>      >> you are looking for.
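A minimal sketch of that flow with toy stand-ins (not the real opal_bitmap_t or the bml r2 code): if add_procs never marks a peer in the bitmap, the caller never attaches the BTL to that peer's endpoint, which matches the symptom of a BTL not being picked up.

    #include <stddef.h>
    #include <stdio.h>

    typedef struct { unsigned long bits; } toy_bitmap_t;   /* stand-in for opal_bitmap_t */

    /* BTL side: claim the peers this transport can actually reach. */
    static void toy_add_procs(size_t nprocs, toy_bitmap_t *reachable)
    {
        for (size_t i = 0; i < nprocs; ++i) {
            int can_reach = (i != 2);            /* pretend peer 2 is unreachable */
            if (can_reach && NULL != reachable) {
                reachable->bits |= 1UL << i;     /* analogous to opal_bitmap_set_bit() */
            }
        }
    }

    /* Caller side: only wire the BTL to endpoints whose bit was set. */
    int main(void)
    {
        size_t nprocs = 4;
        toy_bitmap_t reachable = { 0 };

        toy_add_procs(nprocs, &reachable);
        for (size_t i = 0; i < nprocs; ++i) {
            if (reachable.bits & (1UL << i)) {
                printf("peer %zu: BTL added to the endpoint\n", i);
            } else {
                printf("peer %zu: BTL skipped\n", i);
            }
        }
        return 0;
    }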
>>>      >>
>>>      >> Keep in mind that if several btls can be used, the one with the highest
>>>      >> exclusivity is used (e.g., tcp is never used if openib is available).
>>>      >> You can simply force your btl plus self, and the ob1 pml, so you do not
>>>      >> have to worry about other btls' exclusivity.
>>>      >>
>>>      >> Cheers,
>>>      >>
>>>      >> Gilles
>>>      >>
>>>      >>
>>>      >> On Sunday, May 15, 2016, dpchoudh . <dpcho...@gmail.com> wrote:
>>>      >> Hello all
>>>      >>
>>>      >> I have been struggling with this issue for a while and figured it
>>>      >> might be a good idea to ask for help.
>>>      >>
>>>      >> Where (in the code path) is the connectivity map created?
>>>      >>
>>>      >> I can see that it is *used* in mca_bml_r2_endpoint_add_btl(), but
>>>      >> obviously I am not setting it up right, because this routine is not
>>>      >> finding the BTL corresponding to my interconnect.
>>>      >>
>>>      >> Thanks in advance
>>>      >> Durga
>>>      >>
>>>      >>
>>> 
>>>      --
>>>      Jeff Squyres
>>>      jsquy...@cisco.com
>>>      For corporate legal information go to:
>>>      http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
