George,

i dug into the code, and it is no longer working as expected :-(


here is a snippet of my test code :


   putenv("OMPI_MCA_coll_tuned_bcast_algorithm=1");

   putenv("OMPI_MCA_coll_tuned_bcast_algorithm=1");

   MPI_Init(&argc,&argv);

   MPI_Bcast(out,MAXLEN,MPI_INT,root,MPI_COMM_WORLD);

   putenv("OMPI_MCA_coll_tuned_bcast_algorithm=2");
   MPI_Comm_dup(MPI_COMM_WORLD, &comm);

   MPI_Bcast(out,MAXLEN,MPI_INT,root,comm);


This works just fine with 1.6 (i.e. the first MPI_Bcast uses algo 1 and the second MPI_Bcast uses algo 2),
but starting from v1.8, only algo 1 is used.


my guess is that starting from v1.8, MCA params are "cached", which means the OMPI_MCA_coll_tuned_bcast_algorithm environment variable is no longer re-evaluated when MPI_Comm_dup() indirectly invokes ompi_coll_tuned_forced_getvalues().
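
To make that guess concrete, here is a toy model in plain C (no MPI involved). It contrasts re-reading the environment variable at each communicator creation with reusing a value captured once at init time. All toy_* names and TOY_ENV_NAME are made up for illustration and are not Open MPI functions; whether v1.8 really behaves like the cached variant is exactly the open question.

```c
#define _POSIX_C_SOURCE 200112L  /* lets callers use setenv() */
#include <assert.h>
#include <stdlib.h>

/* Toy model only: none of the toy_* names exist in Open MPI. */
#define TOY_ENV_NAME "OMPI_MCA_coll_tuned_bcast_algorithm"

static int toy_cached_algorithm = 0;  /* value captured once at "init" */
static int toy_initialized = 0;

/* Parse the environment variable; 0 ("ignore") when it is unset. */
static int toy_read_env(void) {
    const char *s = getenv(TOY_ENV_NAME);
    return (s != NULL) ? atoi(s) : 0;
}

/* Stand-in for MPI_Init(): the parameter is read from the environment once. */
void toy_init(void) {
    toy_cached_algorithm = toy_read_env();
    toy_initialized = 1;
}

/* Assumed 1.6-like behaviour: re-read the environment at each
 * "communicator creation". */
int toy_comm_create_reread(void) {
    return toy_read_env();
}

/* Assumed v1.8-like behaviour: reuse the value cached at init time. */
int toy_comm_create_cached(void) {
    assert(toy_initialized);  /* the cache is only meaningful after toy_init() */
    return toy_cached_algorithm;
}
```

If the variable is set to 1 before toy_init() and changed to 2 afterwards, toy_comm_create_reread() returns 2 while toy_comm_create_cached() still returns 1, which mirrors the 1.6 and v1.8 observations described above.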

Nathan, can you please confirm this is the intended behavior ?


If yes, how should we move forward ?
- force ompi_coll_tuned_forced_getvalues() to re-evaluate MCA params (assuming there is an existing mechanism for that) ?
- use the MPI_T interface ?

Cheers,

Gilles


On 6/12/2016 7:55 PM, George Bosilca wrote:
This is my understanding of the MPI standard. Obviously, some combinations of op and datatype are practically associative and commutative, in which case the pattern you describe would be legal. Technically, we could add an MCA parameter to allow users to specify that the op should be considered associative (especially for operations on floating point numbers), in which case we are free to choose any communication pattern.
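
The floating point case can be illustrated with a small sketch. This is plain C, not Open MPI code: reduce_in_order is a hypothetical helper that sums contributions in a given arrival order, roughly what a loop driven by MPI_Waitany() would do. Because floating point addition is not associative, two arrival orders of the same values may produce different sums.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum `vals` in the order given by `order`, mimicking a
 * reduction that combines contributions as they happen to arrive. */
double reduce_in_order(const double *vals, const size_t *order, size_t n) {
    assert(vals != NULL && order != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += vals[order[i]];
    }
    return acc;
}
```

With the values {1e16, 1.0, -1e16} on a typical IEEE 754 platform, the order 0, 1, 2 gives 0.0 (the 1.0 is absorbed when added to 1e16), whereas the order 0, 2, 1 gives 1.0.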

  George.

PS: for the dynamic selection use ompi_coll_tuned_forced_getvalues as a starting point.

On Fri, Jun 10, 2016 at 10:23 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Thanks George, i will try to find it.


    for the second part, and if i read between the lines, that means a
    collective operation cannot have non deterministic paths, such as

    for (...) MPI_Irecv();

    for (...) { MPI_Waitany(); ompi_op_reduce(); }

    is that *really* prohibited ? i thought it was "only" *strongly
    discouraged* ...


    Cheers,


    Gilles


    On 6/10/2016 5:10 PM, George Bosilca wrote:
    There is a mechanism to select the collective algorithm upon
    communicator creation. It does not use MPI_T (as that
    mechanism didn't exist when tuned was designed), but it behaves in
    a similar manner. You simply update an MCA param (I do not
    remember the name and I'm not close to my computer), and the next
    communicator creation will automatically adapt its behavior.

    That being said, it would be illegal in MPI lingo to change the
    collective algorithm on an existing communicator, especially for
    reduction operations. It is clearly specified that if you execute
    a collective multiple times between the same processes with the
    same values and in the context of the same run, you should get the
    exact same result.

    George.

    On Friday, June 10, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

        Folks,


        i was thinking of using the MPI_T interface in order to try
        *all* the available algorithms of a given collective within
        the same MPI test program.


        That cannot currently be done because the mca parameter is
        registered with

        {flag=0, scope=MCA_BASE_VAR_SCOPE_READONLY}


        i made a proof of concept by changing this to

        {flag=MCA_BASE_VAR_FLAG_SETTABLE, scope=MCA_BASE_VAR_SCOPE_ALL}

        (see the inline patch below)


        strictly speaking, it does not work, since the updated values
        are only used the next time a communicator is created.

        for example, changing a value on MPI_COMM_WORLD has no effect,

        but changing a value, MPI_Comm_dup(MPI_COMM_WORLD) and using
        the dup'ed communicator works.

        btw, i guess any communicator could be used to set the value.


        as far as i am concerned, that is good enough for me


        any objections to making some coll/tuned parameters writable by
        MPI_T ?

        if not, did i implement it correctly ?


        Cheers,


        Gilles


        here is the function that sets a value :

        int setValue_int_comm(int index, MPI_Comm comm, int *val) {
          int err,count;
          MPI_T_cvar_handle handle;
          /* This example assumes that the variable index */
          /* can be bound to a communicator */
          err=MPI_T_cvar_handle_alloc(index,&comm,&handle,&count);
          if (err!=MPI_SUCCESS) return err;
          /* The following assumes that the variable is */
          /* represented by a single integer */
          err=MPI_T_cvar_write(handle,val);
          if (err!=MPI_SUCCESS) return err;
          err=MPI_T_cvar_handle_free(&handle);
          return err;
        }

        and here is the proof of concept

        diff --git a/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c b/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c
        index 81345b2..31ca217 100644
        --- a/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c
        +++ b/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c
        @@ -76,9 +76,9 @@ int ompi_coll_tuned_bcast_intra_check_forced_init (coll_tuned_force_algorithm_mc
         mca_base_component_var_register(&mca_coll_tuned_component.super.collm_version,
         "bcast_algorithm",
         "Which bcast algorithm is used. Can be locked down to choice of: 0 ignore, 1 basic linear, 2 chain, 3: pipeline, 4: split binary tree, 5: binary tree, 6: binomial tree.",
        - MCA_BASE_VAR_TYPE_INT, new_enum, 0, 0,
        + MCA_BASE_VAR_TYPE_INT, new_enum, 0, MCA_BASE_VAR_FLAG_SETTABLE,
         OPAL_INFO_LVL_5,
        - MCA_BASE_VAR_SCOPE_READONLY,
        + MCA_BASE_VAR_SCOPE_ALL,
         &coll_tuned_bcast_forced_algorithm);
              OBJ_RELEASE(new_enum);
              if (mca_param_indices->algorithm_param_index < 0) {
        diff --git a/ompi/mca/coll/tuned/coll_tuned_component.c b/ompi/mca/coll/tuned/coll_tuned_component.c
        index 9756359..ea389fd 100644
        --- a/ompi/mca/coll/tuned/coll_tuned_component.c
        +++ b/ompi/mca/coll/tuned/coll_tuned_component.c
        @@ -164,9 +164,9 @@ static int tuned_register(void)
              (void) mca_base_component_var_register(&mca_coll_tuned_component.super.collm_version,
         "use_dynamic_rules",
         "Switch used to decide if we use static (compiled/if statements) or dynamic (built at runtime) decision function rules",
        - MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
        + MCA_BASE_VAR_TYPE_BOOL, NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE,
         OPAL_INFO_LVL_6,
        - MCA_BASE_VAR_SCOPE_READONLY,
        + MCA_BASE_VAR_SCOPE_ALL,
         &ompi_coll_tuned_use_dynamic_rules);

              ompi_coll_tuned_dynamic_rules_filename = NULL;

        _______________________________________________
        devel mailing list
        de...@open-mpi.org
        Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
        Link to this post:
        http://www.open-mpi.org/community/lists/devel/2016/06/19094.php










_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
