[OMPI devel] openib btl and 10 GbE port

2016-06-12 Thread Gilles Gouaillardet

Folks,


This is a follow-up to a user report available at
http://www.open-mpi.org/community/lists/users/2016/06/29423.php



Basically, one node has a dual-port ConnectX-3 card, with one IB port and
one 10 GbE port.


When diagnosing some RDMA errors (not the point of this email), the user
was surprised to find that both the IB and the 10 GbE ports were used. Both
ports appear to be managed by the openib btl, so from an Open MPI point of
view, I can only guess they have the same btl exclusivity (1024 by default).


Is this intended behavior?

Or should the exclusivity of the 10 GbE port be lower than that of the IB
port *by default*?


/* so only the IB port is used by default */
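
For reference, here is my (possibly naive) mental model of how exclusivity
drives the selection, as an illustrative C sketch only, not the actual
bml/r2 code; "btl_t", "reaches_peer" and "select_btls" are made-up names:

#include <stddef.h>

/* Illustrative sketch only -- NOT the actual Open MPI bml/r2 code.
 * For a given peer, keep every reachable btl whose exclusivity matches
 * the highest exclusivity among the reachable btls.  Since both the IB
 * and the 10 GbE port sit at the openib default of 1024, both are kept. */
typedef struct {
    const char *name;
    unsigned int exclusivity;
} btl_t;

static size_t select_btls(const btl_t *btls, size_t n,
                          int (*reaches_peer)(const btl_t *),
                          const btl_t **selected)
{
    unsigned int max_excl = 0;
    size_t n_selected = 0;
    /* first pass: find the highest exclusivity among reachable btls */
    for (size_t i = 0; i < n; i++)
        if (reaches_peer(&btls[i]) && btls[i].exclusivity > max_excl)
            max_excl = btls[i].exclusivity;
    /* second pass: keep every reachable btl at that level */
    for (size_t i = 0; i < n; i++)
        if (reaches_peer(&btls[i]) && btls[i].exclusivity == max_excl)
            selected[n_selected++] = &btls[i];
    return n_selected;
}

With this model, lowering the default exclusivity of the 10 GbE port below
1024 would indeed leave only the IB port selected.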


Cheers,


Gilles



Re: [OMPI devel] MPI_T and coll/tuned module

2016-06-12 Thread George Bosilca
This is my understanding of the MPI standard. Obviously, some combinations of
op and datatype are practically associative and commutative, in which case
the pattern you describe would be legal. Technically, we could add an MCA
parameter to let users specify that the op should be considered
associative (especially for operations on floating point numbers), in which
case we are free to choose any communication pattern.
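
As a rough sketch of what such a parameter could look like (hedged: the
"reduce_assume_associative" name and the storage variable are invented for
illustration, following the registration pattern already used in coll/tuned):

/* Hedged sketch: a hypothetical boolean MCA parameter telling coll/tuned
 * it may treat the reduction op as associative.  Assumes the usual
 * coll/tuned headers are included; the parameter name and the storage
 * variable are invented for this example. */
static bool mca_coll_tuned_assume_associative = false;

static void register_assume_associative(void)
{
    (void) mca_base_component_var_register(&mca_coll_tuned_component.super.collm_version,
                                           "reduce_assume_associative",
                                           "Treat the reduction op as associative, "
                                           "allowing any communication pattern (may "
                                           "change floating point results across runs).",
                                           MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                           OPAL_INFO_LVL_5,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_coll_tuned_assume_associative);
}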

  George.

PS: for the dynamic selection use ompi_coll_tuned_forced_getvalues as a
starting point.

On Fri, Jun 10, 2016 at 10:23 AM, Gilles Gouaillardet wrote:

> Thanks George, I will try to find it.
>
>
> For the second part, and if I read between the lines, that means a
> collective operation cannot have non-deterministic paths, such as
>
> for (...) MPI_Irecv();
>
> for (...) { MPI_Waitany(); ompi_op_reduce(); }
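>
> i.e., fleshed out into a self-contained sketch (a plain integer sum
> stands in for ompi_op_reduce(); with floating point data, the completion
> order would change the result):
>
> #include <stdlib.h>
> #include <mpi.h>
>
> /* rank 0 posts one receive per peer and folds contributions in
>  * *completion order*, which can differ from run to run; the other
>  * ranks are assumed to MPI_Send(&myval, 1, MPI_INT, 0, 0, comm) */
> static int root_reduce_sum(MPI_Comm comm, int myval, int *result) {
>     int size, i, idx, err;
>     MPI_Comm_size(comm, &size);
>     MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
>     int *contrib = malloc((size - 1) * sizeof(int));
>     for (i = 0; i < size - 1; i++) {
>         err = MPI_Irecv(&contrib[i], 1, MPI_INT, i + 1, 0, comm, &reqs[i]);
>         if (MPI_SUCCESS != err) return err;
>     }
>     *result = myval;
>     for (i = 0; i < size - 1; i++) {
>         err = MPI_Waitany(size - 1, reqs, &idx, MPI_STATUS_IGNORE);
>         if (MPI_SUCCESS != err) return err;
>         *result += contrib[idx];   /* folded in completion order */
>     }
>     free(reqs); free(contrib);
>     return MPI_SUCCESS;
> }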
>
> Is that *really* prohibited? I thought it was "only" *strongly
> discouraged*...
>
>
> Cheers,
>
>
> Gilles
>
> On 6/10/2016 5:10 PM, George Bosilca wrote:
>
> There is a mechanism to select the collective algorithm upon communicator
> creation. It is not using MPI_T (as that mechanism didn't exist when tuned
> was conceived), but it behaves in a similar manner. You simply update an
> MCA param (I do not remember the name and I'm not close to my computer),
> and the next communicator creation will automatically adapt its behavior.
>
> That being said, it would be illegal in MPI lingo to change the collective
> algorithm on an existing communicator, especially for reduction
> operations. It is clearly specified that if you execute a collective
> multiple times between the same processes, with the same values, and in
> the context of the same run, you should get the exact same result.
>
> George.
>
> On Friday, June 10, 2016, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Folks,
>>
>>
>> I was thinking of using the MPI_T interface in order to try, within the
>> same MPI test program, *all* the available algorithms of a given collective.
>>
>>
>> That cannot currently be done, because the MCA parameter is registered with
>>
>> {flag=0, scope=MCA_BASE_VAR_SCOPE_READONLY}
>>
>>
>> I made a proof of concept by changing this to
>>
>> {flag=MCA_BASE_VAR_FLAG_SETTABLE, scope=MCA_BASE_VAR_SCOPE_ALL}
>>
>> (see the inline patch below)
>>
>>
>> Strictly speaking, it does not fully work, since an updated value is only
>> used the next time a communicator is created.
>>
>> For example, changing a value on MPI_COMM_WORLD has no effect, but
>> changing a value, then MPI_Comm_dup(MPI_COMM_WORLD), and using the
>> dup'ed communicator works.
>>
>> Btw, I guess any communicator could be used to set the value.
>>
>>
>> As far as I am concerned, that is good enough for me.
>>
>>
>> Any objections to making some coll/tuned parameters writable via MPI_T?
>>
>> If not, did I implement it correctly?
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> Here is the function that sets a value:
>>
>> int setValue_int_comm(int index, MPI_Comm comm, int *val) {
>>   int err, count;
>>   MPI_T_cvar_handle handle;
>>   /* This example assumes that the variable index */
>>   /* can be bound to a communicator */
>>   err = MPI_T_cvar_handle_alloc(index, &comm, &handle, &count);
>>   if (err != MPI_SUCCESS) return err;
>>   /* The following assumes that the variable is */
>>   /* represented by a single integer */
>>   err = MPI_T_cvar_write(handle, val);
>>   if (err != MPI_SUCCESS) return err;
>>   err = MPI_T_cvar_handle_free(&handle);
>>   return err;
>> }
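>>
>> and it can be used like this (a sketch: I assume MPI_T_init_thread() has
>> already been called, that the full variable name is
>> "coll_tuned_bcast_algorithm", and that coll_tuned_use_dynamic_rules is
>> enabled so the forced algorithm is honored):
>>
>> static int bcast_with_algo(void *buf, int count, int algo) {
>>     int index, err;
>>     MPI_Comm dup;
>>     err = MPI_T_cvar_get_index("coll_tuned_bcast_algorithm", &index);
>>     if (MPI_SUCCESS != err) return err;
>>     err = setValue_int_comm(index, MPI_COMM_WORLD, &algo);
>>     if (MPI_SUCCESS != err) return err;
>>     /* the updated value is only used at communicator creation time */
>>     err = MPI_Comm_dup(MPI_COMM_WORLD, &dup);
>>     if (MPI_SUCCESS != err) return err;
>>     err = MPI_Bcast(buf, count, MPI_INT, 0, dup);
>>     MPI_Comm_free(&dup);
>>     return err;
>> }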
>>
>> And here is the proof of concept:
>>
>> diff --git a/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c b/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c
>> index 81345b2..31ca217 100644
>> --- a/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c
>> +++ b/ompi/mca/coll/tuned/coll_tuned_bcast_decision.c
>> @@ -76,9 +76,9 @@ int ompi_coll_tuned_bcast_intra_check_forced_init (coll_tuned_force_algorithm_mc
>>      mca_base_component_var_register(&mca_coll_tuned_component.super.collm_version,
>>                                      "bcast_algorithm",
>>                                      "Which bcast algorithm is used. Can be locked down to choice of: 0 ignore, 1 basic linear, 2 chain, 3: pipeline, 4: split binary tree, 5: binary tree, 6: binomial tree.",
>> -                                    MCA_BASE_VAR_TYPE_INT, new_enum, 0, 0,
>> +                                    MCA_BASE_VAR_TYPE_INT, new_enum, 0, MCA_BASE_VAR_FLAG_SETTABLE,
>>                                      OPAL_INFO_LVL_5,
>> -                                    MCA_BASE_VAR_SCOPE_READONLY,
>> +                                    MCA_BASE_VAR_SCOPE_ALL,
>>                                      &coll_tuned_bcast_forced_algorithm);
>>      OBJ_RELEASE(new_enum);
>>      if (mca_param_indices->algorithm_param_index < 0) {
>> diff --git a/ompi/mca/coll/tuned/coll_tuned_component.c b/ompi/mca/coll/tuned/coll_tuned_component.c
>> index 9756359..ea389fd 100644
>> --- a/ompi/mca/coll/tuned/coll_tuned_component.c
>> +++