Yes, --mca btl tcp,self is always used. We found the problem: we restricted
the interfaces with --mca btl_tcp_if_include eth0 and now we get the same
performance (in fact the multiple-orted case seems slightly faster). I think
there is some interference from the other interfaces, although I cannot
figure out why the "standard" single-orted configuration is faster with
eth0 alone than with lo,eth0.
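
For reference, the single-orted invocation now looks roughly like this
(process count and benchmark binary as in our tests, everything else
unchanged):

mpirun -np 16 --mca btl tcp,self --mca btl_tcp_if_include eth0 skampi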

oob/btl tcp seem to work normally with multiple orteds. We only changed a
small thing in btl sm to make it work with multiple orteds, due to a problem
with all of them using the same shared memory directory (anyway, sm is not
used in this benchmark).
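
As a side note, a way to avoid the session-directory collision without
patching btl sm might be to give every orted its own session directory base,
e.g. something like (untested sketch; orte_tmpdir_base is the MCA parameter
I believe controls this, and <id> is whatever unique tag your launcher can
provide):

mpirun --mca orte_tmpdir_base /tmp/ompi-session-<id> ...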

Cheers,
Federico

On Mon, 25 Jan 2016, 18:02 Ralph Castain <r...@open-mpi.org> wrote:

> I also assumed that was true. However, when communicating between two
> procs, the TCP stack will use a shortcut in the loopback code if the two
> procs are known to be on the same node. In the case of multiple orteds, it
> isn't clear to me that the stack recognizes this situation, as the orteds,
> at least, must have unique IP addresses and so think they are on separate
> nodes.
>
> On Mon, Jan 25, 2016 at 6:32 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Though I did not repeat it, I assumed --mca btl tcp,self is always used,
>> as described in the initial email
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Monday, January 25, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> I believe the performance penalty will still always be greater than
>>> zero, however, as the TCP stack is smart enough to take an optimized path
>>> when doing a loopback as opposed to inter-node communication.
>>>
>>>
>>> On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Federico,
>>>>
>>>> I did not expect 0% degradation, since you are now comparing two
>>>> different cases:
>>>> with 1 orted, the tasks are bound to sockets;
>>>> with 16 orteds, the tasks are not bound.
>>>>
>>>> a quick way to improve things is to use a wrapper that binds the MPI
>>>> tasks:
>>>> mpirun --bind-to none wrapper.sh skampi
>>>>
>>>> wrapper.sh can use an environment variable to retrieve the rank id
>>>> (PMI(X)_RANK iirc) and then bind the task with taskset or the hwloc
>>>> utils.
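>>>>
>>>> for instance, a minimal wrapper.sh could look like this (untested
>>>> sketch; it assumes Open MPI exports OMPI_COMM_WORLD_RANK to each task
>>>> and that a simple one-core-per-rank mapping is acceptable, adjust as
>>>> needed):
>>>>
>>>> #!/bin/sh
>>>> # bind this MPI task to the core matching its rank, then run the real binary
>>>> RANK=${OMPI_COMM_WORLD_RANK:-${PMIX_RANK:-0}}
>>>> exec taskset -c "$RANK" "$@"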
>>>>
>>>> mpirun --tag-output grep Cpus_allowed_list /proc/self/status
>>>> with 1 orted should return the same output as
>>>> mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list
>>>> /proc/self/status
>>>> with 16 orteds
>>>>
>>>> once wrapper.sh works fine, the skampi degradation with 16 orteds
>>>> should be smaller.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Monday, January 25, 2016, Federico Reghenzani <
>>>> federico1.reghenz...@mail.polimi.it> wrote:
>>>>
>>>>> Thank you Gilles, you're right: with --bind-to none we have ~15%
>>>>> degradation rather than 50%.
>>>>>
>>>>> It's much better now, but I think it should be (in theory) around 0%.
>>>>> The benchmark is MPI bound (it is the standard benchmark provided with
>>>>> SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce,
>>>>> MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan,
>>>>> MPI_Send/Recv.
>>>>>
>>>>> Cheers,
>>>>> Federico
>>>>> __
>>>>> Federico Reghenzani
>>>>> M.Eng. Student @ Politecnico di Milano
>>>>> Computer Science and Engineering
>>>>>
>>>>>
>>>>>
>>>>> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <
>>>>> gilles.gouaillar...@gmail.com>:
>>>>>
>>>>>> Federico,
>>>>>>
>>>>>> unless you already took care of that, I would guess all 16 orteds
>>>>>> bound their child MPI tasks to socket 0
>>>>>>
>>>>>> can you try
>>>>>> mpirun --bind-to none ...
>>>>>>
>>>>>> btw, is your benchmark application cpu bound? memory bound? MPI
>>>>>> bound?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>> On Monday, January 25, 2016, Federico Reghenzani <
>>>>>> federico1.reghenz...@mail.polimi.it> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> we have executed a benchmark (SkaMPI) on the same machine (32-core
>>>>>>> Intel Xeon, x86_64) with these two configurations:
>>>>>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl
>>>>>>> self,tcp)
>>>>>>> - 16 orteds with 1 process each (also using TCP)
>>>>>>>
>>>>>>> We use a custom RAS to allow multiple orteds on the same machine (I
>>>>>>> know it seems nonsensical to have multiple orteds on the same
>>>>>>> machine for the same application, but we are doing some experiments
>>>>>>> on migration).
>>>>>>>
>>>>>>> Initially we expected approximately the same performance in both
>>>>>>> cases (16 processes communicating via TCP either way), but we see a
>>>>>>> degradation of 50%, and we are sure it is not overhead due to orted
>>>>>>> initialization.
>>>>>>>
>>>>>>> Do you have any idea how multiple orteds can influence the
>>>>>>> processes' performance?
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Federico
>>>>>>> __
>>>>>>> Federico Reghenzani
>>>>>>> M.Eng. Student @ Politecnico di Milano
>>>>>>> Computer Science and Engineering
>>>>>>>
>>>>>>>
>>>>>>>