Federico,

I would not expect 0% degradation, since you are now comparing two different
cases:
- 1 orted: tasks are bound to sockets
- 16 orteds: tasks are not bound at all

A quick way to improve things is to use a wrapper that binds the MPI tasks
itself:
mpirun --bind-to none wrapper.sh skampi

wrapper.sh can use an environment variable to retrieve the rank id
(PMI(X)_RANK, iirc) and then bind the task with taskset or the hwloc utilities.
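
Something like this might work (an untested sketch: it assumes the runtime
exports PMIX_RANK or OMPI_COMM_WORLD_RANK, and pins one rank per core):

#!/bin/sh
# wrapper.sh - bind this MPI task to a core chosen from its rank,
# then exec the real program (skampi and its arguments)
rank=${PMIX_RANK:-$OMPI_COMM_WORLD_RANK}
exec taskset -c "$rank" "$@"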

mpirun --tag-output grep Cpus_allowed_list /proc/self/status
with 1 orted should return the same output as
mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list
/proc/self/status
with 16 orteds
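
For example, a quick (hypothetical) way to compare the two runs, stripping
the per-rank output tags so the CPU lists can be diffed (the exact mpirun
options for your 16-orted setup will differ):

mpirun -np 16 --tag-output grep Cpus_allowed_list /proc/self/status \
  | sed 's/^.*<stdout>://' | sort > bound.txt
mpirun -np 16 --tag-output --bind-to none wrapper.sh \
  grep Cpus_allowed_list /proc/self/status \
  | sed 's/^.*<stdout>://' | sort > wrapped.txt
diff bound.txt wrapped.txt   # no output means the bindings match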

Once wrapper.sh works correctly, the SkaMPI degradation with 16 orteds
should be smaller.

Cheers,

Gilles

On Monday, January 25, 2016, Federico Reghenzani <
federico1.reghenz...@mail.polimi.it> wrote:

> Thank you Gilles, you're right: with --bind-to none we see ~15%
> degradation rather than 50%.
>
> It's much better now, but I think it should be (in theory) around 0%.
> The benchmark is MPI bound (it is the standard benchmark provided with
> SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce,
> MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan, and
> MPI_Send/Recv.
>
> Cheers,
> Federico
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
>
> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com
> <javascript:_e(%7B%7D,'cvml','gilles.gouaillar...@gmail.com');>>:
>
>> Federico,
>>
>> Unless you already took care of that, I would guess all 16 orteds
>> bound their child MPI tasks to socket 0.
>>
>> Can you try
>> mpirun --bind-to none ...
>>
>> btw, is your benchmark application CPU bound? memory bound? MPI bound?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Monday, January 25, 2016, Federico Reghenzani <
>> federico1.reghenz...@mail.polimi.it> wrote:
>>
>>> Hello,
>>>
>>> we executed a benchmark (SkaMPI) on the same machine (a 32-core Intel
>>> Xeon, x86_64) with these two configurations:
>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
>>> - 16 orteds with 1 process each (also using TCP)
>>>
>>> We use a custom RAS to allow multiple orteds on the same machine (I know
>>> it seems nonsensical to run multiple orteds on the same machine for the
>>> same application, but we are doing some experiments with migration).
>>>
>>> Initially we expected approximately the same performance in both cases
>>> (we have 16 processes communicating via TCP either way), but we see a
>>> degradation of 50%, and we are sure it is not overhead from orted
>>> initialization.
>>>
>>> Do you have any idea how multiple orteds can affect the performance of
>>> the processes?
>>>
>>>
>>> Cheers,
>>> Federico
>>> __
>>> Federico Reghenzani
>>> M.Eng. Student @ Politecnico di Milano
>>> Computer Science and Engineering
>>>
>>>
>>>
>>
>
>
