Federico,

I did not expect 0% degradation, since you are now comparing two different cases: with 1 orted, the tasks are bound to sockets; with 16 orteds, the tasks are not bound at all.

A quick way to improve things is to use a wrapper that binds the MPI tasks:

mpirun --bind-to none wrapper.sh skampi

wrapper.sh can use an environment variable to retrieve the rank id (PMI(X)_RANK, iirc) and then bind the task with taskset or the hwloc utils.
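For illustration, wrapper.sh could be as simple as the sketch below. This is an untested outline: the rank variable (OMPI_COMM_WORLD_RANK here, with PMIX_RANK / PMI_RANK as fallbacks) and the one-core-per-rank mapping are assumptions you should adapt to your launcher and node topology.

#!/bin/sh
# wrapper.sh -- bind this MPI task to one core, then exec the real program.
# The rank variable is an assumption: Open MPI exports OMPI_COMM_WORLD_RANK,
# and PMIX_RANK / PMI_RANK may be set instead depending on the launcher;
# check "env | grep -i rank" from a task on your system first.
rank=${OMPI_COMM_WORLD_RANK:-${PMIX_RANK:-${PMI_RANK:-0}}}

# Naive placement: rank N -> core N. On your 32-core node you may instead
# want to reproduce the per-socket layout the 1-orted run had (see lstopo).
core=$rank

# Replace this shell with the pinned application.
exec taskset -c "$core" "$@"

You would then launch, e.g., mpirun --bind-to none -np 16 ./wrapper.sh skampi. If you prefer the hwloc utils over taskset, hwloc-bind core:$rank -- "$@" should work the same way.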
As a check,

mpirun --tag-output grep Cpus_allowed_list /proc/self/status

with 1 orted should return the same output as

mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list /proc/self/status

with 16 orteds. Once wrapper.sh works fine, the SkaMPI degradation with 16 orteds should be smaller.

Cheers,

Gilles

On Monday, January 25, 2016, Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:

> Thank you Gilles, you're right: with --bind-to none we have ~15% degradation rather than 50%.
>
> It's much better now, but I think it should (in theory) be around 0%. The benchmark is MPI bound (the standard benchmark provided with SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce, MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan, MPI_Send/Recv.
>
> Cheers,
> Federico
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
>
>> Federico,
>>
>> unless you already took care of that, I would guess all 16 orteds bound their children MPI tasks to socket 0.
>>
>> Can you try
>> mpirun --bind-to none ...
>>
>> Btw, is your benchmark application CPU bound? Memory bound? MPI bound?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Monday, January 25, 2016, Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
>>
>>> Hello,
>>>
>>> we have executed a benchmark (SkaMPI) on the same machine (a 32-core Intel Xeon, x86_64) with these two configurations:
>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
>>> - 16 orteds with 1 process each (also using TCP)
>>>
>>> We use a custom RAS to allow multiple orteds on the same machine (I know it seems nonsense to have multiple orteds on the same machine for the same application, but we are doing some experiments on migration).
>>>
>>> Initially we expected approximately the same performance in both cases (we have 16 processes communicating via TCP either way), but we see a degradation of 50%, and we are sure it is not overhead from orted initialization.
>>>
>>> Do you have any idea how multiple orteds can influence the processes' performance?
>>>
>>> Cheers,
>>> Federico
>>> __
>>> Federico Reghenzani
>>> M.Eng. Student @ Politecnico di Milano
>>> Computer Science and Engineering
>>>
>>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/01/18499.php
>>
>
>