I believe the performance penalty will still always be greater than zero, however, since the TCP stack is smart enough to take an optimized path for loopback traffic as opposed to the full path used for inter-node communication.
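To get a feel for how large that loopback advantage actually is, a quick check outside of MPI might help. This is only a sketch: it assumes iperf3 is installed and that a second node is reachable; the host name below is a placeholder.

    # on the node under test: start an iperf3 server in the background
    iperf3 -s &

    # same node -> the kernel's loopback path
    iperf3 -c 127.0.0.1

    # run this one from a second node -> full NIC/driver path ("testnode" is a placeholder)
    iperf3 -c testnode

The difference between the two throughput numbers gives a rough idea of how much the loopback fast path is worth on that hardware. A sketch of the wrapper Gilles suggests further down is also appended after the quoted thread.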
On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Federico,
>
> I did not expect 0% degradation, since you are now comparing two different cases:
> 1 orted means tasks are bound on sockets,
> 16 orted means tasks are not bound.
>
> A quick way to improve things is to use a wrapper that binds the MPI tasks:
> mpirun --bind-to none wrapper.sh skampi
>
> wrapper.sh can use an environment variable to retrieve the rank id (PMI(X)_RANK iirc) and then bind the task with taskset or hwloc utils.
>
> mpirun --tag-output grep Cpus_allowed_list /proc/self/status
> with 1 orted should return the same output as
> mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list /proc/self/status
> with 16 orted.
>
> When wrapper.sh works fine, the skampi degradation should be smaller with 16 orted.
>
> Cheers,
>
> Gilles
>
> On Monday, January 25, 2016, Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
>
>> Thank you Gilles, you're right: with --bind-to none we have ~15% degradation rather than 50%.
>>
>> It's much better now, but I think it should be (in theory) around 0%.
>> The benchmark is MPI bound (the standard benchmark provided with SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce, MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan, MPI_Send/Recv.
>>
>> Cheers,
>> Federico
>> __
>> Federico Reghenzani
>> M.Eng. Student @ Politecnico di Milano
>> Computer Science and Engineering
>>
>>
>> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
>>
>>> Federico,
>>>
>>> unless you already took care of that, I would guess all 16 orted bound their children MPI tasks on socket 0.
>>>
>>> Can you try
>>> mpirun --bind-to none ...
>>>
>>> btw, is your benchmark application cpu bound? memory bound? MPI bound?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Monday, January 25, 2016, Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
>>>
>>>> Hello,
>>>>
>>>> we have executed a benchmark (SkaMPI) on the same machine (32-core Intel Xeon x86_64) with these two configurations:
>>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
>>>> - 16 orted, each with 1 process (that uses TCP)
>>>>
>>>> We use a custom RAS to allow multiple orted on the same machine (I know that it seems nonsense to have multiple orteds on the same machine for the same application, but we are doing some experiments for migration).
>>>>
>>>> Initially we expected approximately the same performance in both cases (we have 16 processes communicating via TCP in both cases), but we have a degradation of 50%, and we are sure that it is not an overhead due to orted initialization.
>>>>
>>>> Do you have any idea how multiple orteds can influence the processes' performance?
>>>>
>>>>
>>>> Cheers,
>>>> Federico
>>>> __
>>>> Federico Reghenzani
>>>> M.Eng. Student @ Politecnico di Milano
>>>> Computer Science and Engineering
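For reference, here is a minimal sketch of the wrapper Gilles describes above. It is only an illustration: it assumes bash, that taskset is installed, that the launcher exports PMIX_RANK or OMPI_COMM_WORLD_RANK for each task, and that pinning rank N straight to core N is acceptable (a real setup would probably want to spread ranks across sockets):

    #!/bin/bash
    # wrapper.sh -- bind the current MPI task to a core derived from its rank.
    # Falls back from PMIX_RANK to OMPI_COMM_WORLD_RANK; aborts if neither is set.
    rank=${PMIX_RANK:-${OMPI_COMM_WORLD_RANK:?no rank environment variable found}}
    # Pin this shell, and the benchmark it exec's into, to core number $rank.
    exec taskset -c "$rank" "$@"

Then, as in Gilles' example (wrapper.sh must be executable and reachable on every node):

    mpirun --bind-to none ./wrapper.sh skampi

The grep Cpus_allowed_list check quoted above is a good way to confirm the binding actually took effect.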