Yes, --mca btl tcp,self is always used. We found the problem: we restricted the interfaces with --mca btl_tcp_if_include eth0 and now we get the same performance (actually, the multiple-orted case seems slightly faster). I think something is going wrong with the other interfaces, but I cannot figure out why the "standard" 1-orted configuration is faster with eth0 alone than with lo,eth0.
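For reference, the 1-orted run is now launched roughly like this (SkaMPI's own arguments are omitted and the binary path is illustrative; the only change is the added interface restriction):

  mpirun -np 16 --mca btl tcp,self --mca btl_tcp_if_include eth0 ./skampi ...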
oob/btl tcp seem to work normally with multiple orteds. We only changed a small thing in btl sm to make it work with multiple orteds, due to a problem with the shared memory directory being the same (anyway, sm is not used in this benchmark).

Cheers,
Federico

On Mon, 25 Jan 2016, 18:02 Ralph Castain <r...@open-mpi.org> wrote:

> I also assumed that was true. However, when communicating between two procs, the TCP stack will use a shortcut in the loopback code if the two procs are known to be on the same node. In the case of multiple orteds, it isn't clear to me that the stack knows about this situation, as the orteds, at least, must have unique IP addresses and think they are on separate nodes.
>
> On Mon, Jan 25, 2016 at 6:32 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> Though I did not repeat it, I assumed --mca btl tcp,self is always used, as described in the initial email.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Monday, January 25, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> I believe the performance penalty will still always be greater than zero, however, as the TCP stack is smart enough to take an optimized path when doing a loopback as opposed to inter-node communication.
>>>
>>> On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Federico,
>>>>
>>>> I did not expect 0% degradation, since you are now comparing two different cases:
>>>> - 1 orted means tasks are bound to sockets
>>>> - 16 orteds means tasks are not bound
>>>>
>>>> A quick way to improve things is to use a wrapper that binds the MPI tasks:
>>>> mpirun --bind-to none wrapper.sh skampi
>>>>
>>>> wrapper.sh can use an environment variable to retrieve the rank id (PMI(X)_RANK iirc) and then bind the tasks with taskset or the hwloc utils.
>>>>
>>>> mpirun --tag-output grep Cpus_allowed_list /proc/self/status
>>>> with 1 orted should return the same output as
>>>> mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list /proc/self/status
>>>> with 16 orteds.
>>>>
>>>> Once wrapper.sh works fine, the SkaMPI degradation should be smaller with 16 orteds.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Monday, January 25, 2016, Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
>>>>
>>>>> Thank you Gilles, you're right: with --bind-to none we have ~15% degradation rather than 50%.
>>>>>
>>>>> It's much better now, but I think it should be (in theory) around 0%. The benchmark is MPI bound (the standard benchmark provided with SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce, MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan, MPI_Send/Recv.
>>>>>
>>>>> Cheers,
>>>>> Federico
>>>>> __
>>>>> Federico Reghenzani
>>>>> M.Eng. Student @ Politecnico di Milano
>>>>> Computer Science and Engineering
>>>>>
>>>>> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <gilles.gouaillar...@gmail.com>:
>>>>>
>>>>>> Federico,
>>>>>>
>>>>>> Unless you already took care of that, I would guess all 16 orteds bound their children MPI tasks to socket 0.
>>>>>>
>>>>>> Can you try
>>>>>> mpirun --bind-to none ...
>>>>>>
>>>>>> btw, is your benchmark application CPU bound? memory bound? MPI bound?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Monday, January 25, 2016, Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> we have executed a benchmark (SkaMPI) on the same machine (32-core Intel Xeon x86_64) with these two configurations:
>>>>>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
>>>>>>> - 16 orteds with 1 process each (also using TCP)
>>>>>>>
>>>>>>> We use a custom RAS to allow multiple orteds on the same machine (I know it seems nonsense to have multiple orteds on the same machine for the same application, but we are doing some experiments for migration).
>>>>>>>
>>>>>>> Initially we expected approximately the same performance in both cases (we have 16 processes communicating via TCP in both cases), but we see a degradation of 50%, and we are sure it is not an overhead due to orted initialization.
>>>>>>>
>>>>>>> Do you have any idea how multiple orteds can influence the processes' performance?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Federico
>>>>>>> __
>>>>>>> Federico Reghenzani
>>>>>>> M.Eng. Student @ Politecnico di Milano
>>>>>>> Computer Science and Engineering
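
(For completeness, a minimal wrapper.sh along the lines Gilles describes above could look like the sketch below. It assumes Open MPI exports OMPI_COMM_WORLD_RANK, or PMIX_RANK, to each launched task, and that the node has 16 cores numbered 0-15 with one task per core; this is an illustrative sketch, not the exact script used in these tests.)

  #!/bin/sh
  # Bind this MPI task to a core chosen from its rank, then exec the real program.
  # OMPI_COMM_WORLD_RANK (or PMIX_RANK) is set by mpirun for each launched task.
  rank=${OMPI_COMM_WORLD_RANK:-${PMIX_RANK:-0}}
  core=$((rank % 16))            # assumes cores numbered 0-15 on this node
  exec taskset -c "$core" "$@"   # "$@" is e.g. skampi and its arguments

It would then be launched as in Gilles' example:

  mpirun --bind-to none wrapper.sh skampi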