Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Ralph Castain
I also assumed that was true. However, when communicating between two procs, the TCP stack will use a shortcut in the loopback code if the two procs are known to be on the same node. In the case of multiple orteds, it isn't clear to me that the stack knows this situation, as the orteds, at least, …
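One way to check which path the stack actually takes is to raise the BTL verbosity and watch which interfaces the TCP BTL selects for same-node peers; a minimal sketch, assuming the benchmark binary is called skampi (the exact log format varies by Open MPI version):

    # Log the TCP BTL's interface/endpoint choices, so you can see whether
    # traffic between same-node peers goes over loopback or a regular NIC.
    mpirun -np 2 --mca btl self,tcp --mca btl_base_verbose 100 ./skampi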

Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Gilles Gouaillardet
Though I did not repeat it, I assumed --mca btl tcp,self is always used, as described in the initial email.

Cheers,

Gilles

On Monday, January 25, 2016, Ralph Castain wrote:
> I believe the performance penalty will still always be greater than zero,
> however, as the TCP …

Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Federico Reghenzani
Ok, thank you Ralph and Gilles, I will continue testing and I'll update you if there is any news.

Cheers,

Federico

2016-01-25 14:23 GMT+01:00 Ralph Castain:
> I believe the performance penalty will still always be greater than zero,
> however, as the TCP stack is smart …

Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Ralph Castain
I believe the performance penalty will still always be greater than zero, however, as the TCP stack is smart enough to take an optimized path when doing a loopback as opposed to inter-node communication.

On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote: …

Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Gilles Gouaillardet
Federico,

I did not expect 0% degradation, since you are now comparing two different cases:
- 1 orted means tasks are bound on sockets
- 16 orteds means tasks are not bound.

A quick way to improve things is to use a wrapper that binds the MPI tasks:

mpirun --bind-to none wrapper.sh skampi

wrapper.sh can …
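A minimal wrapper.sh sketch, assuming numactl is installed; it relies on the OMPI_COMM_WORLD_LOCAL_RANK environment variable that Open MPI sets in each task's environment:

    #!/bin/sh
    # wrapper.sh (sketch): pin each MPI task to one core, chosen from its
    # node-local rank, then exec the real benchmark binary with its args.
    rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    exec numactl --physcpubind="$rank" "$@"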

Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Federico Reghenzani
Thank you Gilles, you're right: with --bind-to none we have ~15% degradation rather than 50%. It's much better now, but I think it should be (in theory) around 0%. The benchmark is MPI bound (the standard benchmark provided with SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, …

Re: [OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Gilles Gouaillardet
Federico,

Unless you already took care of that, I would guess all 16 orteds bound their child MPI tasks to socket 0. Can you try

mpirun --bind-to none ...

Btw, is your benchmark application CPU bound? Memory bound? MPI bound?

Cheers,

Gilles

On Monday, January 25, 2016, Federico …
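To confirm where each orted actually places its tasks, mpirun's --report-bindings option prints each task's binding at launch time; a sketch, again assuming the skampi binary name:

    # If all 16 orteds bind their task to socket 0, it shows up here.
    mpirun --report-bindings -np 16 ./skampi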

[OMPI devel] Benchmark with multiple orteds

2016-01-25 Thread Federico Reghenzani
Hello, we have executed a benchmark (SkaMPI) on the same machine (a 32-core Intel Xeon, x86_64) with these two configurations:
- 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
- 16 orteds with 1 process each (that uses TCP)

We use a custom RAS to allow multiple orteds on the …
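For reference, the first configuration as a command line; a sketch assuming the benchmark binary is named skampi (the 16-orted run goes through our custom RAS, so its invocation is omitted here):

    # 1 orted hosting 16 ranks, point-to-point traffic forced onto TCP
    mpirun -np 16 --mca btl self,tcp ./skampi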