I also assumed that was true. However, when communicating between two
procs, the TCP stack will use a shortcut in the loopback code if the two
procs are known to be on the same node. In the case of multiple orteds, it
isn't clear to me that the stack recognizes this situation, since the
orteds, at least, [...]
Though I did not repeat it, I assumed --mca btl tcp,self is always used, as
described in the initial email.
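On Linux, that shortcut is visible in the routing layer: packets addressed
to any of the host's own IPs match the "local" routing table and are
delivered over the loopback path without ever touching the NIC. A quick way
to check (a sketch, assuming the iproute2 tools; the address below is a
placeholder for the node's own IP):

  # Which route would the kernel pick to reach this host's own address?
  # "local ... dev lo" in the output confirms the loopback shortcut.
  ip route get 192.0.2.10

  # List the local table that holds these shortcut routes:
  ip route show table local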
Cheers,
Gilles
On Monday, January 25, 2016, Ralph Castain wrote:
> [...]
Ok, thank you Ralph and Gilles, I will continue testing and I'll update you
if there is any news.
Cheers,
Federico
2016-01-25 14:23 GMT+01:00 Ralph Castain:
> [...]
I believe the performance penalty will still always be greater than zero,
however, even though the TCP stack is smart enough to take an optimized
path for loopback as opposed to inter-node communication.
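One way to quantify that residual penalty is to run the same benchmark with
the shared-memory BTL instead of TCP (a sketch; skampi stands in for the
actual benchmark invocation, and the on-node BTL is named vader or sm
depending on the Open MPI release):

  # Force TCP even for on-node traffic:
  mpirun -np 16 --mca btl self,tcp skampi

  # Use the shared-memory BTL for comparison:
  mpirun -np 16 --mca btl self,vader skampi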
On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
Federico,
I did not expect 0% degradation, since you are now comparing two different
cases:
- 1 orted means tasks are bound to sockets;
- 16 orteds means tasks are not bound at all.
A quick way to improve things is to use a wrapper that binds the MPI tasks:
mpirun --bind-to none wrapper.sh skampi
wrapper.sh can bind each task based on its local rank, as in the sketch
below.
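A minimal sketch of such a wrapper, assuming a Linux node with numactl
installed and two NUMA sockets (the socket count and the even/odd spreading
are assumptions to adapt to the real topology). Open MPI exports
OMPI_COMM_WORLD_LOCAL_RANK to every process it launches:

  #!/bin/sh
  # wrapper.sh - bind each MPI task before exec'ing the application.
  rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

  # Spread ranks across the two sockets: even ranks on socket 0,
  # odd ranks on socket 1.
  socket=$((rank % 2))

  # Run the real application bound to that socket's cores and memory.
  exec numactl --cpunodebind=$socket --membind=$socket "$@"

It is launched exactly as above (mpirun --bind-to none wrapper.sh skampi),
so that mpirun itself imposes no binding and the wrapper is free to place
each task.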
Thank you Gilles, you're right: with --bind-to none we see ~15%
degradation rather than 50%.
It's much better now, but I think it should be (in theory) around 0%.
The benchmark is MPI bound (the standard benchmark provided with SkaMPI);
it tests functions such as MPI_Bcast, MPI_Barrier, [...]
Federico,
unless you already took care of that, I would guess all 16 orteds bound
their children MPI tasks to socket 0.
Can you try
mpirun --bind-to none ...
By the way, is your benchmark application CPU bound? memory bound? MPI
bound?
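To confirm where the tasks actually land, Open MPI can print each rank's
binding at launch with the --report-bindings option (skampi again stands in
for the real invocation):

  mpirun --report-bindings -np 16 skampi

  # Or, from inside a running task on Linux:
  grep Cpus_allowed_list /proc/self/status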
Cheers,
Gilles
On Monday, January 25, 2016, Federico wrote:
Hello,
we have executed a benchmark (SkaMPI) on the same machine (32-core Intel
Xeon, x86_64) with these two configurations:
- 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
- 16 orteds with 1 process each (also using TCP)
We use a custom RAS to allow multiple orteds on the same node. [...]
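For reference, the first configuration corresponds to a launch like the
following (a sketch; skampi stands in for the actual benchmark command
line):

  # 1 orted, 16 ranks, TCP forced even for on-node traffic:
  mpirun -np 16 --mca btl self,tcp skampi

The second configuration (one orted per rank on the same host) is what the
custom RAS makes possible, since a stock mpirun starts a single daemon per
node.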