Hello,
On 28/09/12 10:00 AM, Jeff Squyres wrote:
> On Sep 28, 2012, at 9:50 AM, Sébastien Boisvert wrote:
>
>> I did not know about shared queues.
>>
>> It does not run out of memory. ;-)
>
> It runs out of *registered* memory, which could be far less than your actual
> RAM. Check this FAQ item in particular:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
>
I see.
$ cat /sys/module/mlx4_core/parameters/log_num_mtt
0
$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
$ getconf PAGE_SIZE
4096
With the formula

  max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
              = (2^0) * (2^0) * 4096
              = 4096 bytes

Whoa! One page.
Fixing that should help. The nodes have 32 GiB of memory, so I will ask someone to set
log_num_mtt=23 and log_mtts_per_seg=1:

  max_reg_mem = (2^23) * (2^1) * 4096 = 68719476736 bytes (64 GiB, twice the 32 GiB of physical memory)
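If I understand the FAQ correctly, that means adding a modprobe options line and then
reloading mlx4_core (or rebooting the nodes). The exact file path below is just my guess
for our systems:

  # /etc/modprobe.d/mlx4_core.conf  (path may differ per distribution)
  options mlx4_core log_num_mtt=23 log_mtts_per_seg=1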
>> But the latency is not very good.
>>
>> ** Test 1
>>
>> --mca btl_openib_max_send_size 4096 \
>> --mca btl_openib_eager_limit 4096 \
>> --mca btl_openib_rndv_eager_limit 4096 \
>> --mca btl_openib_receive_queues S,4096,2048,1024,32 \
>>
>> I get 1.5 milliseconds.
>>
>> => https://gist.github.com/3799889
>>
>> ** Test 2
>>
>> --mca btl_openib_receive_queues S,65536,256,128,32 \
>>
>> I get around 1.5 milliseconds too.
>>
>> => https://gist.github.com/3799940
>
> Are you saying 1.5us is bad?
1.5 us is very good. But I get 1.5 ms with shared queues (see above).
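Just to be clear about what I mean by latency: the round trip of a small message between
two ranks, roughly what a minimal ping-pong like the one below reports. This is only an
illustrative sketch, not the code that produced the numbers above:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      char buffer[64] = {0};
      const int iterations = 10000;

      MPI_Barrier(MPI_COMM_WORLD);
      double start = MPI_Wtime();

      /* rank 0 and rank 1 bounce a 64-byte message back and forth */
      for (int i = 0; i < iterations; i++) {
          if (rank == 0) {
              MPI_Send(buffer, 64, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buffer, 64, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              MPI_Recv(buffer, 64, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buffer, 64, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
          }
      }

      if (rank == 0) {
          double elapsed = MPI_Wtime() - start;
          printf("average round trip: %.2f microseconds\n",
                 1e6 * elapsed / iterations);
      }

      MPI_Finalize();
      return 0;
  }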
> That's actually not bad at all. On the most modern hardware with a bunch of
> software tuning, you can probably get closer to 1us.
>
>> With my virtual router I am sure I can get something around 270 microseconds.
>
> OTOH, that's pretty bad. :-)
I know. All my Ray processes are busy waiting; if MPI were event-driven, I would have to
call my software sleepy Ray when latency is this high.
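To give an idea of the pattern, each rank spins more or less like this. This is a
simplified sketch, not the actual Ray code; the done flag and the dispatch step are
placeholders:

  #include <mpi.h>

  /* Simplified sketch of the busy-waiting pattern (not the actual Ray code). */
  void poll_for_messages(const int *done)
  {
      while (!*done) {
          int flag = 0;
          MPI_Status status;

          /* Poll for an incoming message without blocking. */
          MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);

          if (flag) {
              /* Assume small messages that fit in one buffer. */
              char buffer[4096];
              int count = 0;
              MPI_Get_count(&status, MPI_BYTE, &count);
              MPI_Recv(buffer, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              /* ... dispatch the message to the application here ... */
          }

          /* ... advance some local computation here, then poll again ... */
      }
  }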
>
> I'm not sure why it would be so bad -- are you hammering the virtual router
> with small incoming messages?
Each node has 24 AMD Opteron 6172 cores sharing a single Mellanox MT26428 HCA.
That may be part of the cause too.
> You might need to do a little profiling to see where the bottlenecks are.
>
Well, given the very valuable information you provided about log_num_mtt and
log_mtts_per_seg for the Linux kernel module mlx4_core, I think that registered-memory
limit may be the root of our problem. We get 20-30 us with 4096 processes on a Cray XE6,
so it is unlikely that the bottleneck is in our software.
>> Just out of curiosity, does Open-MPI utilize heavily negative values
>> internally for user-provided MPI tags ?
>
> I know offhand we use them for collectives. Something is tickling my brain
> that we use them for other things, too (CID allocation, perhaps?), but I
> don't remember offhand.
>
The only collectives I use are a few MPI_Barrier calls.
> I'm just saying: YMMV. Buyer be warned. And all that. :-)
>
Yes, I agree: non-portable code is, well, not portable, and comes with unexpected
behaviors.
>> If the negative tags are internal to Open-MPI, my code will not touch
>> these private variables, right ?
>
> It's not a variable that's the issue. If you do a receive for tag -3 and
> OMPI sends an internal control message with tag -3, you might receive it
> instead of OMPI's core. And that would be Bad.
>
Ah, I see. By removing those checks in my silly patch, I can now dictate things to OMPI
that I really should not. Hehe.
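For what it is worth, the portable way to stay out of OMPI's way is to keep user tags in
the 0..MPI_TAG_UB range. A quick sketch of how to query that upper bound:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      /* MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD;
         valid user tags are 0..MPI_TAG_UB. */
      int flag = 0;
      int *tag_ub = NULL;
      MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);

      if (flag)
          printf("valid user tags: 0 .. %d\n", *tag_ub);

      MPI_Finalize();
      return 0;
  }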