On Sep 28, 2012, at 10:38 AM, Sébastien Boisvert wrote:

> 1.5 us is very good. But I get 1.5 ms with shared queues (see above).

Oh, I mis-read (I blame it on jet-lag...).

Yes, that seems waaaay too high.

You didn't do a developer build, did you?  We add a lot of extra debugging in 
developer builds, which adds quite a bit of latency.  And you're not running 
over-subscribed, right?

>> OTOH, that's pretty bad.  :-)
> 
> I know; all my Ray processes are busy-waiting.  If MPI were event-driven,
> I would call my software sleepy Ray when latency is high.
> 
>> I'm not sure why it would be so bad -- are you hammering the virtual router 
>> with small incoming messages?
> 
> There are 24 AMD Opteron(tm) Processor 6172 cores for 1 Mellanox Technologies 
> MT26428 on each node. That may be the cause too.

That's a QDR HCA, right?  (i.e., I assume it's very recent)

Try running some simple point-to-point benchmarks and see if you're getting the 
same latency (i.e., don't run benchmarks in your app -- get a baseline with 
some well-known benchmarks first).
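
If you don't have something like NetPIPE or the OSU micro-benchmarks handy, even 
a trivial ping-pong will give you a rough one-way latency number to compare 
against.  Here's a minimal sketch (the message size and iteration count are 
arbitrary placeholders; run it with exactly 2 ranks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char buf[1] = { 0 };
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        /* half the round-trip time = approximate one-way latency */
        printf("approx one-way latency: %g us\n",
               (t1 - t0) / iters / 2.0 * 1e6);
    }

    MPI_Finalize();
    return 0;
}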

>> You might need to do a little profiling to see where the bottlenecks are.
> 
> Well, with the very valuable information you provided about log_num_mtt and 
> log_mtts_per_seg for the Linux kernel module mlx4_core, I think this may be 
> the root of our problem.

It is definitely a cause, but perhaps not the only cause.
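
For reference, the amount of memory that mlx4 can register works out to roughly 
(2^log_num_mtt) * (2^log_mtts_per_seg) * page_size, and you generally want that 
to cover at least all the physical RAM on the node.  A small sketch of the 
arithmetic (the parameter values below are just examples, not recommendations; 
check /sys/module/mlx4_core/parameters/ on your nodes for the real ones):

#include <stdio.h>

int main(void)
{
    /* Example values only; read the real ones from
     * /sys/module/mlx4_core/parameters/ on your nodes. */
    const unsigned long long page_size        = 4096;  /* typical x86_64 page size */
    const unsigned long long log_num_mtt      = 20;
    const unsigned long long log_mtts_per_seg = 3;

    unsigned long long max_reg = (1ULL << log_num_mtt) *
                                 (1ULL << log_mtts_per_seg) * page_size;

    /* 2^20 * 2^3 * 4096 bytes = 32 GB of registerable memory */
    printf("max registerable memory: %llu GB\n",
           max_reg / (1024ULL * 1024 * 1024));
    return 0;
}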

> We get 20-30 us on 4096 processes on Cray XE6, so it is unlikely that the 
> bottleneck is in our software.

Possibly not.  But every environment is different, and the same software can 
perform differently in different environments.

> Yes, I agree on this: non-portable code is not portable, and it comes with 
> unexpected behaviors.

Got it.

> Ah, I see.  By removing the checks in my silly patch, I can now dictate to 
> OMPI what to do.  Hehe.

:-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

