On Sep 28, 2012, at 10:38 AM, Sébastien Boisvert wrote:

> 1.5 us is very good. But I get 1.5 ms with shared queues (see above).
Oh, I mis-read (I blame it on jet-lag...). Yes, that seems waaaay too high.

You didn't do a developer build, did you? We add a bunch of extra debugging in developer builds that adds a bunch of latency. And you're not running over-subscribed, right?

>> OTOH, that's pretty bad. :-)
>
> I know, all my Ray processes are doing busy waiting; if MPI was event-driven,
> I would call my software sleepy Ray when latency is high.
>
>> I'm not sure why it would be so bad -- are you hammering the virtual router
>> with small incoming messages?
>
> There are 24 AMD Opteron(tm) Processor 6172 cores for 1 Mellanox Technologies
> MT26428 on each node. That may be the cause too.

That's a QDR HCA, right? (i.e., I assume it's very recent.)

Try running some simple point-to-point benchmarks and see if you're getting the same latency (i.e., don't run benchmarks in your app -- get a baseline with some well-known benchmarks first).

>> You might need to do a little profiling to see where the bottlenecks are.
>
> Well, with the very valuable information you provided about log_num_mtt and
> log_mtts_per_seg for the Linux kernel module mlx4_core, I think this may be
> the root of our problem.

It is definitely a cause, but perhaps not the only cause.

> We get 20-30 us on 4096 processes on Cray XE6, so it is unlikely that the
> bottleneck is in our software.

Possibly not. But every environment is different, and the same software can perform differently in different environments.

> Yes, I agree on this: non-portable code is not portable, and all with
> unexpected behaviors.

Got it.

> Ah, I see. By removing the checks in my silly patch, I can now dictate things
> to do to OMPI.

Hehe. :-)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
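
By way of illustration, a minimal point-to-point latency test of the kind suggested above could look like the sketch below. This is not from the thread; an established suite such as the OSU micro-benchmarks (osu_latency) run between two nodes gives the same kind of number with less effort. Compile with mpicc and run two ranks, one per node (e.g. "mpirun -np 2 --map-by node ./pingpong").

/* Minimal MPI ping-pong latency sketch (illustrative only).
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run:   mpirun -np 2 --map-by node ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char buf[1] = {0};        /* 1-byte message, as most latency tests use */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up the connection before timing. */
    for (int i = 0; i < 100; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* Half the round-trip time is the usual one-way latency estimate. */
        printf("one-way latency: %.2f us\n", (t1 - t0) / iters / 2.0 * 1e6);
    }

    MPI_Finalize();
    return 0;
}

If a test like this reports a few microseconds while the application still sees ~1.5 ms, the problem is more likely in how the application drives MPI (message rate, oversubscription, memory-registration limits) than in the fabric itself.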
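
For reference, log_num_mtt and log_mtts_per_seg are parameters of the mlx4_core kernel module, usually set via a modprobe configuration file. A sketch of such an entry is below; the values are illustrative only, since the right ones depend on the RAM per node, which the thread does not state.

# /etc/modprobe.d/mlx4_core.conf  (illustrative values, not from this thread)
# Registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
# With 4 KiB pages, log_num_mtt=24 and log_mtts_per_seg=3 allow
# 2^24 * 2^3 * 4 KiB = 512 GiB to be registered; the usual advice is to
# allow at least twice the node's physical RAM.
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3

The module has to be reloaded (or the node rebooted) before new values take effect.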