Per my other disclaimer, I'm trolling through my disastrous inbox and finding 
some orphaned / never-answered emails.  Sorry for the delay!


On Jun 2, 2010, at 4:36 PM, Jed Brown wrote:

> The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), 
> connected
> with QDR InfiniBand.  The benchmark loops over
> 
>   MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);
> 
> with nlocal=10000 (80 KiB messages) 10000 times, so it normally runs in
> a few seconds.  

Just to be clear -- you're running 8 procs locally on a single (16-core) node, 
right?  (Cisco is an Intel partner -- I don't follow the AMD line much.)  So this 
should all be local communication with no external network involved, right?
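
For anyone following along, the loop amounts to something like this (a minimal 
sketch -- I'm assuming MPI_Wtime() for the timing and inventing the setup around 
the MPI_Allgather call you posted):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      const int nlocal = 10000, iters = 10000;  /* 80 KiB per rank, 10000 loops */
      int rank, size, i;
      double *localdata, *globaldata, t0, t1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      localdata  = malloc(nlocal * sizeof(double));
      globaldata = malloc((size_t)size * nlocal * sizeof(double));
      for (i = 0; i < nlocal; i++)
          localdata[i] = rank + i;

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < iters; i++)
          MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                        globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
      t1 = MPI_Wtime();

      if (rank == 0)
          printf("%d iterations of MPI_Allgather: %.4e s\n", iters, t1 - t0);

      free(localdata);
      free(globaldata);
      MPI_Finalize();
      return 0;
  }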

> #  JOB       TIME (s)      HOST
> 
> ompirun
> lsf.o240562 killed       8*a6200
> lsf.o240563 9.2110e+01   8*a6200
> lsf.o240564 1.5638e+01   8*a6237
> lsf.o240565 1.3873e+01   8*a6228

Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!

> ompirun -mca btl self,sm
> lsf.o240574 1.6916e+01   8*a6237
> lsf.o240575 1.7456e+01   8*a6200
> lsf.o240576 1.4183e+01   8*a6161
> lsf.o240577 1.3254e+01   8*a6203
> lsf.o240578 1.8848e+01   8*a6274

13 vs. 18 seconds.  Better, but still dodgy.

> prun (quadrics)
> lsf.o240602 1.6168e+01   4*a2108+4*a2109
> lsf.o240603 1.6746e+01   4*a2110+4*a2111
> lsf.o240604 1.6371e+01   4*a2108+4*a2109
> lsf.o240606 1.6867e+01   4*a2110+4*a2111

Nice and consistent, as you mentioned.  And I assume your notation here means 
that it's across 2 nodes.

> ompirun -mca btl self,openib
> lsf.o240776 3.1463e+01   8*a6203
> lsf.o240777 3.0418e+01   8*a6264
> lsf.o240778 3.1394e+01   8*a6203
> lsf.o240779 3.5111e+01   8*a6274

Also much more consistent, though slower than sm.  Probably because all messages 
are equally penalized by going out to the HCA and back.

> ompirun -mca self,sm,openib
> lsf.o240851 1.3848e+01   8*a6244
> lsf.o240852 1.7362e+01   8*a6237
> lsf.o240854 1.3266e+01   8*a6204
> lsf.o240855 1.3423e+01   8*a6276

This should be pretty much the same as sm,self, because openib shouldn't be 
used for any of the communication (i.e., Open MPI should determine that sm is 
the "best" transport between all the peers and silently discard openib).

> ompirun
> lsf.o240858 1.4415e+01   8*a6244
> lsf.o240859 1.5092e+01   8*a6237
> lsf.o240860 1.3940e+01   8*a6204
> lsf.o240861 1.5521e+01   8*a6276
> lsf.o240903 1.3273e+01   8*a6234
> lsf.o240904 1.6700e+01   8*a6206
> lsf.o240905 1.4636e+01   8*a6269
> lsf.o240906 1.5056e+01   8*a6234

Strange that this would be different from the first (default) run.  It should be 
functionally equivalent to --mca self,sm,openib.

> ompirun -mca self,tcp
> lsf.o240948 1.8504e+01   8*a6234
> lsf.o240949 1.9317e+01   8*a6207
> lsf.o240950 1.8964e+01   8*a6234
> lsf.o240951 2.0764e+01   8*a6207

Variation here isn't too bad.  The slowdown (compared to sm) is likely because 
messages go through the TCP loopback stack instead of going "directly" to the 
peer in shared memory.

...a quick look through the rest seems to indicate that they're more or less 
consistent with what you showed above.

Your later mail says:

> Following up on this, I have partial resolution.  The primary culprit
> appears to be stale files in a ramdisk non-uniformly distributed across
> the sockets, thus interacting poorly with NUMA.  The slow runs
> invariably have high numa_miss and numa_foreign counts.  I still have
> trouble making it explain up to a factor of 10 degradation, but it
> certainly explains a factor of 3.

Try playing with Open MPI's process affinity options, like --bind-to-core (see 
mpirun(1)). This may help prevent some OS jitter in moving processes around, 
and allow pinning memory locally to each NUMA node.
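
For example, something along these lines (option names as in the 1.4-series 
mpirun(1); ./allgather_bench is a placeholder for your benchmark, and numastat 
is just to watch the counters you mentioned):

  ompirun --bind-to-core --report-bindings -mca btl self,sm ./allgather_bench
  numastat    # per-node numa_hit / numa_miss / numa_foreign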

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

