Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jed Brown
On Thu, 15 Jul 2010 13:03:31 -0400, Jeff Squyres  wrote:
> Given the oversubscription on the existing HT links, could contention
> account for the difference?  (I have no idea how HT's contention
> management works) Meaning: if the stars line up in a given run, you
> could end up with very little/no contention and you get good
> bandwidth.  But if there's a bit of jitter, you could end up with
> quite a bit of contention that ends up cascading into a bunch of
> additional delay.

What contention?  Many sockets needing to access memory on another
socket via HT links?  Then yes, perhaps that could be a lot.  As shown
in the diagram, it's pretty non-uniform, and if, say, sockets 0, 1, and
3 all found memory on socket 0 (with socket 2, say, having local
memory), then there are two ways for messages to get from 3 to 0 (via 1
or via 2).  I don't know if there is hardware support to re-route to
avoid contention, but if not, then socket 3 could be sharing the 1->0
HT link (which has a maximum throughput of 8 GB/s, so 4 GB/s would be
available per socket, provided it was still operating at peak).  Note
that this 4 GB/s is not much more than the roughly 3.6 GB/s per socket
you get from splitting the 10.7 GB/s memory bandwidth three ways.

> I fail to see how that could add up to 70-80 (or more) seconds of
> difference -- 13 secs vs. 90+ seconds (and more), though...  70-80
> seconds sounds like an IO delay -- perhaps paging due to the ramdisk
> or somesuch...?  That's a SWAG.

This problem should have had a resident set far too small to cause
paging, but these were very short jobs, so a relatively small amount of
paging would cause a big performance hit.  We have also seen up to a
factor of 10 variability in longer jobs (e.g. 1 hour for a "fast" run)
with larger working sets, but once the pages are faulted, this kernel
(2.6.18 from RHEL5) won't migrate them, so even if you eventually swap
out all the ramdisk, pages faulted before and after will be mapped to
all sorts of inconvenient places.

But I don't have any systematic testing with a guaranteed clean
ramdisk, and I'm not going to overanalyze the extra factors when there's
an understood factor of 3 hanging in the way.  I'll give an update if
there is any news.

Jed


Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jeff Squyres
Given the oversubscription on the existing HT links, could contention account 
for the difference?  (I have no idea how HT's contention management works)  
Meaning: if the stars line up in a given run, you could end up with very 
little/no contention and you get good bandwidth.  But if there's a bit of 
jitter, you could end up with quite a bit of contention that ends up cascading 
into a bunch of additional delay.

I fail to see how that could add up to 70-80 (or more) seconds of difference -- 
13 secs vs. 90+ seconds (and more), though...  70-80 seconds sounds like an IO 
delay -- perhaps paging due to the ramdisk or somesuch...?  That's a SWAG.



On Jul 15, 2010, at 10:40 AM, Jed Brown wrote:

> On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres  wrote:
> > Per my other disclaimer, I'm trolling through my disastrous inbox and
> > finding some orphaned / never-answered emails.  Sorry for the delay!
> 
> No problem, I should have followed up on this with further explanation.
> 
> > Just to be clear -- you're running 8 procs locally on an 8 core node,
> > right?
> 
> These are actually 4-socket quad-core nodes, so there are 16 cores
> available, but we are only running on 8, -npersocket 2 -bind-to-socket.
> This was a greatly simplified case, but is still sufficient to show the
> variability.  It tends to be somewhat worse if we use all cores of a
> node.
> 
> >   (Cisco is an Intel partner -- I don't follow the AMD line
> > much) So this should all be local communication with no external
> > network involved, right?
> 
> Yes, this was the greatly simplified case, contained entirely within a single node.
> 
> > > lsf.o240562 killed   8*a6200
> > > lsf.o240563 9.2110e+01   8*a6200
> > > lsf.o240564 1.5638e+01   8*a6237
> > > lsf.o240565 1.3873e+01   8*a6228
> >
> > Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!
> 
> Yes, and the "killed" means it wasn't done after 120 seconds.  This
> factor of 10 is about the worst we see, but of course very surprising.
> 
> > Nice and consistent, as you mentioned.  And I assume your notation
> > here means that it's across 2 nodes.
> 
> Yes, the Quadrics nodes are 2-socket dual core, so 8 procs need two
> nodes.
> 
> The rest of your observations are consistent with my understanding.  We
> identified two other issues, neither of which accounts for a factor of
> 10, but which account for at least a factor of 3.
> 
> 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
>ensure that it was wiped before the next task ran.  So if you got a
>node after some job that left stinky feces there, you could
>effectively only have 16 GB (before the old stuff would be swapped
>out).  More importantly, the physical pages backing the ramdisk may
>not be uniformly distributed across the sockets, and rather than
>preemptively swap out those old ramdisk pages, the kernel would find
>    a page on some other socket instead of locally (this could be
>    confirmed, for example, by watching the numa_foreign and numa_miss
>    counts with numastat).  Then when you went to use that memory
>(typically in a bandwidth-limited application), it was easy to have 3
>sockets all waiting on one bus, thus taking a factor of 3+
>performance hit despite a resident set much less than 50% of the
>available memory.  I have a rather complete analysis of this in case
>someone is interested.  Note that this can affect programs with
>static or dynamic allocation (the kernel looks for local pages when
>    you fault it, not when you allocate it); the only way I know of to
>circumvent the problem is to allocate memory with libnuma
>(e.g. numa_alloc_local) which will fail if local memory isn't
>available (instead of returning and subsequently faulting remote
>pages).
> 
> 2. The memory bandwidth is 16-18% different between sockets, with
>sockets 0,3 being slow and sockets 1,2 having much faster available
>bandwidth.  This is fully reproducible and acknowledged by
>    Sun/Oracle; their response to an early inquiry:
> 
>  http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf
> 
>I am not completely happy with this explanation because the issue
>persists even with full software prefetch, packed SSE2, and
>    non-temporal stores, as long as the working set does not fit within
>(per-socket) L3.  Note that the software prefetch allows for several
>hundred cycles of latency, so the extra hop for snooping shouldn't be
>a problem.  If the working set fits within L3, then all sockets are
>the same speed (and of course much faster due to improved bandwidth).
>Some disassembly here:
> 
>  http://gist.github.com/476942
> 
>    The three with prefetch and movntpd run within 2% of each other; the
>other is much faster within cache and much slower when it breaks out
>of cache (obviously).  The performance numbers are higher than with
>the reference implementation 

Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jed Brown
On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres  wrote:
> Per my other disclaimer, I'm trolling through my disastrous inbox and
> finding some orphaned / never-answered emails.  Sorry for the delay!

No problem, I should have followed up on this with further explanation.

> Just to be clear -- you're running 8 procs locally on an 8 core node,
> right?

These are actually 4-socket quad-core nodes, so there are 16 cores
available, but we are only running on 8, -npersocket 2 -bind-to-socket.
This was a greatly simplified case, but is still sufficient to show the
variability.  It tends to be somewhat worse if we use all cores of a
node.

>   (Cisco is an Intel partner -- I don't follow the AMD line
> much) So this should all be local communication with no external
> network involved, right?

Yes, this was the greatly simplified case, contained entirely within a single node.

> > lsf.o240562 killed   8*a6200
> > lsf.o240563 9.2110e+01   8*a6200
> > lsf.o240564 1.5638e+01   8*a6237
> > lsf.o240565 1.3873e+01   8*a6228
>
> Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!

Yes, and the "killed" means it wasn't done after 120 seconds.  This
factor of 10 is about the worst we see, but of course very surprising.

> Nice and consistent, as you mentioned.  And I assume your notation
> here means that it's across 2 nodes.

Yes, the Quadrics nodes are 2-socket dual core, so 8 procs need two
nodes.

The rest of your observations are consistent with my understanding.  We
identified two other issues, neither of which accounts for a factor of
10, but which account for at least a factor of 3.

1. The administrators mounted a 16 GB ramdisk on /scratch, but did not
   ensure that it was wiped before the next task ran.  So if you got a
   node after some job that left stinky feces there, you could
   effectively only have 16 GB (before the old stuff would be swapped
   out).  More importantly, the physical pages backing the ramdisk may
   not be uniformly distributed across the sockets, and rather than
   preemptively swap out those old ramdisk pages, the kernel would find
   a page on some other socket instead of locally (this could be
   confirmed, for example, by watching the numa_foreign and numa_miss
   counts with numastat).  Then when you went to use that memory
   (typically in a bandwidth-limited application), it was easy to have 3
   sockets all waiting on one bus, thus taking a factor of 3+
   performance hit despite a resident set much less than 50% of the
   available memory.  I have a rather complete analysis of this in case
   someone is interested.  Note that this can affect programs with
   static or dynamic allocation (the kernel looks for local pages when
   you fault the memory, not when you allocate it); the only way I know
   of to circumvent the problem is to allocate memory with libnuma
   (e.g. numa_alloc_local), which will fail if local memory isn't
   available (instead of returning and subsequently faulting remote
   pages).  A minimal sketch of this libnuma approach follows after
   these notes.

2. The memory bandwidth is 16-18% different between sockets, with
   sockets 0,3 being slow and sockets 1,2 having much faster available
   bandwidth.  This is fully reproducible and acknowledged by
   Sun/Oracle; their response to an early inquiry:

 http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf

   I am not completely happy with this explanation because the issue
   persists even with full software prefetch, packed SSE2, and
   non-temporal stores, as long as the working set does not fit within
   the (per-socket) L3.  Note that the software prefetch allows for several
   hundred cycles of latency, so the extra hop for snooping shouldn't be
   a problem.  If the working set fits within L3, then all sockets are
   the same speed (and of course much faster due to improved bandwidth).
   Some disassembly here:

 http://gist.github.com/476942

   The three with prefetch and movntpd run within 2% of each other; the
   other is much faster within cache and much slower when it breaks out
   of cache (obviously).  The performance numbers are higher than with
   the reference implementation (quoted in Sun/Oracle's response), but
   the socket-to-socket asymmetry remains (each run was bound with
   taskset to one of the four sockets):

 Triad:   5842.5814   0.0329   0.0329   0.0330
 Triad:   6843.4206   0.0281   0.0281   0.0282
 Triad:   6827.6390   0.0282   0.0281   0.0283
 Triad:   5862.0601   0.0329   0.0328   0.0331

   This is almost exclusively due to the prefetching; the packed
   arithmetic is almost completely inconsequential when waiting on
   memory bandwidth.  A sketch of this kind of triad kernel follows
   below.
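
For reference, here is a minimal sketch of the numa_alloc_local()
approach mentioned in point 1.  The buffer size and the fill loop are
only illustrative; compile with something like gcc -O2 -std=c99 and
link with -lnuma.

  #include <stdio.h>
  #include <numa.h>                        /* libnuma */

  int main(void)
  {
      size_t i, n = 1 << 25;               /* 32 Mi doubles, illustrative size */
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not available on this system\n");
          return 1;
      }
      /* Request pages on the local node up front; as described above,
       * this fails if local memory is not available, instead of
       * silently faulting pages on a remote socket the way malloc()
       * plus first-touch can. */
      double *a = numa_alloc_local(n * sizeof(double));
      if (!a) {
          fprintf(stderr, "numa_alloc_local failed: no local memory\n");
          return 1;
      }
      for (i = 0; i < n; i++) a[i] = 1.0;  /* fault the pages */
      /* ... bandwidth-sensitive work on a ... */
      numa_free(a, n * sizeof(double));
      return 0;
  }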
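
And here is a sketch of the kind of triad kernel referred to in point
2, with packed SSE2 arithmetic, software prefetch, and non-temporal
(movntpd) stores.  The prefetch distance is only illustrative; the
actual code and its disassembly are in the gist linked above.  Compile
with something like gcc -O2 -msse2 -std=c99.

  #include <emmintrin.h>   /* SSE2 intrinsics; also pulls in the prefetch/fence ones */

  /* a[i] = b[i] + scalar*c[i]; assumes n is even and the arrays are
   * 16-byte aligned. */
  void triad(double *restrict a, const double *restrict b,
             const double *restrict c, double scalar, long n)
  {
      const __m128d s = _mm_set1_pd(scalar);
      long i;
      for (i = 0; i < n; i += 2) {
          /* Prefetch 512 bytes ahead; prefetch hints never fault, so
           * running past the end of the arrays is harmless. */
          _mm_prefetch((const char *)&b[i + 64], _MM_HINT_T0);
          _mm_prefetch((const char *)&c[i + 64], _MM_HINT_T0);
          __m128d x = _mm_add_pd(_mm_load_pd(&b[i]),
                                 _mm_mul_pd(s, _mm_load_pd(&c[i])));
          _mm_stream_pd(&a[i], x);         /* movntpd: store around the cache */
      }
      _mm_sfence();                        /* flush the streaming stores */
  }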

Jed


Re: [OMPI users] Highly variable performance

2010-07-15 Thread Jeff Squyres
Per my other disclaimer, I'm trolling through my disastrous inbox and finding 
some orphaned / never-answered emails.  Sorry for the delay!


On Jun 2, 2010, at 4:36 PM, Jed Brown wrote:

> The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), 
> connected
> with QDR InfiniBand.  The benchmark loops over
> 
>   
> MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);
> 
> with nlocal=1 (80 KiB messages) 1 times, so it normally runs in
> a few seconds.  

Just to be clear -- you're running 8 procs locally on an 8 core node, right?  
(Cisco is an Intel partner -- I don't follow the AMD line much)  So this should 
all be local communication with no external network involved, right?

> #  JOB   TIME (s)  HOST
> 
> ompirun
> lsf.o240562 killed   8*a6200
> lsf.o240563 9.2110e+01   8*a6200
> lsf.o240564 1.5638e+01   8*a6237
> lsf.o240565 1.3873e+01   8*a6228

Am I reading that right that it's 92 seconds vs. 13 seconds?  Woof!

> ompirun -mca btl self,sm
> lsf.o240574 1.6916e+01   8*a6237
> lsf.o240575 1.7456e+01   8*a6200
> lsf.o240576 1.4183e+01   8*a6161
> lsf.o240577 1.3254e+01   8*a6203
> lsf.o240578 1.8848e+01   8*a6274

13 vs. 18 seconds.  Better, but still dodgy.

> prun (quadrics)
> lsf.o240602 1.6168e+01   4*a2108+4*a2109
> lsf.o240603 1.6746e+01   4*a2110+4*a2111
> lsf.o240604 1.6371e+01   4*a2108+4*a2109
> lsf.o240606 1.6867e+01   4*a2110+4*a2111

Nice and consistent, as you mentioned.  And I assume your notation here means 
that it's across 2 nodes.

> ompirun -mca btl self,openib
> lsf.o240776 3.1463e+01   8*a6203
> lsf.o240777 3.0418e+01   8*a6264
> lsf.o240778 3.1394e+01   8*a6203
> lsf.o240779 3.5111e+01   8*a6274

Also much better.  Probably because all messages are equally penalized by going 
out to the HCA and back.

> ompirun -mca self,sm,openib
> lsf.o240851 1.3848e+01   8*a6244
> lsf.o240852 1.7362e+01   8*a6237
> lsf.o240854 1.3266e+01   8*a6204
> lsf.o240855 1.3423e+01   8*a6276

This should be pretty much the same as sm,self, because openib shouldn't be 
used for any of the communication (i.e., Open MPI should determine that sm is 
the "best" transport between all the peers and silently discard openib).

> ompirun
> lsf.o240858 1.4415e+01   8*a6244
> lsf.o240859 1.5092e+01   8*a6237
> lsf.o240860 1.3940e+01   8*a6204
> lsf.o240861 1.5521e+01   8*a6276
> lsf.o240903 1.3273e+01   8*a6234
> lsf.o240904 1.6700e+01   8*a6206
> lsf.o240905 1.4636e+01   8*a6269
> lsf.o240906 1.5056e+01   8*a6234

Strange that this would be different from the first one.  It should be 
functionally equivalent to --mca self,sm,openib.

> ompirun -mca self,tcp
> lsf.o240948 1.8504e+01   8*a6234
> lsf.o240949 1.9317e+01   8*a6207
> lsf.o240950 1.8964e+01   8*a6234
> lsf.o240951 2.0764e+01   8*a6207

Variation here isn't too bad.  The slowdown here (compared to sm) is likely 
because it's going through the TCP loopback stack vs. "directly" going to the 
peer in shared memory.

...a quick look through the rest seems to indicate that they're more-or-less 
consistent with what you showed above.

Your later mail says:

> Following up on this, I have partial resolution.  The primary culprit
> appears to be stale files in a ramdisk non-uniformly distributed across
> the sockets, thus interacting poorly with NUMA.  The slow runs
> invariably have high numa_miss and numa_foreign counts.  I still have
> trouble making it explain up to a factor of 10 degradation, but it
> certainly explains a factor of 3.

Try playing with Open MPI's process affinity options, like --bind-to-core (see 
mpirun(1)). This may help prevent some OS jitter in moving processes around, 
and allow pinning memory locally to each NUMA node.
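
For example, something along these lines (these are the Open MPI
1.4-era option spellings from mpirun(1); the benchmark binary name is
just a placeholder) binds each of the 8 processes to a core and reports
the resulting bindings:

  mpirun -np 8 -npersocket 2 --bind-to-core --report-bindings ./allgather_bench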

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Highly variable performance

2010-06-23 Thread Jed Brown
Following up on this, I have partial resolution.  The primary culprit
appears to be stale files in a ramdisk non-uniformly distributed across
the sockets, thus interacting poorly with NUMA.  The slow runs
invariably have high numa_miss and numa_foreign counts.  I still have
trouble making it explain up to a factor of 10 degradation, but it
certainly explains a factor of 3.

Jed


[OMPI users] Highly variable performance

2010-06-02 Thread Jed Brown
I'm investigating some very large performance variation and have reduced
the issue to a very simple MPI_Allgather benchmark.  The variability
does not occur for serial jobs, but it does occur within single nodes.
I'm not at all convinced that this is an Open MPI-specific issue (in
fact the same variance is observed with MVAPICH2, which is an available,
but not "recommended", implementation on that cluster), but perhaps
someone here can suggest steps to track down the issue.

The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected
with QDR InfiniBand.  The benchmark loops over

  
MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);

with nlocal=1 (80 KiB messages) 1 times, so it normally runs in
a few seconds.  Open MPI 1.4.1 was compiled with gcc-4.3.3, and this
code was built with mpicc -O2.  All submissions used 8 processes; timing
and host results are presented below in chronological order.  The jobs
were run with 2-minute time limits (to get through the queue easily);
jobs are marked "killed" if they went over this amount of time.  Jobs
were usually submitted in batches of 4.  The scheduler is LSF-7.0.
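
For reference, a minimal sketch of the kind of loop described above
(the element count and repetition count here are placeholders, not the
exact values used for the runs below):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int size, rank, i;
      int nlocal = 10000, nreps = 1000;    /* placeholder values */

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double *localdata  = malloc(nlocal * sizeof(double));
      double *globaldata = malloc((size_t)size * nlocal * sizeof(double));
      for (i = 0; i < nlocal; i++) localdata[i] = rank;

      double t0 = MPI_Wtime();
      for (i = 0; i < nreps; i++)
          MPI_Allgather(localdata, nlocal, MPI_DOUBLE,
                        globaldata, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);
      double t1 = MPI_Wtime();

      if (rank == 0) printf("%e\n", t1 - t0);
      free(localdata);
      free(globaldata);
      MPI_Finalize();
      return 0;
  }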

The HOST field indicates the node that was actually used: a6* nodes are
of the type described above, while a2* nodes are much older (2-socket
Opteron 2220, dual core, 2.8 GHz) and use a Quadrics network; the
timings are very reliable on these older nodes.  When the issue first
came up, I was inclined to blame memory bandwidth issues with other
jobs, but the variance is still visible when our job runs on exactly
one full node, it is present regardless of affinity settings, and
events that don't require communication are well balanced in both
small and large runs.

I then suspected possible contention between transport layers; ompi_info
gives

  MCA btl: parameter "btl" (current value: "self,sm,openib,tcp", data source: environment)

so the timings below show many variations of restricting these values.
Unfortunately, the variance is large for all combinations, but I find it
notable that -mca btl self,openib is reliably much slower than self,tcp.
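
As an aside, the same parameter can also be set through the environment
(the "data source: environment" above suggests the cluster
configuration sets an OMPI_MCA_btl variable); something like the
following, with a placeholder binary name, is equivalent to passing
-mca btl self,sm on the command line:

  export OMPI_MCA_btl=self,sm
  ompirun ./benchmark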

Note that some nodes are used in multiple runs, yet there is no strict
pattern of some nodes being "fast" and others "slow"; for instance,
a6200 is very slow (6x and more) in the first set, then normal in the
subsequent test.  Nevertheless, when the same node appears in
temporally nearby tests, there seems to be a correlation (though there
is certainly not enough data here to establish that with confidence).

As a final observation, I think the performance in all cases is
unreasonably low, since the same test on a 2-socket Opteron 2356 (quad
core, 2.3 GHz) unrelated to the cluster always takes between 9.75 and
10.0 seconds, about 30% faster than the fastest observations on the
cluster nodes, which have faster cores and memory.

#  JOB   TIME (s)  HOST

ompirun
lsf.o240562 killed   8*a6200
lsf.o240563 9.2110e+01   8*a6200
lsf.o240564 1.5638e+01   8*a6237
lsf.o240565 1.3873e+01   8*a6228

ompirun -mca btl self,sm
lsf.o240574 1.6916e+01   8*a6237
lsf.o240575 1.7456e+01   8*a6200
lsf.o240576 1.4183e+01   8*a6161
lsf.o240577 1.3254e+01   8*a6203
lsf.o240578 1.8848e+01   8*a6274

prun (quadrics)
lsf.o240602 1.6168e+01   4*a2108+4*a2109
lsf.o240603 1.6746e+01   4*a2110+4*a2111
lsf.o240604 1.6371e+01   4*a2108+4*a2109
lsf.o240606 1.6867e+01   4*a2110+4*a2111

ompirun -mca btl self,openib
lsf.o240776 3.1463e+01   8*a6203
lsf.o240777 3.0418e+01   8*a6264
lsf.o240778 3.1394e+01   8*a6203
lsf.o240779 3.5111e+01   8*a6274

ompirun -mca self,sm,openib
lsf.o240851 1.3848e+01   8*a6244
lsf.o240852 1.7362e+01   8*a6237
lsf.o240854 1.3266e+01   8*a6204
lsf.o240855 1.3423e+01   8*a6276

ompirun
lsf.o240858 1.4415e+01   8*a6244
lsf.o240859 1.5092e+01   8*a6237
lsf.o240860 1.3940e+01   8*a6204
lsf.o240861 1.5521e+01   8*a6276
lsf.o240903 1.3273e+01   8*a6234
lsf.o240904 1.6700e+01   8*a6206
lsf.o240905 1.4636e+01   8*a6269
lsf.o240906 1.5056e+01   8*a6234

ompirun -mca self,tcp
lsf.o240948 1.8504e+01   8*a6234
lsf.o240949 1.9317e+01   8*a6207
lsf.o240950 1.8964e+01   8*a6234
lsf.o240951 2.0764e+01   8*a6207

ompirun -mca btl self,sm,openib
lsf.o240998 1.3265e+01   8*a6269
lsf.o240999 1.2884e+01   8*a6269
lsf.o241000 1.3092e+01   8*a6234
lsf.o241001 1.3044e+01   8*a6269

ompirun -mca btl self,openib
lsf.o241013 3.1572e+01   8*a6229
lsf.o241014 3.0552e+01   8*a6234
lsf.o241015 3.1813e+01   8*a6229
lsf.o241016 3.2514e+01   8*a6252

ompirun -mca btl self,sm
lsf.o241044 1.3417e+01   8*a6234
lsf.o241045 killed   8*a6232
lsf.o241046 1.4626e+01   8*a6269
lsf.o241047 1.5060e+01   8*a6253
lsf.o241166 1.3179e+01   8*a6228
lsf.o241167 2.7759e+01   8*a6232
lsf.o241168 1.4224e+01   8*a6234
lsf.o241169 1.4825e+01   8*a6228
lsf.o241446 1.4896e+01   8*a6204
lsf.o241447 1.4960e+01   8*a6228
lsf.o241448 1.7622e+01   8*a6222
lsf.o241449 1.5112e+01   8*a6204

ompirun -mca btl self,tcp
lsf.o241556 1.9135e+01   8*a6204
lsf.o241557 2.4365e+01   8*a6261