Re: [OMPI users] Highly variable performance
On Thu, 15 Jul 2010 13:03:31 -0400, Jeff Squyres wrote:

> Given the oversubscription on the existing HT links, could contention
> account for the difference? (I have no idea how HT's contention
> management works) Meaning: if the stars line up in a given run, you
> could end up with very little/no contention and you get good
> bandwidth. But if there's a bit of jitter, you could end up with
> quite a bit of contention that ends up cascading into a bunch of
> additional delay.

What contention? Many sockets needing to access memory on another socket via HT links? Then yes, perhaps that could be a lot. As shown in the diagram, the topology is pretty non-uniform, and if, say, sockets 0, 1, and 3 all found memory on socket 0 (say socket 2 had local memory), then there are two ways for messages to get from 3 to 0 (via 1 or via 2). I don't know whether there is hardware support to re-route to avoid contention, but if not, then socket 3 could be sharing the 1->0 HT link (which has a maximum throughput of 8 GB/s, therefore 4 GB/s would be available per socket, provided it was still operating at peak). Note that this 4 GB/s is comparable to splitting the 10.7 GB/s memory bandwidth three ways (about 3.6 GB/s per socket).

> I fail to see how that could add up to 70-80 (or more) seconds of
> difference -- 13 secs vs. 90+ seconds (and more), though... 70-80
> seconds sounds like an IO delay -- perhaps paging due to the ramdisk
> or somesuch...? That's a SWAG.

This problem should have had a resident set significantly smaller than anything that would cause paging, but these were very short jobs, so a relatively small amount of paging would cause a big performance hit. We have also seen up to a factor of 10 variability in longer jobs (e.g. 1 hour for a "fast" run) with larger working sets, but once the pages are faulted, this kernel (2.6.18 from RHEL5) won't migrate them around, so even if you eventually swap out all the ramdisk, pages faulted before and after will be mapped to all sorts of inconvenient places. But I don't have any systematic testing with a guaranteed clean ramdisk, and I'm not going to overanalyze the extra factors when there's an understood factor of 3 standing in the way. I'll give an update if there is any news.

Jed
Re: [OMPI users] Highly variable performance
Given the oversubscription on the existing HT links, could contention account for the difference? (I have no idea how HT's contention management works.) Meaning: if the stars line up in a given run, you could end up with very little/no contention and you get good bandwidth. But if there's a bit of jitter, you could end up with quite a bit of contention that ends up cascading into a bunch of additional delay.

I fail to see how that could add up to 70-80 (or more) seconds of difference -- 13 secs vs. 90+ seconds (and more), though... 70-80 seconds sounds like an IO delay -- perhaps paging due to the ramdisk or somesuch...? That's a SWAG.

On Jul 15, 2010, at 10:40 AM, Jed Brown wrote:

> On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres wrote:
> > Per my other disclaimer, I'm trolling through my disastrous inbox and finding some orphaned / never-answered emails. Sorry for the delay!
>
> No problem, I should have followed up on this with further explanation.
>
> > Just to be clear -- you're running 8 procs locally on an 8 core node, right?
>
> These are actually 4-socket quad-core nodes, so there are 16 cores available, but we are only running on 8, -npersocket 2 -bind-to-socket. This was a greatly simplified case, but is still sufficient to show the variability. It tends to be somewhat worse if we use all cores of a node.
>
> > (Cisco is an Intel partner -- I don't follow the AMD line much) So this should all be local communication with no external network involved, right?
>
> Yes, this was the greatly simplified case, contained entirely within a single node.
>
> > > lsf.o240562 killed     8*a6200
> > > lsf.o240563 9.2110e+01 8*a6200
> > > lsf.o240564 1.5638e+01 8*a6237
> > > lsf.o240565 1.3873e+01 8*a6228
> >
> > Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!
>
> Yes, and the "killed" means it wasn't done after 120 seconds. This factor of 10 is about the worst we see, but of course very surprising.
>
> > Nice and consistent, as you mentioned. And I assume your notation here means that it's across 2 nodes.
>
> Yes, the Quadrics nodes are 2-socket dual core, so 8 procs need two nodes.
>
> The rest of your observations are consistent with my understanding. We identified two other issues, neither of which accounts for a factor of 10, but which account for at least a factor of 3.
>
> 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not ensure that it was wiped before the next task ran. So if you got a node after some job that left stinky feces there, you could effectively have only 16 GB (before the old stuff would be swapped out). More importantly, the physical pages backing the ramdisk may not be uniformly distributed across the sockets, and rather than preemptively swap out those old ramdisk pages, the kernel would find a page on some other socket instead of locally (this could be confirmed, for example, by watching the numa_foreign and numa_miss counts with numastat). Then when you went to use that memory (typically in a bandwidth-limited application), it was easy to have 3 sockets all waiting on one bus, thus taking a factor of 3+ performance hit despite a resident set much less than 50% of the available memory. I have a rather complete analysis of this in case someone is interested. Note that this can affect programs with static or dynamic allocation (the kernel looks for local pages when you fault a page, not when you allocate it); the only way I know of to circumvent the problem is to allocate memory with libnuma (e.g. numa_alloc_local), which will fail if local memory isn't available instead of returning and subsequently faulting remote pages.
>
> 2. The memory bandwidth is 16-18% different between sockets, with sockets 0,3 being slow and sockets 1,2 having much faster available bandwidth. This is fully reproducible and acknowledged by Sun/Oracle; their response to an early inquiry:
>
>    http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf
>
>    I am not completely happy with this explanation because the issue persists even with full software prefetch, packed SSE2, and non-temporal stores, as long as the working set does not fit within (per-socket) L3. Note that the software prefetch allows for several hundred cycles of latency, so the extra hop for snooping shouldn't be a problem. If the working set fits within L3, then all sockets are the same speed (and of course much faster due to improved bandwidth). Some disassembly here:
>
>    http://gist.github.com/476942
>
>    The three variants with prefetch and movntpd run within 2% of each other; the other is much faster within cache and much slower when it breaks out of cache (obviously). The performance numbers are higher than with the reference implementation
Re: [OMPI users] Highly variable performance
On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres wrote:

> Per my other disclaimer, I'm trolling through my disastrous inbox and
> finding some orphaned / never-answered emails. Sorry for the delay!

No problem, I should have followed up on this with further explanation.

> Just to be clear -- you're running 8 procs locally on an 8 core node,
> right?

These are actually 4-socket quad-core nodes, so there are 16 cores available, but we are only running on 8, -npersocket 2 -bind-to-socket. This was a greatly simplified case, but is still sufficient to show the variability. It tends to be somewhat worse if we use all cores of a node.

> (Cisco is an Intel partner -- I don't follow the AMD line much) So
> this should all be local communication with no external network
> involved, right?

Yes, this was the greatly simplified case, contained entirely within a single node.

> > lsf.o240562 killed     8*a6200
> > lsf.o240563 9.2110e+01 8*a6200
> > lsf.o240564 1.5638e+01 8*a6237
> > lsf.o240565 1.3873e+01 8*a6228
>
> Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!

Yes, and the "killed" means it wasn't done after 120 seconds. This factor of 10 is about the worst we see, but of course very surprising.

> Nice and consistent, as you mentioned. And I assume your notation
> here means that it's across 2 nodes.

Yes, the Quadrics nodes are 2-socket dual core, so 8 procs need two nodes.

The rest of your observations are consistent with my understanding. We identified two other issues, neither of which accounts for a factor of 10, but which account for at least a factor of 3.

1. The administrators mounted a 16 GB ramdisk on /scratch, but did not ensure that it was wiped before the next task ran. So if you got a node after some job that left stinky feces there, you could effectively have only 16 GB (before the old stuff would be swapped out). More importantly, the physical pages backing the ramdisk may not be uniformly distributed across the sockets, and rather than preemptively swap out those old ramdisk pages, the kernel would find a page on some other socket instead of locally (this could be confirmed, for example, by watching the numa_foreign and numa_miss counts with numastat). Then when you went to use that memory (typically in a bandwidth-limited application), it was easy to have 3 sockets all waiting on one bus, thus taking a factor of 3+ performance hit despite a resident set much less than 50% of the available memory. I have a rather complete analysis of this in case someone is interested. Note that this can affect programs with static or dynamic allocation (the kernel looks for local pages when you fault a page, not when you allocate it); the only way I know of to circumvent the problem is to allocate memory with libnuma (e.g. numa_alloc_local), which will fail if local memory isn't available instead of returning and subsequently faulting remote pages (a minimal sketch follows below).

2. The memory bandwidth is 16-18% different between sockets, with sockets 0,3 being slow and sockets 1,2 having much faster available bandwidth. This is fully reproducible and acknowledged by Sun/Oracle; their response to an early inquiry:

   http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf

   I am not completely happy with this explanation because the issue persists even with full software prefetch, packed SSE2, and non-temporal stores, as long as the working set does not fit within (per-socket) L3. Note that the software prefetch allows for several hundred cycles of latency, so the extra hop for snooping shouldn't be a problem. If the working set fits within L3, then all sockets are the same speed (and of course much faster due to improved bandwidth). Some disassembly here:

   http://gist.github.com/476942

   The three variants with prefetch and movntpd run within 2% of each other; the other is much faster within cache and much slower when it breaks out of cache (obviously). The performance numbers are higher than with the reference implementation (quoted in Sun/Oracle's response), but (run with taskset to each of the four sockets):

   Triad: 5842.5814 0.0329 0.0329 0.0330
   Triad: 6843.4206 0.0281 0.0281 0.0282
   Triad: 6827.6390 0.0282 0.0281 0.0283
   Triad: 5862.0601 0.0329 0.0328 0.0331

   This is almost exclusively due to the prefetching; the packed arithmetic is almost completely inconsequential when waiting on memory bandwidth.

Jed
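For anyone who wants to try the libnuma workaround mentioned in item 1, here is a minimal sketch (not from the original message); the array size is a placeholder, and the fail-instead-of-faulting-remote-pages behavior is as described above. Build with -lnuma.

  /* Minimal sketch: allocate strictly node-local memory with libnuma.
   * Per the discussion above, numa_alloc_local() should return NULL when
   * local memory cannot be obtained, rather than silently falling back to
   * pages on a remote socket.  Build: gcc alloc_local.c -o alloc_local -lnuma */
  #include <stdio.h>
  #include <stdlib.h>
  #include <numa.h>

  int main(void)
  {
      size_t n = 1 << 26;            /* 64 Mi doubles, ~512 MiB; placeholder size */
      size_t i;
      double *a;

      if (numa_available() < 0) {
          fprintf(stderr, "libnuma reports NUMA is not available\n");
          return 1;
      }
      a = numa_alloc_local(n * sizeof(double));
      if (!a) {
          fprintf(stderr, "no local memory available on this socket\n");
          return 1;
      }
      for (i = 0; i < n; i++) a[i] = 1.0;   /* touch the pages; they are local */
      printf("allocated and touched %zu MiB locally\n", (n * sizeof(double)) >> 20);
      numa_free(a, n * sizeof(double));
      return 0;
  }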
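Similarly, for item 2, a rough illustrative reconstruction of a triad kernel using software prefetch, packed SSE2, and non-temporal (movntpd) stores. This is not the code behind the gist above, just a sketch of the technique; the prefetch distance and the alignment assumptions are guesses.

  /* Illustrative triad a[i] = b[i] + scalar*c[i] with software prefetch,
   * packed SSE2 arithmetic, and non-temporal (movntpd) stores.
   * Assumes a, b, c are 16-byte aligned and n is a multiple of 2. */
  #include <emmintrin.h>   /* SSE2 intrinsics; pulls in SSE prefetch as well */
  #include <stddef.h>

  void triad_nt(double *a, const double *b, const double *c,
                size_t n, double scalar)
  {
      const __m128d s = _mm_set1_pd(scalar);
      size_t i;
      for (i = 0; i < n; i += 2) {
          /* Prefetch 64 doubles (512 bytes) ahead; a guess at a distance that
           * tolerates a few hundred cycles of memory latency. */
          _mm_prefetch((const char *)&b[i + 64], _MM_HINT_NTA);
          _mm_prefetch((const char *)&c[i + 64], _MM_HINT_NTA);
          __m128d x = _mm_add_pd(_mm_load_pd(&b[i]),
                                 _mm_mul_pd(s, _mm_load_pd(&c[i])));
          _mm_stream_pd(&a[i], x);   /* movntpd: store bypassing the cache */
      }
      _mm_sfence();                  /* order the streaming stores */
  }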
Re: [OMPI users] Highly variable performance
Per my other disclaimer, I'm trolling through my disastrous inbox and finding some orphaned / never-answered emails. Sorry for the delay!

On Jun 2, 2010, at 4:36 PM, Jed Brown wrote:

> The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected
> with QDR InfiniBand. The benchmark loops over
>
> MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);
>
> with nlocal=1 (80 KiB messages) 1 times, so it normally runs in
> a few seconds.

Just to be clear -- you're running 8 procs locally on an 8 core node, right?

(Cisco is an Intel partner -- I don't follow the AMD line much.) So this should all be local communication with no external network involved, right?

> # JOB         TIME (s)    HOST
>
> ompirun
> lsf.o240562   killed      8*a6200
> lsf.o240563   9.2110e+01  8*a6200
> lsf.o240564   1.5638e+01  8*a6237
> lsf.o240565   1.3873e+01  8*a6228

Am I reading that right that it's 92 seconds vs. 13 seconds? Woof!

> ompirun -mca btl self,sm
> lsf.o240574   1.6916e+01  8*a6237
> lsf.o240575   1.7456e+01  8*a6200
> lsf.o240576   1.4183e+01  8*a6161
> lsf.o240577   1.3254e+01  8*a6203
> lsf.o240578   1.8848e+01  8*a6274

13 vs. 18 seconds. Better, but still dodgy.

> prun (quadrics)
> lsf.o240602   1.6168e+01  4*a2108+4*a2109
> lsf.o240603   1.6746e+01  4*a2110+4*a2111
> lsf.o240604   1.6371e+01  4*a2108+4*a2109
> lsf.o240606   1.6867e+01  4*a2110+4*a2111

Nice and consistent, as you mentioned. And I assume your notation here means that it's across 2 nodes.

> ompirun -mca btl self,openib
> lsf.o240776   3.1463e+01  8*a6203
> lsf.o240777   3.0418e+01  8*a6264
> lsf.o240778   3.1394e+01  8*a6203
> lsf.o240779   3.5111e+01  8*a6274

Also much better. Probably because all messages are equally penalized by going out to the HCA and back.

> ompirun -mca self,sm,openib
> lsf.o240851   1.3848e+01  8*a6244
> lsf.o240852   1.7362e+01  8*a6237
> lsf.o240854   1.3266e+01  8*a6204
> lsf.o240855   1.3423e+01  8*a6276

This should be pretty much the same as sm,self, because openib shouldn't be used for any of the communication (i.e., Open MPI should determine that sm is the "best" transport between all the peers and silently discard openib).

> ompirun
> lsf.o240858   1.4415e+01  8*a6244
> lsf.o240859   1.5092e+01  8*a6237
> lsf.o240860   1.3940e+01  8*a6204
> lsf.o240861   1.5521e+01  8*a6276
> lsf.o240903   1.3273e+01  8*a6234
> lsf.o240904   1.6700e+01  8*a6206
> lsf.o240905   1.4636e+01  8*a6269
> lsf.o240906   1.5056e+01  8*a6234

Strange that this would be different than the first one. It should be functionally equivalent to --mca self,sm,openib.

> ompirun -mca self,tcp
> lsf.o240948   1.8504e+01  8*a6234
> lsf.o240949   1.9317e+01  8*a6207
> lsf.o240950   1.8964e+01  8*a6234
> lsf.o240951   2.0764e+01  8*a6207

Variation here isn't too bad. The slowdown here (compared to sm) is likely because it's going through the TCP loopback stack vs. "directly" going to the peer in shared memory.

...a quick look through the rest seems to indicate that they're more-or-less consistent with what you showed above.

Your later mail says:

> Following up on this, I have partial resolution. The primary culprit
> appears to be stale files in a ramdisk non-uniformly distributed across
> the sockets, thus interacting poorly with NUMA. The slow runs
> invariably have high numa_miss and numa_foreign counts. I still have
> trouble making it explain up to a factor of 10 degradation, but it
> certainly explains a factor of 3.

Try playing with Open MPI's process affinity options, like --bind-to-core (see mpirun(1)). This may help prevent some OS jitter in moving processes around, and allow pinning memory locally to each NUMA node.

--
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
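As a companion to the binding suggestion above, here is a small hypothetical placement check (not from the thread) that each rank can run to report the core and NUMA node it landed on, which makes it easy to see whether the binding flags took effect. It assumes Linux with glibc's sched_getcpu() and libnuma v2's numa_node_of_cpu(); build with mpicc and -lnuma, and launch with the same binding flags as the benchmark (e.g. -npersocket 2 -bind-to-socket).

  /* Hypothetical per-rank placement check: print MPI rank, current CPU, and NUMA node.
   * Build: mpicc placement.c -o placement -lnuma */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sched.h>      /* sched_getcpu() */
  #include <numa.h>       /* numa_node_of_cpu() */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, cpu, node;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cpu  = sched_getcpu();
      node = (numa_available() < 0) ? -1 : numa_node_of_cpu(cpu);
      printf("rank %d on cpu %d (NUMA node %d)\n", rank, cpu, node);
      MPI_Finalize();
      return 0;
  }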
Re: [OMPI users] Highly variable performance
Following up on this, I have partial resolution. The primary culprit appears to be stale files in a ramdisk that are non-uniformly distributed across the sockets, thus interacting poorly with NUMA. The slow runs invariably have high numa_miss and numa_foreign counts. I still have trouble making it explain up to a factor of 10 degradation, but it certainly explains a factor of 3.

Jed
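For reference, the numa_miss and numa_foreign counters mentioned above can be read with numastat or directly from /sys; here is a hypothetical snippet (not part of the original report) that dumps the per-node counters so they can be compared before and after a run.

  /* Hypothetical helper: dump per-node NUMA counters (numa_hit, numa_miss,
   * numa_foreign, ...) from /sys, the same values reported by numastat(8). */
  #include <stdio.h>

  int main(void)
  {
      char path[128], line[128];
      int node;
      FILE *f;
      for (node = 0; ; node++) {
          snprintf(path, sizeof path, "/sys/devices/system/node/node%d/numastat", node);
          f = fopen(path, "r");
          if (!f) break;                       /* no more nodes */
          printf("node %d:\n", node);
          while (fgets(line, sizeof line, f))
              printf("  %s", line);
          fclose(f);
      }
      return 0;
  }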
[OMPI users] Highly variable performance
I'm investigating some very large performance variation and have reduced the issue to a very simple MPI_Allgather benchmark. The variability does not occur for serial jobs, but it does occur within single nodes. I'm not at all convinced that this is an Open MPI-specific issue (in fact the same variance is observed with MVAPICH2, which is an available, but not "recommended", implementation on that cluster), but perhaps someone here can suggest steps to track down the issue.

The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected with QDR InfiniBand. The benchmark loops over

  MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);

with nlocal=1 (80 KiB messages) 1 times, so it normally runs in a few seconds. Open MPI 1.4.1 was compiled with gcc-4.3.3, and this code was built with mpicc -O2.

All submissions were 8-process; timing and host results are presented below in chronological order. The jobs were run with 2-minute time limits (to get through the queue easily); jobs are marked "killed" if they went over this amount of time. Jobs were usually submitted in batches of 4. The scheduler is LSF-7.0. The HOST field indicates the node that was actually used: a6* nodes are of the type described above, a2* nodes are much older (2-socket Opteron 2220 (dual core, 2.8 GHz)) and use a Quadrics network; the timings are very reliable on these older nodes.

When the issue first came up, I was inclined to blame memory bandwidth issues caused by other jobs, but the variance is still visible when our job runs on exactly a full node, is present regardless of affinity settings, and events that don't require communication are well-balanced in both small and large runs.

I then suspected possible contention between transport layers; ompi_info gives

  MCA btl: parameter "btl" (current value: "self,sm,openib,tcp", data source: environment)

so the timings below show many variations of restricting these values. Unfortunately, the variance is large for all combinations, but I find it notable that -mca btl self,openib is reliably much slower than self,tcp.

Note that some nodes are used in multiple runs, yet there is no strict relationship where some nodes are "fast"; for instance, a6200 is very slow (6x and more) in the first set, then normal in the subsequent test. Nevertheless, when the same node appears in temporally nearby tests, there seems to be a correlation (though there is certainly not enough data here to establish that with confidence).

As a final observation, I think the performance in all cases is unreasonably low, since the same test on a 2-socket Opteron 2356 (quad core, 2.3 GHz) unrelated to the cluster always takes between 9.75 and 10.0 seconds, 30% faster than the fastest observations on the cluster nodes with faster cores and memory.
# JOB         TIME (s)    HOST

ompirun
lsf.o240562   killed      8*a6200
lsf.o240563   9.2110e+01  8*a6200
lsf.o240564   1.5638e+01  8*a6237
lsf.o240565   1.3873e+01  8*a6228

ompirun -mca btl self,sm
lsf.o240574   1.6916e+01  8*a6237
lsf.o240575   1.7456e+01  8*a6200
lsf.o240576   1.4183e+01  8*a6161
lsf.o240577   1.3254e+01  8*a6203
lsf.o240578   1.8848e+01  8*a6274

prun (quadrics)
lsf.o240602   1.6168e+01  4*a2108+4*a2109
lsf.o240603   1.6746e+01  4*a2110+4*a2111
lsf.o240604   1.6371e+01  4*a2108+4*a2109
lsf.o240606   1.6867e+01  4*a2110+4*a2111

ompirun -mca btl self,openib
lsf.o240776   3.1463e+01  8*a6203
lsf.o240777   3.0418e+01  8*a6264
lsf.o240778   3.1394e+01  8*a6203
lsf.o240779   3.5111e+01  8*a6274

ompirun -mca self,sm,openib
lsf.o240851   1.3848e+01  8*a6244
lsf.o240852   1.7362e+01  8*a6237
lsf.o240854   1.3266e+01  8*a6204
lsf.o240855   1.3423e+01  8*a6276

ompirun
lsf.o240858   1.4415e+01  8*a6244
lsf.o240859   1.5092e+01  8*a6237
lsf.o240860   1.3940e+01  8*a6204
lsf.o240861   1.5521e+01  8*a6276
lsf.o240903   1.3273e+01  8*a6234
lsf.o240904   1.6700e+01  8*a6206
lsf.o240905   1.4636e+01  8*a6269
lsf.o240906   1.5056e+01  8*a6234

ompirun -mca self,tcp
lsf.o240948   1.8504e+01  8*a6234
lsf.o240949   1.9317e+01  8*a6207
lsf.o240950   1.8964e+01  8*a6234
lsf.o240951   2.0764e+01  8*a6207

ompirun -mca btl self,sm,openib
lsf.o240998   1.3265e+01  8*a6269
lsf.o240999   1.2884e+01  8*a6269
lsf.o241000   1.3092e+01  8*a6234
lsf.o241001   1.3044e+01  8*a6269

ompirun -mca btl self,openib
lsf.o241013   3.1572e+01  8*a6229
lsf.o241014   3.0552e+01  8*a6234
lsf.o241015   3.1813e+01  8*a6229
lsf.o241016   3.2514e+01  8*a6252

ompirun -mca btl self,sm
lsf.o241044   1.3417e+01  8*a6234
lsf.o241045   killed      8*a6232
lsf.o241046   1.4626e+01  8*a6269
lsf.o241047   1.5060e+01  8*a6253
lsf.o241166   1.3179e+01  8*a6228
lsf.o241167   2.7759e+01  8*a6232
lsf.o241168   1.4224e+01  8*a6234
lsf.o241169   1.4825e+01  8*a6228
lsf.o241446   1.4896e+01  8*a6204
lsf.o241447   1.4960e+01  8*a6228
lsf.o241448   1.7622e+01  8*a6222
lsf.o241449   1.5112e+01  8*a6204

ompirun -mca btl self,tcp
lsf.o241556   1.9135e+01  8*a6204
lsf.o241557   2.4365e+01  8*a6261
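For completeness, here is a minimal, self-contained sketch of an Allgather timing loop of the kind used above; NLOCAL and NITER are placeholders rather than the values from the original runs, and the launch lines in the comment reuse flags that appear elsewhere in the thread.

  /* Minimal sketch of the Allgather timing loop described above.
   * NLOCAL and NITER are placeholders, not the original values.
   * Example launches (flags from the thread):
   *   ompirun -npersocket 2 -bind-to-socket ./allgather_bench
   *   ompirun -mca btl self,sm ./allgather_bench */
  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  #define NLOCAL 10240   /* doubles contributed per rank (placeholder) */
  #define NITER  1000    /* repetitions (placeholder) */

  int main(int argc, char **argv)
  {
      int rank, size, i;
      double *localdata, *globaldata, t0, t1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      localdata  = malloc(NLOCAL * sizeof(double));
      globaldata = malloc((size_t)size * NLOCAL * sizeof(double));
      for (i = 0; i < NLOCAL; i++) localdata[i] = rank;

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < NITER; i++)
          MPI_Allgather(localdata, NLOCAL, MPI_DOUBLE,
                        globaldata, NLOCAL, MPI_DOUBLE, MPI_COMM_WORLD);
      t1 = MPI_Wtime();

      if (rank == 0)
          printf("%d iterations of MPI_Allgather: %.4e s\n", NITER, t1 - t0);

      free(localdata);
      free(globaldata);
      MPI_Finalize();
      return 0;
  }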