------- Comment From mranw...@us.ibm.com 2018-09-02 13:33 EDT -------
Kris Murphy recreated the performance issue with the ATS-enabled driver on 18.04.1, then loaded the test kernel from the bug below and verified that it fixes the issue.

Bz 170624 - NV 2282038 performance drop with ATS enabled
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788097

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1788097

Title:
  performance drop with ATS enabled

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Bionic:
  In Progress

Bug description:

== Comment: #0 - Michael Ranweiler <mranw...@us.ibm.com> - 2018-08-16 09:58:02 ==

The Witherspoon cluster now has ATS enabled with driver 396.42, CUDA version 9.2.148. They are running the CORAL benchmark LULESH with and without ATS, and they see a significant performance drop with ATS enabled.

======== Below is the run with ATS:
Run completed:
   Problem size        =  160
   MPI tasks           =  8
   Iteration count     =  100
   Final Origin Energy = 1.605234e+09
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 2.384186e-07
        TotalAbsDiff = 5.300015e-07
        MaxRelDiff   = 1.631916e-12

Elapsed time         = 153.00 (s)
Grind time (us/z/c)  = 0.37352393 (per dom)  (0.046690491 overall)
FOM                  = 21417.637 (z/s)

======== Here is the run without ATS:
Run completed:
   Problem size        =  160
   MPI tasks           =  8
   Iteration count     =  100
   Final Origin Energy = 1.605234e+09
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 2.384186e-07
        TotalAbsDiff = 5.300015e-07
        MaxRelDiff   = 1.631916e-12

Elapsed time         = 13.27 (s)
Grind time (us/z/c)  = 0.032394027 (per dom)  (0.0040492534 overall)
FOM                  = 246959.11 (z/s)
========

Using ATS on a single node slows down the OpenACC version by more than 10 times, and for the version with OpenMP 4.5 and managed memory, they observe a 2x slowdown.

Last comment from NVIDIA (Javier Cabezas - 07/29/2018 11:30 AM):

We think we have found where the issue is. This behavior reproduces for any two concurrent processes that create CUDA contexts on the GPUs and heavily unmap memory (no need to launch any work on the GPUs). When the problem reproduces, perf shows that most of the time is spent in mmio_invalidate.
However, this only happens when processes register GPUs attached to the same NPU. Thus, if process A initializes GPU 0 and/or 1 and process B initializes GPU 2 and/or 3, we don't see the slowdown. This makes sense, because ATSDs on different NPUs are issued independently.

After some code inspection in npu-dma.c (powerpc backend in the Linux kernel), Mark noticed that the problem could be in the use of test_and_set_bit_lock in get_mmio_atsd_reg. The implementation of test_and_set_bit_lock in powerpc relies on the ldarx/stdcx. instructions (PPC_LLARX/PPC_STLCX in the snippet below):

#define DEFINE_TESTOP(fn, op, prefix, postfix, eh)      \
static __inline__ unsigned long fn(                     \
                unsigned long mask,                     \
                volatile unsigned long *_p)             \
{                                                       \
        unsigned long old, t;                           \
        unsigned long *p = (unsigned long *)_p;         \
        __asm__ __volatile__ (                          \
        prefix                                          \
"1:"    PPC_LLARX(%0,0,%3,eh) "\n"                      \
        stringify_in_c(op) "%1,%0,%2\n"                 \
        PPC405_ERR77(0,%3)                              \
        PPC_STLCX "%1,0,%3\n"                           \
        "bne- 1b\n"                                     \
        postfix                                         \
        : "=&r" (old), "=&r" (t)                        \
        : "r" (mask), "r" (p)                           \
        : "cc", "memory");                              \
        return (old & mask);                            \
}

According to the PowerPC manual, ldarx creates a memory reservation, and a subsequent stdcx. instruction from the same processor ensures an atomic read-modify-write operation. However, the reservation can be lost if a different processor executes any store instruction on the same address. That is why "bne- 1b" checks whether stdcx. was successful and jumps back to retry if it was not. Since DEFINE_TESTOP doesn't implement any back-off mechanism, two different processors trying to get an ATSD register can starve each other.

Mark compiled a custom kernel which surrounds the calls to test_and_set_bit_lock in get_mmio_atsd_reg with a spinlock, and I verified that it solves the issue.
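
To make the contention described above concrete, here is a minimal user-space sketch of the two ways of claiming a slot from a shared bitmask. It is not kernel code: it assumes C11 atomics as a stand-in for the kernel bitops, and SLOT_COUNT, slot_usage, get_slot_always_rmw, get_slot_test_first and put_slot are hypothetical names chosen for illustration only. On POWER, an atomic fetch-or is typically compiled to a load-reserve/store-conditional retry loop similar to the macro above, so the first variant stores to the shared word even for slots that are already busy, while the second variant reads the bit with a plain load first and only attempts the atomic update when the slot looks free. The patch further down this report takes the second approach.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for npu->mmio_atsd_count. */
#define SLOT_COUNT 8

/* Hypothetical stand-in for npu->mmio_atsd_usage; zero means all slots free. */
static atomic_ulong slot_usage;

/*
 * Variant 1: always issue an atomic read-modify-write, even for busy slots.
 * Every call stores to the shared word, which can cancel the reservation
 * held by another processor that is in the middle of its own update.
 */
static int get_slot_always_rmw(void)
{
    for (int i = 0; i < SLOT_COUNT; i++) {
        unsigned long mask = 1UL << i;
        unsigned long old = atomic_fetch_or_explicit(&slot_usage, mask,
                                                     memory_order_acquire);
        if (!(old & mask))
            return i;   /* bit was clear before our update: slot i is ours */
    }
    return -1;          /* no free slot */
}

/*
 * Variant 2: test with a plain load first, and only attempt the atomic
 * update when the slot looks free. Busy slots cost a read, not a store.
 */
static int get_slot_test_first(void)
{
    for (int i = 0; i < SLOT_COUNT; i++) {
        unsigned long mask = 1UL << i;
        if (atomic_load_explicit(&slot_usage, memory_order_relaxed) & mask)
            continue;   /* looks busy: skip without writing */
        unsigned long old = atomic_fetch_or_explicit(&slot_usage, mask,
                                                     memory_order_acquire);
        if (!(old & mask))
            return i;   /* won the race for slot i */
    }
    return -1;          /* no free slot */
}

/* Release a slot previously returned by one of the functions above. */
static void put_slot(int i)
{
    atomic_fetch_and_explicit(&slot_usage, ~(1UL << i), memory_order_release);
}
```

Both variants return the same slot numbers when run from a single thread; the difference only shows up under contention, where the second variant avoids writing to the shared word for slots that are already taken.
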
These are the execution times for LULESH:

ATS OFF              Elapsed time =  16.87 (s)
ATS ON               Elapsed time = 215.56 (s)
ATS ON + Spinlock    Elapsed time =  18.14 (s)

Fixed with the following patch in the powerpc tree:
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=9eab9901b015

== Comment: #1 - Michael Ranweiler <mranw...@us.ibm.com> - 2018-08-20 14:56:52 ==

This is now in mainline, too:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/powernv/npu-dma.c?id=9eab9901b015f489199105c470de1ffc337cfabb

It has some small fuzz to apply to 4.15.0.32-35:

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 6c8e168e6571..18226895681e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -434,8 +434,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
 	int i;
 
 	for (i = 0; i < npu->mmio_atsd_count; i++) {
-		if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
-			return i;
+		if (!test_bit(i, &npu->mmio_atsd_usage))
+			if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+				return i;
 	}
 
 	return -ENOSPC;

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1788097/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp