------- Comment From mranw...@us.ibm.com 2018-09-02 13:33 EDT-------
Kris Murphy recreated the ATS enabled driver performance issue with 18.04.1 and 
then loaded the test kernel from the bug below and verified that it fixes the 
issue.
Bz 170624 - NV 2282038 performance drop with ATS enabled
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788097

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1788097

Title:
  performance drop with ATS enabled

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Bionic:
  In Progress

Bug description:
  == Comment: #0 - Michael Ranweiler <mranw...@us.ibm.com> - 2018-08-16 09:58:02 ==

  Witherspoon cluster now has ATS enabled with driver 396.42, CUDA
  version 9.2.148. They are running the CORAL benchmark LULESH with and
  without ATS, and they see a significant performance drop with ATS
  enabled.

  ========
  Below is the run with ATS:

  Run completed:
  Problem size = 160
  MPI tasks = 8
  Iteration count = 100
  Final Origin Energy = 1.605234e+09
  Testing Plane 0 of Energy Array on rank 0:
  MaxAbsDiff = 2.384186e-07
  TotalAbsDiff = 5.300015e-07
  MaxRelDiff = 1.631916e-12

  Elapsed time = 153.00 (s)
  Grind time (us/z/c) = 0.37352393 (per dom) (0.046690491 overall)
  FOM = 21417.637 (z/s)

  ========
  Here is the run without ATS:
  Run completed:
  Problem size = 160
  MPI tasks = 8
  Iteration count = 100
  Final Origin Energy = 1.605234e+09
  Testing Plane 0 of Energy Array on rank 0:
  MaxAbsDiff = 2.384186e-07
  TotalAbsDiff = 5.300015e-07
  MaxRelDiff = 1.631916e-12

  
  Elapsed time = 13.27 (s)
  Grind time (us/z/c) = 0.032394027 (per dom) (0.0040492534 overall)
  FOM = 246959.11 (z/s)
  ========

  Using ATS on a single node slows down the OpenACC version more than 10
  times, and for the version with OpenMP 4.5 and managed memory, they
  observe a 2x slowdown.

  Last comment from NVIDIA (Javier Cabezas - 07/29/2018 11:30 AM):
  We think we have found where the issue is.

  This behavior reproduces for any two concurrent processes that create
  CUDA contexts on the GPUs and heavily unmap memory (no need to launch
  any work on the GPUs). When the problem repros, perf shows that most
  of the time is spent in mmio_invalidate. However, this only happens
  when processes register GPUs attached to the same NPU. Thus, if
  process A initializes GPU 0 and/or 1, and process B initializes GPU
  2 and/or 3, we don't see the slowdown. This makes sense, because ATSDs
  on different NPUs are issued independently.

  After some code inspection in npu-dma.c (powerpc backend in the Linux
  kernel), Mark noticed that the problem could be in the use of
  test_and_set_bit_lock in get_mmio_atsd_reg. The implementation of
  test_and_set_bit_lock in powerpc relies on the ldarx/stdcx
  instructions (PPC_LLARX/PPC_STLCX in the snippet below):

  #define DEFINE_TESTOP(fn, op, prefix, postfix, eh)    \
  static __inline__ unsigned long fn(                   \
          unsigned long mask,                           \
          volatile unsigned long *_p)                   \
  {                                                     \
      unsigned long old, t;                             \
      unsigned long *p = (unsigned long *)_p;           \
      __asm__ __volatile__ (                            \
      prefix                                            \
  "1:"    PPC_LLARX(%0,0,%3,eh) "\n"                    \
      stringify_in_c(op) "%1,%0,%2\n"                   \
      PPC405_ERR77(0,%3)                                \
      PPC_STLCX "%1,0,%3\n"                             \
      "bne- 1b\n"                                       \
      postfix                                           \
      : "=&r" (old), "=&r" (t)                          \
      : "r" (mask), "r" (p)                             \
      : "cc", "memory");                                \
      return (old & mask);                              \
  }

  According to the PowerPC manual, ldarx creates a memory reservation
  and a subsequent stdcx instruction from the same processor completes
  an atomic read-modify-write operation. However, the reservation can be
  lost if a different processor executes any store instruction on the
  same address. That's why "bne- 1b" checks whether stdcx was successful
  and jumps back to retry if it was not. Since DEFINE_TESTOP doesn't
  implement any back-off mechanism, two different processors trying to
  get an ATSD register can starve each other.
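
  The retry loop described above can be approximated in portable C11.
  This is only a userspace sketch, not the kernel code: the name
  sketch_test_and_set_bit is hypothetical, and it assumes that a weak
  compare-exchange is a fair stand-in for the ldarx/stdcx pair. Like the
  macro, it attempts the store even when the bit is already set, and it
  retries immediately on failure with no back-off.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of a test-and-set on a single bit.
 * Returns true if the bit was already set (the caller did not get it). */
static bool sketch_test_and_set_bit(unsigned nr, _Atomic unsigned long *word)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_load_explicit(word, memory_order_relaxed);

    /* The weak compare-exchange can fail when another thread writes *word
     * (or spuriously). On failure, old is refreshed and we retry at once,
     * much like the "bne- 1b" branch in the macro. */
    while (!atomic_compare_exchange_weak_explicit(word, &old, old | mask,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        ;

    return (old & mask) != 0;
}
```

  Note that every call performs a store to the shared word, even for a
  bit that is already taken, so concurrent callers keep disturbing each
  other.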

  Mark compiled a custom kernel which surrounds the calls to
  test_and_set_bit_lock in get_mmio_atsd_reg with a spinlock and I
  verified that it solves the issue. These are the execution times for
  LULESH:

  ATS OFF
  Elapsed time         =      16.87 (s)

  ATS ON
  Elapsed time         =     215.56 (s)

  ATS ON + Spinlock
  Elapsed time         =      18.14 (s)

  
  Fixed with the following patch in the powerpc tree:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=9eab9901b015

  == Comment: #1 - Michael Ranweiler <mranw...@us.ibm.com> - 2018-08-20 14:56:52 ==
  This is now in mainline, too:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/powernv/npu-dma.c?id=9eab9901b015f489199105c470de1ffc337cfabb

  It applies to 4.15.0.32-35 with some small fuzz:

  diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
  index 6c8e168e6571..18226895681e 100644
  --- a/arch/powerpc/platforms/powernv/npu-dma.c
  +++ b/arch/powerpc/platforms/powernv/npu-dma.c
  @@ -434,8 +434,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
          int i;
   
          for (i = 0; i < npu->mmio_atsd_count; i++) {
  -               if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
  -                       return i;
  +               if (!test_bit(i, &npu->mmio_atsd_usage))
  +                       if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
  +                               return i;
          }
   
          return -ENOSPC;
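
  The idea behind the patch (a plain read before the atomic
  read-modify-write) can be sketched in userspace C11. This is an
  illustration under assumptions, not the kernel implementation: the
  names sketch_get_slot, sketch_put_slot and SKETCH_NO_SLOT are
  hypothetical, and C11 atomics stand in for test_bit and
  test_and_set_bit_lock.

```c
#include <stdatomic.h>

#define SKETCH_NO_SLOT (-1)  /* hypothetical stand-in for -ENOSPC */

/* Claim the first free bit among the low `count` bits of *usage.
 * `count` must not exceed the number of bits in unsigned long. */
static int sketch_get_slot(_Atomic unsigned long *usage, int count)
{
    for (int i = 0; i < count; i++) {
        unsigned long mask = 1UL << i;

        /* Read-only check first: a busy slot is skipped without any store,
         * so it does not disturb other threads working on the same word. */
        if (atomic_load_explicit(usage, memory_order_relaxed) & mask)
            continue;

        /* The slot looked free, so attempt the atomic read-modify-write.
         * Another thread may still win the race; then we move on. */
        unsigned long old =
            atomic_fetch_or_explicit(usage, mask, memory_order_acquire);
        if (!(old & mask))
            return i;
    }

    return SKETCH_NO_SLOT;
}

/* Release a slot previously returned by sketch_get_slot(). */
static void sketch_put_slot(_Atomic unsigned long *usage, int slot)
{
    atomic_fetch_and_explicit(usage, ~(1UL << slot), memory_order_release);
}
```

  With three slots, three consecutive calls would hand out 0, 1 and 2,
  and a fourth would return SKETCH_NO_SLOT until a slot is released.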

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1788097/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp