On 11/02/2018 04:36 PM, Daniel Borkmann wrote:
Hi Rong,

On 11/02/2018 03:14 AM, kernel test robot wrote:
Greetings,

FYI, we noticed a -4.0% regression of will-it-scale.per_process_ops due to commit:

commit: fd978bf7fd312581a7ca454a991f0ffb34c4204b ("bpf: Add reference tracking to verifier")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 80 threads Skylake with 64G memory
with following parameters:

        nr_task: 100%
        mode: process
        test: mmap1
        cpufreq_governor: performance

Hmm, so the test cases you are running are these ones:

   https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap1.c
   https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap2.c
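
For reference, the hot path in these testcases boils down to a tight anonymous
mmap()/munmap() loop; roughly paraphrased from tests/mmap1.c (not verbatim, and
the harness signature may differ between versions):

  /* Map and unmap 128MB of anonymous memory as fast as possible; the
   * will-it-scale harness counts iterations across N parallel
   * processes or threads. */
  #include <assert.h>
  #include <sys/mman.h>

  #define MEMSIZE (128UL * 1024 * 1024)

  void testcase(unsigned long long *iterations, unsigned long nr)
  {
      (void)nr;   /* nr = number of parallel tasks; unused here */

      while (1) {
          char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          assert(c != MAP_FAILED);
          munmap(c, MEMSIZE);
          (*iterations)++;
      }
  }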

The commit from Joe referenced above only adds a feature to the (eBPF) verifier.
Looking through the will-it-scale test suite, it seems neither cBPF nor eBPF is
in use, and even if it were the former (e.g. via seccomp BPF), that would still
have no effect here since seccomp filters do not load through bpf(2). Meaning,
something would have to be using eBPF here, but then it's also unclear right now
how that could even remotely affect mmap() test performance by -4%. Hm, are you
certain it's not a false bisection? If so, what else is loading eBPF on your
machine in parallel when you run the tests?
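
To make the load-path distinction concrete, here is a minimal userspace sketch;
illustrative only, assuming a 64-bit box, with error handling omitted:

  #include <linux/bpf.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* cBPF path: a seccomp filter is installed via prctl(2)/seccomp(2)
   * and validated by seccomp's own checker; it never passes through
   * the eBPF verifier that fd978bf7fd extends. */
  static int install_seccomp_filter(void)
  {
      struct sock_filter insns[] = {
          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
      };
      struct sock_fprog prog = {
          .len    = sizeof(insns) / sizeof(insns[0]),
          .filter = insns,
      };

      prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); /* needed w/o CAP_SYS_ADMIN */
      return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
  }

  /* eBPF path: only programs loaded through bpf(2) BPF_PROG_LOAD are
   * run through the verifier. */
  static int load_ebpf_prog(void)
  {
      struct bpf_insn insns[] = {
          { BPF_ALU64 | BPF_MOV | BPF_K, 0, 0, 0, 0 }, /* r0 = 0 */
          { BPF_JMP | BPF_EXIT, 0, 0, 0, 0 },          /* exit   */
      };
      union bpf_attr attr = {
          .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
          .insn_cnt  = 2,
          .insns     = (unsigned long)insns,
          .license   = (unsigned long)"GPL",
      };

      return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }

So the mmap()/munmap() syscalls under test share no code path with the verifier
change unless something on the box is actively loading eBPF programs.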

Please accept my apologies for taking your time; it's a false bisection.
Something strange happened, and we're trying to figure out the root cause.

Best Regards,
Rong Chen


Thanks,
Daniel

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a threads-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

In addition to that, the commit also has a significant impact on the following tests:

+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_process_ops -3.8% regression |
| test machine     | 80 threads Skylake with 64G memory                            |
| test parameters  | cpufreq_governor=performance                                  |
|                  | mode=process                                                  |
|                  | nr_task=100%                                                  |
|                  | test=mmap2                                                    |
+------------------+---------------------------------------------------------------+


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

         git clone https://github.com/intel/lkp-tests.git
         cd lkp-tests
         bin/lkp install job.yaml  # job file is attached in this email
         bin/lkp run     job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap1/will-it-scale

commit:
   84dbf35073 ("bpf: Macrofy stack state copy")
   fd978bf7fd ("bpf: Add reference tracking to verifier")

84dbf3507349696b fd978bf7fd312581a7ca454a99
---------------- --------------------------
          %stddev     %change         %stddev
              \          |                \
      16811            -4.0%      16140        will-it-scale.per_process_ops
    1344946            -4.0%    1291230        will-it-scale.workload
     107.75 ± 38%    +252.4%     379.75 ± 93%  cpuidle.POLL.usage
     121.70 ± 18%     +18.9%     144.70 ±  4%  sched_debug.cfs_rq:/.exec_clock.stddev
       4933            +2.0%       5031        proc-vmstat.nr_inactive_anon
       4933            +2.0%       5031        proc-vmstat.nr_zone_inactive_anon
       9874            +9.0%      10765 ±  7%  slabinfo.proc_inode_cache.active_objs
       9874            +9.0%      10765 ±  7%  slabinfo.proc_inode_cache.num_objs
      12248 ± 50%     +52.2%      18640 ±  2%  numa-meminfo.node0.Inactive
      33943 ±  8%     +16.2%      39453 ±  6%  numa-meminfo.node0.SReclaimable
      91549 ±  9%      -9.9%      82530 ±  7%  numa-meminfo.node1.Slab
      18091 ± 15%     +29.2%      23382 ± 17%  numa-vmstat.node0
       3027 ± 52%     +52.6%       4620 ±  3%  numa-vmstat.node0.nr_inactive_anon
       8485 ±  8%     +16.2%       9862 ±  6%  numa-vmstat.node0.nr_slab_reclaimable
       3027 ± 52%     +52.6%       4620 ±  3%  numa-vmstat.node0.nr_zone_inactive_anon
    1.4e+12            -2.5%  1.364e+12        perf-stat.branch-instructions
      41.42            +0.7       42.15        perf-stat.cache-miss-rate%
  2.166e+10            -2.1%   2.12e+10        perf-stat.cache-references
      12.16            +2.7%      12.49        perf-stat.cpi
  1.741e+12            -2.8%  1.692e+12        perf-stat.dTLB-loads
       0.00 ±  3%      +0.0        0.00 ±  9%  perf-stat.dTLB-store-miss-rate%
  5.713e+11            -3.9%   5.49e+11        perf-stat.dTLB-stores
  6.103e+12            -2.6%  5.943e+12        perf-stat.instructions
       0.08            -2.6%       0.08        perf-stat.ipc
  1.954e+09            -1.8%  1.919e+09        perf-stat.node-load-misses
    4538060            +1.4%    4602862        perf-stat.path-length
      49.62            -0.5       49.14        perf-profile.calltrace.cycles-pp.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
      47.64            -0.5       47.17        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
      47.49            -0.5       47.02        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
      49.99            -0.5       49.53        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
      49.96            -0.5       49.51        perf-profile.calltrace.cycles-pp.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
      48.02            -0.4       47.58        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
       1.41            -0.0        1.37        perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
      47.73            +0.4       48.11        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region
      47.85            +0.4       48.25        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap
      48.28            +0.4       48.68        perf-profile.calltrace.cycles-pp.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
      48.23            +0.4       48.63        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff
      48.96            +0.4       49.41        perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
      49.11            +0.5       49.56        perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
      49.24            +0.5       49.70        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
      49.25            +0.5       49.72        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
      49.62            -0.5       49.15        perf-profile.children.cycles-pp.do_munmap
      49.99            -0.5       49.53        perf-profile.children.cycles-pp.__x64_sys_munmap
      49.97            -0.5       49.51        perf-profile.children.cycles-pp.vm_munmap
       0.51 ±  2%      -0.0        0.46        perf-profile.children.cycles-pp.___might_sleep
       1.16            -0.0        1.11        perf-profile.children.cycles-pp.unmap_vmas
       1.15            -0.0        1.10        perf-profile.children.cycles-pp.unmap_page_range
       1.41            -0.0        1.37        perf-profile.children.cycles-pp.unmap_region
       0.32 ±  2%      +0.0        0.34 ±  2%  perf-profile.children.cycles-pp.up_write
       0.32 ±  2%      +0.0        0.34        perf-profile.children.cycles-pp.vm_area_alloc
       0.29            +0.0        0.32        perf-profile.children.cycles-pp.kmem_cache_alloc
      48.28            +0.4       48.68        perf-profile.children.cycles-pp.__vm_enough_memory
      48.96            +0.4       49.41        perf-profile.children.cycles-pp.mmap_region
      49.11            +0.5       49.56        perf-profile.children.cycles-pp.do_mmap
      49.25            +0.5       49.71        perf-profile.children.cycles-pp.vm_mmap_pgoff
      49.25            +0.5       49.72        perf-profile.children.cycles-pp.ksys_mmap_pgoff
       0.47 ±  3%      -0.0        0.43        perf-profile.self.cycles-pp.___might_sleep
       0.32 ±  3%      +0.0        0.34 ±  2%  perf-profile.self.cycles-pp.up_write
       0.27            +0.0        0.30        perf-profile.self.cycles-pp.kmem_cache_alloc
       0.49            +0.0        0.53        perf-profile.self.cycles-pp.percpu_counter_add_batch
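
A side note on the profile shape itself: on both kernels roughly 95% of cycles
sit in native_queued_spin_lock_slowpath under percpu_counter_add_batch, reached
from __vm_enough_memory() on the mmap side and from do_munmap() on the munmap
side; both update the global vm_committed_as counter. Roughly paraphrased from
lib/percpu_counter.c of this era (check the actual tree before relying on it):

  /* Updates stay in a per-cpu counter until they exceed +/- batch, then
   * fold into the shared count under fbc->lock. A 128MB map/unmap is
   * ~32k pages, far beyond any realistic batch, so every iteration on
   * all 80 threads takes the spinlock -- hence the slowpath above. */
  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
  {
      s64 count;

      preempt_disable();
      count = __this_cpu_read(*fbc->counters) + amount;
      if (count >= batch || count <= -batch) {
          unsigned long flags;

          raw_spin_lock_irqsave(&fbc->lock, flags);
          fbc->count += count;
          __this_cpu_sub(*fbc->counters, count - amount);
          raw_spin_unlock_irqrestore(&fbc->lock, flags);
      } else {
          this_cpu_add(*fbc->counters, amount);
      }
      preempt_enable();
  }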


                          will-it-scale.per_process_ops

   18000 +-+-----------------------------------------------------------------+
         |                                                                   |
   17500 +-+   +.+                                                           |
         |+.+++  :            +.++++.+++           ++++.++++.++              |
         |        :++. +      :        :          :            :             |
   17000 +-+      +   + ++.++:          :  ++.+++ :            ++.+ ++.   +. |
         |                   +          +.+      +                 +   +++  +|
   16500 +-+                                                                 |
         |              O  OOOO OOOO O O                                     |
   16000 +-+           O O            O O O                                  |
         |                                                                   |
         O     O  OOO O                                                      |
   15500 +O+OOO  O                                                           |
         |                                                                   |
   15000 +-+-----------------------------------------------------------------+

                             will-it-scale.workload

   1.42e+06 +-+--------------------------------------------------------------+
    1.4e+06 +-+   ++                                                         |
            |++.++ :                 ++.                   +++.+             |
   1.38e+06 +-+     :           +.+++   ++          ++++.++    :             |
   1.36e+06 +-+     +.+++++.    :         :         :           :+           |
            |               ++++          ++.+++++.+            + ++.+++++.++|
   1.34e+06 +-+                                                              |
   1.32e+06 +-+                                                              |
    1.3e+06 +-+               O                                              |
            |            OO OO OO OOOOO  OOO                                 |
   1.28e+06 +-+                         O                                    |
   1.26e+06 +-+       O                                                      |
            O     O O  OO                                                    |
   1.24e+06 +OO OO O                                                         |
   1.22e+06 +-+--------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad  sample

***************************************************************************************************
lkp-skl-2sp2: 80 threads Skylake with 64G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap2/will-it-scale

commit:
   84dbf35073 ("bpf: Macrofy stack state copy")
   fd978bf7fd ("bpf: Add reference tracking to verifier")

84dbf3507349696b fd978bf7fd312581a7ca454a99
---------------- --------------------------
          %stddev     %change         %stddev
              \          |                \
      16832            -3.8%      16186        will-it-scale.per_process_ops
    1346634            -3.8%    1294984        will-it-scale.workload
     390809 ± 21%     +51.6%     592424 ± 27%  cpuidle.C1.time
       6897            +2.7%       7085        proc-vmstat.nr_mapped
     936.00 ±  7%     +15.6%       1082 ±  5%  slabinfo.Acpi-ParseExt.active_objs
     936.00 ±  7%     +15.6%       1082 ±  5%  slabinfo.Acpi-ParseExt.num_objs
     968.00 ±  9%     +27.5%       1233 ± 16%  slabinfo.pool_workqueue.active_objs
     968.00 ±  9%     +29.7%       1255 ± 16%  slabinfo.pool_workqueue.num_objs
       8430           -14.1%       7244 ±  2%  numa-meminfo.node0.KernelStack
       4283 ± 14%     -22.4%       3325 ± 10%  numa-meminfo.node0.PageTables
      73929 ±  3%     -10.6%      66061 ±  6%  numa-meminfo.node0.SUnreclaim
       5569 ±  2%     +21.0%       6738 ±  3%  numa-meminfo.node1.KernelStack
      55085 ±  5%     +17.5%      64739 ±  5%  numa-meminfo.node1.SUnreclaim
      81155 ±  6%     +16.2%      94292 ±  7%  numa-meminfo.node1.Slab
     230.00          -100.0%       0.00        numa-vmstat.node0.nr_active_file
     100.25 ±  3%     -88.8%      11.25 ±173%  numa-vmstat.node0.nr_inactive_file
       8431           -14.1%       7245 ±  2%  numa-vmstat.node0.nr_kernel_stack
       1071 ± 14%     -22.4%     831.25 ± 10%  numa-vmstat.node0.nr_page_table_pages
      18482 ±  3%     -10.6%      16515 ±  6%  numa-vmstat.node0.nr_slab_unreclaimable
     230.00          -100.0%       0.00        numa-vmstat.node0.nr_zone_active_file
     100.25 ±  3%     -88.8%      11.25 ±173%  numa-vmstat.node0.nr_zone_inactive_file
       5569 ±  2%     +21.0%       6738 ±  3%  numa-vmstat.node1.nr_kernel_stack
       2778 ±  3%     +28.4%       3567 ± 16%  numa-vmstat.node1.nr_mapped
      13771 ±  5%     +17.5%      16184 ±  5%  numa-vmstat.node1.nr_slab_unreclaimable
  1.506e+12            -2.5%  1.468e+12        perf-stat.branch-instructions
      41.41            +0.8       42.20        perf-stat.cache-miss-rate%
  2.165e+10            -1.7%  2.129e+10        perf-stat.cache-references
      11.25            +2.8%      11.57        perf-stat.cpi
  1.891e+12            -2.8%  1.838e+12        perf-stat.dTLB-loads
  6.543e+11            -3.7%    6.3e+11        perf-stat.dTLB-stores
  6.591e+12            -2.6%  6.419e+12        perf-stat.instructions
       0.09            -2.7%       0.09        perf-stat.ipc
  1.967e+09            -1.3%  1.941e+09        perf-stat.node-load-misses
    4894750            +1.3%    4956596        perf-stat.path-length
      40.37 ± 12%     -16.2%      33.81 ±  7%  sched_debug.cfs_rq:/.load_avg.stddev
       0.05 ±  2%     +18.7%       0.06 ±  3%  sched_debug.cfs_rq:/.nr_running.stddev
       6.37 ± 40%     -50.2%       3.17 ± 32%  sched_debug.cfs_rq:/.removed.load_avg.avg
      31.64 ± 18%     -28.5%      22.63 ± 16%  sched_debug.cfs_rq:/.removed.load_avg.stddev
     293.89 ± 40%     -50.1%     146.61 ± 32%  sched_debug.cfs_rq:/.removed.runnable_sum.avg
       1459 ± 18%     -28.3%       1045 ± 16%  sched_debug.cfs_rq:/.removed.runnable_sum.stddev
       2.46 ± 43%     -60.9%       0.96 ± 66%  sched_debug.cfs_rq:/.removed.util_avg.avg
      12.42 ± 26%     -46.5%       6.64 ± 59%  sched_debug.cfs_rq:/.removed.util_avg.stddev
     385.92 ±  6%     +12.8%     435.46 ±  2%  sched_debug.cpu.nr_switches.min
     -14.21           -31.4%      -9.75        sched_debug.cpu.nr_uninterruptible.min
      47.54            -0.2       47.31        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
      47.67            -0.2       47.45        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
      48.04            -0.2       47.86        perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
      99.36            -0.0       99.34        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
       1.47            +0.0        1.51        perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
      94.77            -0.3       94.52        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      95.04            -0.2       94.81        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      95.77            -0.2       95.60        perf-profile.children.cycles-pp.percpu_counter_add_batch
      49.72            -0.1       49.58        perf-profile.children.cycles-pp.do_munmap
       0.53 ±  2%      -0.1        0.47        perf-profile.children.cycles-pp.___might_sleep
       0.30 ±  2%      +0.0        0.33        perf-profile.children.cycles-pp.perf_event_mmap
       0.30 ±  3%      +0.0        0.33 ±  2%  perf-profile.children.cycles-pp.vm_area_alloc
       0.33 ±  2%      +0.0        0.36 ±  2%  perf-profile.children.cycles-pp.up_write
       1.48            +0.0        1.51        perf-profile.children.cycles-pp.unmap_region
      94.77            -0.3       94.52        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
       0.48 ±  2%      -0.0        0.44        perf-profile.self.cycles-pp.___might_sleep
       0.33 ±  2%      +0.0        0.36 ±  2%  perf-profile.self.cycles-pp.up_write
       0.53            +0.0        0.57        perf-profile.self.cycles-pp.unmap_page_range
       0.47            +0.0        0.52 ±  2%  perf-profile.self.cycles-pp.percpu_counter_add_batch





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen

