Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-19 Thread Oliver Sang
On Wed, Nov 18, 2020 at 10:17:27AM -0800, Dan Williams wrote:
> On Wed, Nov 18, 2020 at 5:51 AM Jan Kara  wrote:
> >
> > On Mon 16-11-20 19:35:31, John Hubbard wrote:
> > >
> > > On 11/16/20 6:48 PM, kernel test robot wrote:
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -45.0% regression of 
> > > > phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > > >
> > >
> > > That's a huge slowdown...
> > >
> > > >
> > > > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: 
> > > > page->hpage_pinned_refcount: exact pin counts for huge pages")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > ...but that commit happened in April, 2020. Surely if this were a serious
> > > issue we would have some other indication...is this worth following up
> > > on?? I'm inclined to ignore it, honestly.
> >
> > Why this was detected so late is a fair question although it doesn't quite
> > invalidate the report...
> 
> I don't know what specifically happened in this case, perhaps someone
> from the lkp team can comment? 

- some extra phoronix test suites are being enabled/fixed gradually, so our
coverage keeps improving
- we scan kernel releases within the year to baseline the performance, which may
trigger a bisection if a release has regressed and not recovered.

With this continuous effort, 0-day ci can detect the changes on mainline.

> However, the myth / contention that
> "surely someone else would have noticed by now" is why the lkp project
> was launched. Kernels regressed without much complaint and it wasn't
> until much later in the process, around the time enterprise distros
> rebased to new kernels, did end users start filing performance loss
> regression reports. Given -stable kernel releases, 6-7 months is still
> faster than many end user upgrade cycles to new kernel baselines.


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-18 Thread John Hubbard

On 11/18/20 10:17 AM, Dan Williams wrote:
> On Wed, Nov 18, 2020 at 5:51 AM Jan Kara  wrote:
> >
> > On Mon 16-11-20 19:35:31, John Hubbard wrote:
> > >
> > > On 11/16/20 6:48 PM, kernel test robot wrote:
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -45.0% regression of
> > > > phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > > >
> > >
> > > That's a huge slowdown...
> > >
> > > >
> > > > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup:
> > > > page->hpage_pinned_refcount: exact pin counts for huge pages")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > ...but that commit happened in April, 2020. Surely if this were a serious
> > > issue we would have some other indication...is this worth following up
> > > on?? I'm inclined to ignore it, honestly.
> >
> > Why this was detected so late is a fair question although it doesn't quite
> > invalidate the report...
>
> I don't know what specifically happened in this case, perhaps someone
> from the lkp team can comment? However, the myth / contention that
> "surely someone else would have noticed by now" is why the lkp project
> was launched. Kernels regressed without much complaint and it wasn't
> until much later in the process, around the time enterprise distros
> rebased to new kernels, did end users start filing performance loss
> regression reports. Given -stable kernel releases, 6-7 months is still
> faster than many end user upgrade cycles to new kernel baselines.


I see, thanks for explaining. I'll take a peek, then.

thanks,
--
John Hubbard
NVIDIA


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-18 Thread Dan Williams
On Wed, Nov 18, 2020 at 5:51 AM Jan Kara  wrote:
>
> On Mon 16-11-20 19:35:31, John Hubbard wrote:
> >
> > On 11/16/20 6:48 PM, kernel test robot wrote:
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -45.0% regression of 
> > > phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > >
> >
> > That's a huge slowdown...
> >
> > >
> > > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: 
> > > page->hpage_pinned_refcount: exact pin counts for huge pages")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > ...but that commit happened in April, 2020. Surely if this were a serious
> > issue we would have some other indication...is this worth following up
> > on?? I'm inclined to ignore it, honestly.
>
> Why this was detected so late is a fair question although it doesn't quite
> invalidate the report...

I don't know what specifically happened in this case, perhaps someone
from the lkp team can comment? However, the myth / contention that
"surely someone else would have noticed by now" is why the lkp project
was launched. Kernels regressed without much complaint and it wasn't
until much later in the process, around the time enterprise distros
rebased to new kernels, did end users start filing performance loss
regression reports. Given -stable kernel releases, 6-7 months is still
faster than many end user upgrade cycles to new kernel baselines.


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-18 Thread Jan Kara
On Mon 16-11-20 19:35:31, John Hubbard wrote:
> 
> On 11/16/20 6:48 PM, kernel test robot wrote:
> > 
> > Greeting,
> > 
> > FYI, we noticed a -45.0% regression of 
> > phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > 
> 
> That's a huge slowdown...
> 
> > 
> > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: 
> > page->hpage_pinned_refcount: exact pin counts for huge pages")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> ...but that commit happened in April, 2020. Surely if this were a serious
> issue we would have some other indication...is this worth following up
> on?? I'm inclined to ignore it, honestly.

Why this was detected so late is a fair question although it doesn't quite
invalidate the report... The NPB benchmark appears to be a supercomputing
benchmark so conceivably it could be heavily using THPs. The question is
why it would be a heavy user of pinning as well but even that is imaginable
considering that MPI is in use etc.

So maybe it is worth trying to reproduce this because heavy THP + pinning
users might be indeed rare and only those would show regressions in THP
pinning performance...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-16 Thread John Hubbard



On 11/16/20 6:48 PM, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed a -45.0% regression of
> phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
>

That's a huge slowdown...

>
> commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup:
> page->hpage_pinned_refcount: exact pin counts for huge pages")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

...but that commit happened in April, 2020. Surely if this were a serious
issue we would have some other indication...is this worth following up
on?? I'm inclined to ignore it, honestly.

thanks,
--
John Hubbard
NVIDIA



in testcase: phoronix-test-suite
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
with following parameters:

test: npb-1.3.1
option_a: FT.A
cpufreq_governor: performance
ucode: 0x5002f01

test-description: The Phoronix Test Suite is the most comprehensive testing and 
benchmarking platform available that provides an extensible framework for which 
new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/



If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


Details are as below:
-->


To reproduce:

 git clone https://github.com/intel/lkp-tests.git
 cd lkp-tests
 bin/lkp install job.yaml  # job file is attached in this email
 bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/FT.A/debian-x86_64-phoronix/lkp-csl-2sp8/npb-1.3.1/phoronix-test-suite/0x5002f01

commit:
  3faa52c03f ("mm/gup: track FOLL_PIN pages")
  47e29d32af ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")

3faa52c03f440d1b 47e29d32afba11b13efb51f0315
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          1:4          -25%            :4     kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
         %stddev     %change         %stddev
             \          |                \
      4585 ±  2%     -45.0%       2522        phoronix-test-suite.npb.FT.A.total_mop_s
      1223 ±  4%     +40.2%       1714        phoronix-test-suite.time.percent_of_cpu_this_job_got
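
As a sanity check on the %change column, the figures can be recomputed from
the two means shown. This is only a minimal sketch in Python: the helper name
percent_change is made up for illustration and is not part of lkp-tests, and
it assumes %change is the relative difference between the parent-commit mean
and the tested-commit mean.

```python
def percent_change(base: float, new: float) -> float:
    """Relative change from base to new, in percent (hypothetical helper)."""
    return (new - base) / base * 100.0


if __name__ == "__main__":
    # Means as printed in the report: 3faa52c03f (base) vs 47e29d32af (new).
    mop_s = percent_change(4585, 2522)    # total_mop_s
    cpu_pct = percent_change(1223, 1714)  # percent_of_cpu_this_job_got
    print(f"total_mop_s: {mop_s:+.1f}%")
    print(f"percent_of_cpu_this_job_got: {cpu_pct:+.1f}%")
```

The throughput figure reproduces the reported -45.0%. The CPU figure comes out
at about +40.1% rather than the reported +40.2%, presumably because the means
printed in the table are already rounded.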


 
[ASCII plot of phoronix-test-suite.npb.FT.A.total_mop_s per sample, y-axis
from 2000 to 6500; the plot did not survive archiving. The bisect-bad samples
are drawn at or below the 2500 tick.]

[*] bisect-good sample
[O] bisect-bad  sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Oliver Sang