Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-24 Thread Linus Torvalds
On Tue, Mar 24, 2015 at 8:33 AM, Mel Gorman  wrote:
> On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
>>
>> So it looks like the patch set fixes the remaining regression and in
>> two of the four cases actually improves performance.
>
> \o/

W00t.

> Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
> pte_mkyoung) are already in Andrew's tree. I'm expecting they'll reach
> you before 4.0, assuming nothing else goes pear-shaped.

Yup. Thanks Mel,

  Linus

Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-24 Thread Mel Gorman
On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 12:24:00PM +, Mel Gorman wrote:
> > These are three follow-on patches based on the xfsrepair workload Dave
> > Chinner reported was problematic in 4.0-rc1 due to changes in page table
> > management -- https://lkml.org/lkml/2015/3/1/226.
> > 
> > Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> > read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> > Return the correct value for change_huge_pmd"). It was known that the
> > performance in 3.19 was still better even if it is far less safe. This
> > series aims to restore the performance without compromising on safety.
> > 
> > Dave, you already tested patch 1 on its own but it would be nice to test
> > patches 1+2 and 1+2+3 separately just to be certain.
> 
>                      3.19   4.0-rc4       +p1       +p2       +p3
> mm_migrate_pages  266,750   572,839   558,632   223,706   201,429
> run time            4m54s     7m50s     7m20s     5m07s     4m31s
> 

Excellent, this is in line with predictions and roughly matches what I
was seeing on bare metal + real NUMA + spinning disk instead of KVM +
fake NUMA + SSD.

Editing slightly:

> numa stats from p1+p2:    numa_pte_updates 46109698
> numa stats from p1+p2+p3: numa_pte_updates 24460492

The big drop in PTE updates matches what I expected -- migration
failures should not lead to increased scan rates, which is what patch 3
fixes. I'm also pleased that there was not a drop in performance.
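
To make the intent concrete, here is a toy, standalone model of the idea
(the helper name next_scan_period() and the constants are invented for
illustration; the real logic lives in kernel/sched/fair.c and is more
involved): when most recent migration attempts failed, the scan period
grows so the scanner backs off rather than shrinking and generating even
more PTE updates.

#include <stdio.h>

#define MIN_SCAN_PERIOD_MS   1000
#define MAX_SCAN_PERIOD_MS  60000

static unsigned int next_scan_period(unsigned int period_ms,
                                     unsigned long migrated,
                                     unsigned long failed)
{
        unsigned long attempts = migrated + failed;

        /* No recent data: keep the current period. */
        if (!attempts)
                return period_ms;

        /* Mostly failures: double the period so the scanner backs off. */
        if (failed * 2 > attempts)
                period_ms *= 2;
        else
                period_ms -= period_ms / 4;     /* otherwise scan a bit faster */

        if (period_ms < MIN_SCAN_PERIOD_MS)
                period_ms = MIN_SCAN_PERIOD_MS;
        if (period_ms > MAX_SCAN_PERIOD_MS)
                period_ms = MAX_SCAN_PERIOD_MS;
        return period_ms;
}

int main(void)
{
        /* Failure-heavy sample: period doubles, so fewer PTE updates. */
        printf("%u\n", next_scan_period(4000, 100, 900));      /* 8000 */
        /* Success-heavy sample: period shrinks, scanning speeds up. */
        printf("%u\n", next_scan_period(4000, 900, 100));      /* 3000 */
        return 0;
}

Running it prints 8000 then 3000: a failure-heavy sample doubles the
period (fewer PTE updates per unit time), while a success-heavy sample
shortens it.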

> 
> OK, the summary with all patches applied:
> 
> config                          3.19  4.0-rc1  4.0-rc4  4.0-rc5+
> defaults                       8m08s    9m34s    9m14s     6m57s
> -o ag_stride=-1                4m04s    4m38s    4m11s     4m06s
> -o bhash=101073                6m04s   17m43s    7m35s     6m13s
> -o ag_stride=-1,bhash=101073   4m54s    9m58s    7m50s     4m31s
> 
> So it looks like the patch set fixes the remaining regression and in
> two of the four cases actually improves performance.
> 

\o/

Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
pte_mkyoung) are already in Andrew's tree. I'm expecting they'll reach
you before 4.0, assuming nothing else goes pear-shaped.

> Thanks, Linus and Mel, for tracking this tricky problem down! 
> 

Thanks Dave for persisting with this and collecting the necessary data.
FWIW, I've marked the xfsrepair test case as a "large memory test".
It'll take time before the test machines have historical data for it but
in theory if this regresses again then I should spot it eventually.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-24 Thread Dave Chinner
On Mon, Mar 23, 2015 at 12:24:00PM +, Mel Gorman wrote:
> These are three follow-on patches based on the xfsrepair workload Dave
> Chinner reported was problematic in 4.0-rc1 due to changes in page table
> management -- https://lkml.org/lkml/2015/3/1/226.
> 
> Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> Return the correct value for change_huge_pmd"). It was known that the
> performance in 3.19 was still better even if it is far less safe. This
> series aims to restore the performance without compromising on safety.
> 
> Dave, you already tested patch 1 on its own but it would be nice to test
> patches 1+2 and 1+2+3 separately just to be certain.

                     3.19   4.0-rc4       +p1       +p2       +p3
mm_migrate_pages  266,750   572,839   558,632   223,706   201,429
run time            4m54s     7m50s     7m20s     5m07s     4m31s

numa stats from p1+p2:

numa_hit 8436537
numa_miss 0
numa_foreign 0
numa_interleave 30765
numa_local 8409240
numa_other 27297
numa_pte_updates 46109698
numa_huge_pte_updates 0
numa_hint_faults 44756389
numa_hint_faults_local 11841095
numa_pages_migrated 4868674
pgmigrate_success 4868674
pgmigrate_fail 0


numa stats from p1+p2+p3:

numa_hit 6991596
numa_miss 0
numa_foreign 0
numa_interleave 10336
numa_local 6983144
numa_other 8452
numa_pte_updates 24460492
numa_huge_pte_updates 0
numa_hint_faults 23677262
numa_hint_faults_local 5952273
numa_pages_migrated 3557928
pgmigrate_success 3557928
pgmigrate_fail 0

OK, the summary with all patches applied:

config                          3.19  4.0-rc1  4.0-rc4  4.0-rc5+
defaults                       8m08s    9m34s    9m14s     6m57s
-o ag_stride=-1                4m04s    4m38s    4m11s     4m06s
-o bhash=101073                6m04s   17m43s    7m35s     6m13s
-o ag_stride=-1,bhash=101073   4m54s    9m58s    7m50s     4m31s

So it looks like the patch set fixes the remaining regression and in
two of the four cases actually improves performance.

Thanks, Linus and Mel, for tracking this tricky problem down! 

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

[PATCH 0/3] Reduce system overhead of automatic NUMA balancing

2015-03-23 Thread Mel Gorman
These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd"). It was known that the
performance in 3.19 was still better even if it is far less safe. This
series aims to restore the performance without compromising on safety.

Dave, you already tested patch 1 on its own but it would be nice to test
patches 1+2 and 1+2+3 separately just to be certain.

For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 and against
the three patches applied on top of it.

autonumabench
                                          3.19.0         4.0.0-rc4         4.0.0-rc4         4.0.0-rc4         4.0.0-rc4
                                         vanilla           vanilla      vmwrite-v5r8     preserve-v5r8     slowscan-v5r8
Time System-NUMA01              124.00 (  0.00%)  161.86 (-30.53%)  107.13 ( 13.60%)  103.13 ( 16.83%)  145.01 (-16.94%)
Time System-NUMA01_THEADLOCAL   115.54 (  0.00%)  107.64 (  6.84%)  131.87 (-14.13%)   83.30 ( 27.90%)   92.35 ( 20.07%)
Time System-NUMA02                9.35 (  0.00%)   10.44 (-11.66%)    8.95 (  4.28%)   10.72 (-14.65%)    8.16 ( 12.73%)
Time System-NUMA02_SMT            3.87 (  0.00%)    4.63 (-19.64%)    4.57 (-18.09%)    3.99 ( -3.10%)    3.36 ( 13.18%)
Time Elapsed-NUMA01             570.06 (  0.00%)  567.82 (  0.39%)  515.78 (  9.52%)  517.26 (  9.26%)  543.80 (  4.61%)
Time Elapsed-NUMA01_THEADLOCAL  393.69 (  0.00%)  384.83 (  2.25%)  384.10 (  2.44%)  384.31 (  2.38%)  380.73 (  3.29%)
Time Elapsed-NUMA02              49.09 (  0.00%)   49.33 ( -0.49%)   48.86 (  0.47%)   48.78 (  0.63%)   50.94 ( -3.77%)
Time Elapsed-NUMA02_SMT          47.51 (  0.00%)   47.15 (  0.76%)   47.98 ( -0.99%)   48.12 ( -1.28%)   49.56 ( -4.31%)

               3.19.0     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
              vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
User         46334.60      46391.94      44383.95      43971.89      44372.12
System         252.84        284.66        252.61        201.24        249.00
Elapsed       1062.14       1050.96        998.68       1000.94       1026.78

Overall the system CPU usage is comparable and the test is naturally a bit
variable. The slowing of the scanner hurts numa01, but on this machine that
is an adverse workload, and patches that dramatically help it often hurt
absolutely everything else.

Due to patch 2, the fault activity is interesting:

                       3.19.0   4.0.0-rc4    4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
                      vanilla     vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
Minor Faults          2097811     2656646      2597249       1981230       1636841
Major Faults              362         450          365           364           365

Note the impact of preserving the write bit across protection updates and
faults: it reduces the number of minor faults.
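
A rough sketch of why that helps (a toy model with made-up flag bits and
helper names, not the kernel's real pte_t layout or fault-handling code):
if the protection update drops the write bit, the first write after the
hinting fault has to take yet another fault just to get write access back.

#include <stdio.h>

/* Toy PTE flag bits -- purely illustrative, not the kernel's layout. */
#define PTE_PRESENT  0x1u
#define PTE_WRITE    0x2u
#define PTE_PROTNONE 0x4u       /* stands in for the NUMA hinting marker */

/* Old behaviour: turning the PTE into a hinting PTE dropped the write bit. */
static unsigned int make_numa_pte_old(unsigned int pte)
{
        return (pte & ~(PTE_PRESENT | PTE_WRITE)) | PTE_PROTNONE;
}

/* Patch 2 idea: keep the write bit so it can be restored directly when
 * the hinting fault is handled. */
static unsigned int make_numa_pte_new(unsigned int pte)
{
        return (pte & ~PTE_PRESENT) | PTE_PROTNONE;
}

/* Handling the hinting fault restores a normal mapping; writability
 * survives only if it was preserved across the protection update. */
static unsigned int handle_hint_fault(unsigned int pte)
{
        return (pte & ~PTE_PROTNONE) | PTE_PRESENT;
}

int main(void)
{
        unsigned int pte = PTE_PRESENT | PTE_WRITE;

        printf("old: writable after hint fault? %s\n",
               (handle_hint_fault(make_numa_pte_old(pte)) & PTE_WRITE) ? "yes" : "no");
        printf("new: writable after hint fault? %s\n",
               (handle_hint_fault(make_numa_pte_new(pte)) & PTE_WRITE) ? "yes" : "no");
        return 0;
}

In the "old" case the mapping comes back read-only after the hinting fault,
so the next write traps again; in the "new" case it stays writable, which
is broadly where the drop in minor faults above comes from.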

NUMA alloc hit                1229008     1217015     1191660     1178322     1199681
NUMA alloc miss                     0           0           0           0           0
NUMA interleave hit                 0           0           0           0           0
NUMA alloc local              1228514     1216317     1190871     1177448     1199021
NUMA base PTE updates       245706197   240041607   238195516   244704842   115012800
NUMA huge PMD updates          479530      468448      464868      477573      224487
NUMA page range updates     491225557   479886983   476207932   489222218   229950144
NUMA hint faults               659753      656503      641678      656926      294842
NUMA hint local faults         381604      373963      360478      337585      186249
NUMA hint local percent            57          56          56          51          63
NUMA pages migrated           5412140     6374899     6266530     5277468     5755096
AutoNUMA cost                   5121%       5083%       4994%       5097%       2388%

Here the impact of slowing the PTE scanner on migration failures is obvious,
as "NUMA base PTE updates" and "NUMA huge PMD updates" are massively reduced
even though the headline performance is very similar.
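
(For what it's worth, the "NUMA page range updates" row appears to be the
base PTE updates plus 512 base pages per huge PMD update, assuming 2M THP
on 4K pages: for the 3.19 vanilla column, 245706197 + 512 * 479530 =
491225557, which matches the table, and for slowscan, 115012800 +
512 * 224487 = 229950144 -- less than half the coverage of the other
kernels, in line with the lower hint fault counts.)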

As xfsrepair was the reported workload, here is the impact of the series on it.

xfsrepair
                                   3.19.0         4.0.0-rc4         4.0.0-rc4         4.0.0-rc4         4.0.0-rc4
                                  vanilla           vanilla      vmwrite-v5r8     preserve-v5r8     slowscan-v5r8
Min  real-fsmark        1183.29 (  0.00%)  1165.73 (