Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
On Tue, Mar 24, 2015 at 8:33 AM, Mel Gorman wrote:
> On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
>>
>> So it looks like the patch set fixes the remaining regression and in
>> two of the four cases actually improves performance.
>
> \o/

W00t.

> Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
> pte_mkyoung) are already in Andrew's tree. I'm expecting they'll arrive to
> you before 4.0 assuming nothing else goes pear-shaped.

Yup. Thanks Mel,

           Linus
Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 12:24:00PM +, Mel Gorman wrote:
> > These are three follow-on patches based on the xfsrepair workload Dave
> > Chinner reported was problematic in 4.0-rc1 due to changes in page table
> > management -- https://lkml.org/lkml/2015/3/1/226.
> >
> > Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> > read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> > Return the correct value for change_huge_pmd"). It was known that the
> > performance in 3.19 was still better even if it is far less safe. This
> > series aims to restore the performance without compromising on safety.
> >
> > Dave, you already tested patch 1 on its own but it would be nice to test
> > patches 1+2 and 1+2+3 separately just to be certain.
>
>                         3.19     4.0-rc4  +p1      +p2      +p3
> mm_migrate_pages        266,750  572,839  558,632  223,706  201,429
> run time                4m54s    7m50s    7m20s    5m07s    4m31s

Excellent, this is in line with predictions and roughly matches what I was
seeing on bare metal + real NUMA + spinning disk instead of KVM + fake
NUMA + SSD.

Editing slightly;

> numa stats from p1+p2:    numa_pte_updates 46109698
> numa stats from p1+p2+p3: numa_pte_updates 24460492

The big drop in PTE updates matches what I expected -- migration failures
should not lead to increased scan rates, which is what patch 3 fixes (there
is a rough sketch of the idea at the end of this mail). I'm also pleased
that there was not a drop in performance.

> OK, the summary with all patches applied:
>
> config                        3.19   4.0-rc1  4.0-rc4  4.0-rc5+
> defaults                      8m08s  9m34s    9m14s    6m57s
> -o ag_stride=-1               4m04s  4m38s    4m11s    4m06s
> -o bhash=101073               6m04s  17m43s   7m35s    6m13s
> -o ag_stride=-1,bhash=101073  4m54s  9m58s    7m50s    4m31s
>
> So it looks like the patch set fixes the remaining regression and in
> two of the four cases actually improves performance.

\o/

Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
pte_mkyoung) are already in Andrew's tree. I'm expecting they'll arrive to
you before 4.0 assuming nothing else goes pear-shaped.

> Thanks, Linus and Mel, for tracking this tricky problem down!

Thanks Dave for persisting with this and collecting the necessary data.
FWIW, I've marked the xfsrepair test case as a "large memory test". It'll
take time before the test machines have historical data for it, but in
theory, if this regresses again then I should spot it eventually.

--
Mel Gorman
SUSE Labs
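To make the scan-rate point above concrete, here is a minimal user-space
sketch of the idea (illustration only, not the kernel implementation; the
function names, constants and the exact back-off policy are invented for
this example). The point is simply that remote hinting faults normally
justify scanning more often, but once page migrations start failing there
is nothing to gain from extra PTE updates, so the scanner backs off
instead:

/*
 * Illustration only: a user-space sketch of the scan-rate idea, not the
 * kernel implementation.  Names, constants and the back-off policy are
 * invented for this example.
 */
#include <stdio.h>

#define SCAN_PERIOD_MIN_MS   1000
#define SCAN_PERIOD_MAX_MS  60000

static unsigned int clamp_period(unsigned int ms)
{
        if (ms < SCAN_PERIOD_MIN_MS)
                return SCAN_PERIOD_MIN_MS;
        if (ms > SCAN_PERIOD_MAX_MS)
                return SCAN_PERIOD_MAX_MS;
        return ms;
}

/*
 * Pick the next interval between NUMA hinting PTE update passes.
 * Remote faults normally argue for scanning more often, but if page
 * migrations are failing there is no benefit to faster scanning, so
 * back off instead.
 */
static unsigned int next_scan_period(unsigned int period_ms,
                                     unsigned long remote_faults,
                                     unsigned long local_faults,
                                     unsigned long migrations_failed)
{
        if (migrations_failed)
                return clamp_period(period_ms * 2);
        if (remote_faults > local_faults)
                return clamp_period(period_ms / 2);
        return clamp_period(period_ms + period_ms / 4);
}

int main(void)
{
        unsigned int period = 10000;

        period = next_scan_period(period, 5000, 1000, 0);
        printf("remote-heavy window, migrations OK:   %u ms\n", period);

        period = next_scan_period(period, 5000, 1000, 200);
        printf("remote-heavy window, migrations fail: %u ms\n", period);
        return 0;
}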
Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
On Mon, Mar 23, 2015 at 12:24:00PM +, Mel Gorman wrote:
> These are three follow-on patches based on the xfsrepair workload Dave
> Chinner reported was problematic in 4.0-rc1 due to changes in page table
> management -- https://lkml.org/lkml/2015/3/1/226.
>
> Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> Return the correct value for change_huge_pmd"). It was known that the
> performance in 3.19 was still better even if it is far less safe. This
> series aims to restore the performance without compromising on safety.
>
> Dave, you already tested patch 1 on its own but it would be nice to test
> patches 1+2 and 1+2+3 separately just to be certain.

                        3.19     4.0-rc4  +p1      +p2      +p3
mm_migrate_pages        266,750  572,839  558,632  223,706  201,429
run time                4m54s    7m50s    7m20s    5m07s    4m31s

numa stats from p1+p2:

numa_hit 8436537
numa_miss 0
numa_foreign 0
numa_interleave 30765
numa_local 8409240
numa_other 27297
numa_pte_updates 46109698
numa_huge_pte_updates 0
numa_hint_faults 44756389
numa_hint_faults_local 11841095
numa_pages_migrated 4868674
pgmigrate_success 4868674
pgmigrate_fail 0

numa stats from p1+p2+p3:

numa_hit 6991596
numa_miss 0
numa_foreign 0
numa_interleave 10336
numa_local 6983144
numa_other 8452
numa_pte_updates 24460492
numa_huge_pte_updates 0
numa_hint_faults 23677262
numa_hint_faults_local 5952273
numa_pages_migrated 3557928
pgmigrate_success 3557928
pgmigrate_fail 0

OK, the summary with all patches applied:

config                        3.19   4.0-rc1  4.0-rc4  4.0-rc5+
defaults                      8m08s  9m34s    9m14s    6m57s
-o ag_stride=-1               4m04s  4m38s    4m11s    4m06s
-o bhash=101073               6m04s  17m43s   7m35s    6m13s
-o ag_stride=-1,bhash=101073  4m54s  9m58s    7m50s    4m31s

So it looks like the patch set fixes the remaining regression and in
two of the four cases actually improves performance.

Thanks, Linus and Mel, for tracking this tricky problem down!

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
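For reference, counters such as numa_pte_updates, numa_hint_faults and
pgmigrate_success quoted above are exported by the kernel as fields of
those names in /proc/vmstat (present when NUMA balancing is built in). A
trivial helper along these lines, purely illustrative, prints just those
counters:

/*
 * Illustration only: print the NUMA balancing counters from /proc/vmstat.
 * On kernels without NUMA balancing support the numa_* lines are simply
 * absent and nothing is printed.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/vmstat", "r");
        char line[256];

        if (!f) {
                perror("/proc/vmstat");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "numa_", 5) == 0 ||
                    strncmp(line, "pgmigrate_", 10) == 0)
                        fputs(line, stdout);
        }

        fclose(f);
        return 0;
}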
[PATCH 0/3] Reduce system overhead of automatic NUMA balancing
These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd"). It was known that the
performance in 3.19 was still better even if it is far less safe. This
series aims to restore the performance without compromising on safety.

Dave, you already tested patch 1 on its own but it would be nice to test
patches 1+2 and 1+2+3 separately just to be certain.

For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 and the
three patches applied on top.

autonumabench
                                         3.19.0            4.0.0-rc4            4.0.0-rc4            4.0.0-rc4            4.0.0-rc4
                                        vanilla              vanilla         vmwrite-v5r8        preserve-v5r8        slowscan-v5r8
Time System-NUMA01             124.00 (  0.00%)     161.86 (-30.53%)     107.13 ( 13.60%)     103.13 ( 16.83%)     145.01 (-16.94%)
Time System-NUMA01_THEADLOCAL  115.54 (  0.00%)     107.64 (  6.84%)     131.87 (-14.13%)      83.30 ( 27.90%)      92.35 ( 20.07%)
Time System-NUMA02               9.35 (  0.00%)      10.44 (-11.66%)       8.95 (  4.28%)      10.72 (-14.65%)       8.16 ( 12.73%)
Time System-NUMA02_SMT           3.87 (  0.00%)       4.63 (-19.64%)       4.57 (-18.09%)       3.99 ( -3.10%)       3.36 ( 13.18%)
Time Elapsed-NUMA01            570.06 (  0.00%)     567.82 (  0.39%)     515.78 (  9.52%)     517.26 (  9.26%)     543.80 (  4.61%)
Time Elapsed-NUMA01_THEADLOCAL 393.69 (  0.00%)     384.83 (  2.25%)     384.10 (  2.44%)     384.31 (  2.38%)     380.73 (  3.29%)
Time Elapsed-NUMA02             49.09 (  0.00%)      49.33 ( -0.49%)      48.86 (  0.47%)      48.78 (  0.63%)      50.94 ( -3.77%)
Time Elapsed-NUMA02_SMT         47.51 (  0.00%)      47.15 (  0.76%)      47.98 ( -0.99%)      48.12 ( -1.28%)      49.56 ( -4.31%)

               3.19.0    4.0.0-rc4     4.0.0-rc4      4.0.0-rc4      4.0.0-rc4
              vanilla      vanilla  vmwrite-v5r8  preserve-v5r8  slowscan-v5r8
User         46334.60     46391.94      44383.95       43971.89       44372.12
System         252.84       284.66        252.61         201.24         249.00
Elapsed       1062.14      1050.96        998.68        1000.94        1026.78

Overall the system CPU usage is comparable and the test is naturally a bit
variable. The slowing of the scanner hurts numa01, but on this machine it
is an adverse workload, and patches that dramatically help it often hurt
absolutely everything else.

Due to patch 2, the fault activity is interesting:

               3.19.0    4.0.0-rc4     4.0.0-rc4      4.0.0-rc4      4.0.0-rc4
              vanilla      vanilla  vmwrite-v5r8  preserve-v5r8  slowscan-v5r8
Minor Faults  2097811      2656646       2597249        1981230        1636841
Major Faults      362          450           365            364            365

Note that preserving the write bit across protection updates and faults
reduces minor faults (a small sketch of the idea follows after the
xfsrepair numbers below).

                           3.19.0   4.0.0-rc4    4.0.0-rc4     4.0.0-rc4     4.0.0-rc4
                          vanilla     vanilla vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
NUMA alloc hit            1229008     1217015      1191660       1178322       1199681
NUMA alloc miss                 0           0            0             0             0
NUMA interleave hit             0           0            0             0             0
NUMA alloc local          1228514     1216317      1190871       1177448       1199021
NUMA base PTE updates   245706197   240041607    238195516     244704842     115012800
NUMA huge PMD updates      479530      468448       464868        477573        224487
NUMA page range updates 491225557   479886983    476207932         48918     229950144
NUMA hint faults           659753      656503       641678        656926        294842
NUMA hint local faults     381604      373963       360478        337585        186249
NUMA hint local percent        57          56           56            51            63
NUMA pages migrated       5412140     6374899      6266530       5277468       5755096
AutoNUMA cost               5121%       5083%        4994%         5097%         2388%

Here the impact of slowing the PTE scanner on migration failures is
obvious, as "NUMA base PTE updates" and "NUMA huge PMD updates" are
massively reduced even though the headline performance is very similar.

As xfsrepair was the reported workload, here is the impact of the series
on it.
xfsrepair
                                     3.19.0            4.0.0-rc4            4.0.0-rc4            4.0.0-rc4            4.0.0-rc4
                                    vanilla              vanilla         vmwrite-v5r8        preserve-v5r8        slowscan-v5r8
Min real-fsmark             1183.29 (  0.00%)    1165.73 (
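As an aside on the "preserve" kernel above: the write-bit point can be
illustrated with a small user-space model (hypothetical bit names and
helpers, not the actual kernel PTE code). Marking a page for a NUMA
hinting fault while remembering that it was writable means the hinting
fault can restore a writable mapping directly, rather than taking a second
write-protection fault afterwards:

/*
 * Illustration only: a user-space model of "preserve the write bit across
 * a NUMA hinting protection change".  The bit layout and helper names are
 * hypothetical; the real logic lives in the kernel's PTE protection-change
 * and NUMA hinting fault paths.
 */
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT   (1u << 0)
#define PTE_WRITE     (1u << 1)
#define PTE_PROTNONE  (1u << 2)   /* trap the next access as a hinting fault */

/* Arm a NUMA hinting fault but keep the page's write permission recorded. */
static uint32_t pte_mk_protnone(uint32_t pte)
{
        pte &= ~PTE_PRESENT;
        pte |= PTE_PROTNONE;
        return pte;               /* PTE_WRITE is deliberately left alone */
}

/* Resolve the hinting fault: restore the mapping with the saved bits. */
static uint32_t pte_restore(uint32_t pte)
{
        pte &= ~PTE_PROTNONE;
        pte |= PTE_PRESENT;
        return pte;
}

int main(void)
{
        uint32_t pte = PTE_PRESENT | PTE_WRITE;

        pte = pte_mk_protnone(pte);
        printf("write bit kept while hinting fault is armed: %s\n",
               (pte & PTE_WRITE) ? "yes" : "no");

        pte = pte_restore(pte);
        printf("mapping restored writable, no second fault:  %s\n",
               (pte & PTE_WRITE) ? "yes" : "no");
        return 0;
}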