Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Fri, Mar 20, 2015 at 10:02:23AM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner da...@fromorbit.com wrote:
> > Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> > 7m36s. IOWs, a bit better, but not significantly. Page migrations
> > are pretty much unchanged, too:
> >
> >     558,632      migrate:mm_migrate_pages      ( +- 6.38% )
>
> Ok. That was kind of the expected thing. I don't really know the NUMA
> fault rate limiting code, but one thing that strikes me is that if it
> tries to balance the NUMA faults against the *regular* faults, then
> maybe just the fact that we end up taking more COW faults after a NUMA
> fault then means that the NUMA rate limiting code now gets over-eager
> (because it sees all those extra non-NUMA faults).
>
> Mel, does that sound at all possible? I really have never looked at
> the magic automatic rate handling..

It should not be trying to balance against regular faults as it has no
information on them. The trapping of additional faults to mark the PTE
writable will alter timing, so it indirectly affects how many migration
faults there are, but this is only a side-effect IMO.

There is more overhead now due to losing the writable information, and
that should be reduced, so I tried a few approaches. Ultimately, the one
that performed best and was easiest to understand simply preserved the
writable bit across the protection update and page fault. I'll post it
later when I stick a changelog on it.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

As a side-note, I did test a patch that checked pte_write and preserved
it across both faults and setting the protections. It did not alter
migration activity much, but there was a drop in minor faults - a 20%
drop in autonumabench and a 58% drop in the xfs_repair workload. I'm
assuming this is due to refaults to mark pages writable. The patch looks
and is hacky, so I won't post it to save people bleaching their eyes.
I'll spend some time soon (hopefully today) on a smooth way of falling
through to WP checks after trapping a NUMA fault.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my
> > original measurements of 4m54s (pte_hack came in at 5m20s).
>
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a factor of two than a factor of 20.
>
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the 1.4% from
> tlb_invalidate_range() vs pretty much everything from migration).
>
> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.

TNF_NO_GROUP affects whether the scheduler tries to group related
processes together. Whether migration occurs depends on what node a
process is scheduled on. If processes are aggressively grouped
inappropriately, then it is possible there is a bug that causes the
load balancer to move processes off a node (possible migration) with
NUMA balancing trying to pull them back (another possible migration).
Small bugs there can result in excessive migration.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner da...@fromorbit.com wrote:
> Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
> 7m36s. IOWs, a bit better, but not significantly. Page migrations are
> pretty much unchanged, too:
>
>     558,632      migrate:mm_migrate_pages      ( +- 6.38% )

Ok. That was kind of the expected thing. I don't really know the NUMA
fault rate limiting code, but one thing that strikes me is that if it
tries to balance the NUMA faults against the *regular* faults, then
maybe just the fact that we end up taking more COW faults after a NUMA
fault then means that the NUMA rate limiting code now gets over-eager
(because it sees all those extra non-NUMA faults).

Mel, does that sound at all possible? I really have never looked at the
magic automatic rate handling..

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> > So I think there's something I'm missing. For non-shared mappings, I
> > still have the idea that pte_dirty should be the same as pte_write.
> > And yet, your testing of 3.19 shows that it's a big difference.
> > There's clearly something I'm completely missing.
>
> Ahh. The normal page table scanning and page fault handling both clear
> and set the dirty bit together with the writable one. But fork() will
> clear the writable bit without clearing dirty. For some reason I
> thought it moved the dirty bit into the struct page like the VM
> scanning does, but that was just me having a brainfart. So yeah,
> pte_dirty doesn't have to match pte_write even under perfectly normal
> circumstances. Maybe there are other cases.
>
> Not that I see a lot of forking in the xfs repair case either, so..
>
> Dave, mind re-running the plain 3.19 numbers to really verify that the
> pte_dirty/pte_write change really made that big of a difference. Maybe
> your recollection of ~55,000 migrate_pages events was faulty. If the
> pte_write -> pte_dirty change is the *only* difference, it's still
> very odd how that one difference would make migrate_rate go from ~55k
> to 471k. That's an order of magnitude difference, for what really
> shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email. That
said, the original measurement might have been faulty. I ran the numbers
again on the 3.19 kernel I saved away from the original testing. That
came up at 235k, which is pretty much the same as yesterday's test. The
runtime, however, is unchanged from my original measurements of 4m54s
(pte_hack came in at 5m20s).

Wondering where the 55k number came from, I played around with when I
started the measurement - all the numbers since I did the bisect have
come from starting it roughly 130 AGs into phase 3, where the memory
footprint stabilises and the TLB flush overhead kicks in. However, if I
start the measurement at the same time as the repair test, I get
something much closer to the 55k number. I also note that my original
4.0-rc1 numbers were much lower than the more recent steady state
measurements (360k vs 470k), so I'd say the original numbers weren't
representative of the steady state behaviour and so can be ignored...

> Maybe a system update has changed libraries and memory allocation
> patterns, and there is something bigger than that one-liner
> pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but otherwise
the system libraries are unchanged.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
> > My recollection wasn't faulty - I pulled it from an earlier email.
> > That said, the original measurement might have been faulty. I ran
> > the numbers again on the 3.19 kernel I saved away from the original
> > testing. That came up at 235k, which is pretty much the same as
> > yesterday's test. The runtime, however, is unchanged from my
> > original measurements of 4m54s (pte_hack came in at 5m20s).
>
> Ok. Good. So the "more than an order of magnitude difference" was
> really about measurement differences, not quite as real. Looks like
> more a factor of two than a factor of 20.
>
> Did you do the profiles the same way? Because that would explain the
> differences in the TLB flush percentages too (the 1.4% from
> tlb_invalidate_range() vs pretty much everything from migration).

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of memory it
allocates). It's not until the cache fills and it starts recycling old
buffers rather than allocating new ones that the TLB flush problem
dominates the profiles.

> The runtime variation does show that there's some *big* subtle
> difference for the numa balancing in the exact TNF_NO_GROUP details.
> It must be *very* unstable for it to make that big of a difference.
> But I feel at least a *bit* better about "unstable algorithm changes a
> small variation into a factor-of-two" vs that crazy factor-of-20.
>
> Can you try Mel's change to make it use
>
>     if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the
> vma details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. Might take an hour or two before I get to it, though...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
> Can you try Mel's change to make it use
>
>     if (!(vma->vm_flags & VM_WRITE))
>
> instead of the pte details? Again, on otherwise plain 3.19, just so
> that we have a baseline. I'd be *so* much happier with checking the
> vma details over per-pte details, especially ones that change over the
> lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

           266,750      migrate:mm_migrate_pages      ( +- 7.43% )

      10.002032292 seconds time elapsed               ( +- 0.00% )

Bit more variance there than the pte checking, but runtime difference is
in the noise - 5m4s vs 4m54s - and profiles are identical to the pte
checking version.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
> Bit more variance there than the pte checking, but runtime difference
> is in the noise - 5m4s vs 4m54s - and profiles are identical to the
> pte checking version.

Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well as
the original !pte_write() test.

Now, can you check that on top of rc4? If I've gotten everything right,
we now have:

 - plain 3.19 (pte_write):          4m54s
 - 3.19 with vm_flags & VM_WRITE:   5m4s
 - 3.19 with pte_dirty:             5m20s

so the pte_dirty version seems to have been a bad choice indeed.

For 4.0-rc4 (which uses pte_dirty) you had 7m50s, so it's still _much_
worse, but I'm wondering whether that VM_WRITE test will at least shrink
the difference like it does for 3.19.

And the VM_WRITE test should be stable and not have any subtle
interaction with the other changes that the numa pte things introduced.
It would be good to see if the profiles then pop something *else* up as
the performance difference (which I'm sure will remain, since the 7m50s
was so far off).

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
> On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
> > Bit more variance there than the pte checking, but runtime
> > difference is in the noise - 5m4s vs 4m54s - and profiles are
> > identical to the pte checking version.
>
> Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
> as the original !pte_write() test.
>
> Now, can you check that on top of rc4? If I've gotten everything
> right, we now have:
>
>  - plain 3.19 (pte_write):          4m54s
>  - 3.19 with vm_flags & VM_WRITE:   5m4s
>  - 3.19 with pte_dirty:             5m20s

*nod*

> so the pte_dirty version seems to have been a bad choice indeed. For
> 4.0-rc4 (which uses pte_dirty) you had 7m50s, so it's still _much_
> worse, but I'm wondering whether that VM_WRITE test will at least
> shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and 7m36s.
IOWs, a bit better, but not significantly. Page migrations are pretty
much unchanged, too:

           558,632      migrate:mm_migrate_pages      ( +- 6.38% )

> And the VM_WRITE test should be stable and not have any subtle
> interaction with the other changes that the numa pte things
> introduced. It would be good to see if the profiles then pop something
> *else* up as the performance difference (which I'm sure will remain,
> since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU time
is still being spent sending IPIs on the TLB flush path.

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
> My recollection wasn't faulty - I pulled it from an earlier email.
> That said, the original measurement might have been faulty. I ran the
> numbers again on the 3.19 kernel I saved away from the original
> testing. That came up at 235k, which is pretty much the same as
> yesterday's test. The runtime, however, is unchanged from my original
> measurements of 4m54s (pte_hack came in at 5m20s).

Ok. Good. So the "more than an order of magnitude difference" was really
about measurement differences, not quite as real. Looks like more a
factor of two than a factor of 20.

Did you do the profiles the same way? Because that would explain the
differences in the TLB flush percentages too (the 1.4% from
tlb_invalidate_range() vs pretty much everything from migration).

The runtime variation does show that there's some *big* subtle
difference for the numa balancing in the exact TNF_NO_GROUP details. It
must be *very* unstable for it to make that big of a difference. But I
feel at least a *bit* better about "unstable algorithm changes a small
variation into a factor-of-two" vs that crazy factor-of-20.

Can you try Mel's change to make it use

    if (!(vma->vm_flags & VM_WRITE))

instead of the pte details? Again, on otherwise plain 3.19, just so that
we have a baseline. I'd be *so* much happier with checking the vma
details over per-pte details, especially ones that change over the
lifetime of the pte entry, and the NUMA code explicitly mucks with.

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
> - something completely different that I am entirely missing
>
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.

Minimally, there is still the window where we clear the PTE to set the
protections. During that window, a fault can occur. In the old code,
which was inherently racy and unsafe, the fault might still go ahead,
deferring a potential migration for a short period. In the current code,
it'll stall on the lock, notice the PTE has changed and refault, so the
overhead is very different but functionally correct.

In the old code, pte_write had complex interactions with background
cleaning and sync in the case of file mappings (not applicable to Dave's
case, but still unpredictable behaviour). pte_dirty is close, but there
are interactions with the application as the timing of writes vs the PTE
scanner matters. Even if we restored the original behaviour, it would
still be very difficult to understand all the interactions between
userspace and kernel.

The patch below should be tested because it's clearer what the intent
is. Using the VMA flags is coarse, but it's not vulnerable to timing
artifacts that behave differently depending on the machine. My
preliminary testing shows it helps, but not by much. It does not restore
performance to where it was, but it's easier to understand, which is
important if there are changes in the scheduler later.

In combination, I also think that slowing PTE scanning when migration
fails is the correct action, even if it is unrelated to the patch Dave
bisected to. It's stupid to increase scanning rates and incur more
faults when migrations are failing, so I'll be testing that next.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		flags |= TNF_FAULT_LOCAL;
 	}
 
-	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
-	 *
-	 * FIXME! This checks pmd_dirty() as an approximation of
-	 * "is this a read-only page", since checking pmd_write()
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
-	 */
-	if (!pmd_dirty(pmd))
+	/* See similar comment in do_numa_page for explanation */
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
+	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+	 * much anyway since they can be in shared cache state. This misses
+	 * the case where a mapping is writable but the process never writes
+	 * to it but pte_write gets cleared during protection updates and
+	 * pte_dirty has unpredictable behaviour between PTE scan updates,
+	 * background writeback, dirty balancing and application behaviour.
 	 *
-	 * FIXME! This checks pmd_dirty() as an approximation of
-	 * "is this a read-only page", since checking pmd_write()
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
+	 * TODO: Note that the ideal here would be to avoid a situation where a
+	 * NUMA fault is taken immediately followed by a write fault in
+	 * some cases which would have lower overhead overall but would be
+	 * invasive as the fault paths would need to be unified.
 	 */
-	if (!pte_dirty(pte))
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
 
 	/*
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 19, 2015 at 7:10 AM, Mel Gorman mgor...@suse.de wrote:
>
> -	if (!pmd_dirty(pmd))
> +	/* See similar comment in do_numa_page for explanation */
> +	if (!(vma->vm_flags & VM_WRITE))

Yeah, that would certainly be a whole lot more obvious than all the "if
this particular pte/pmd looks like X" tests.

So that, together with scanning rate improvements (this *does* seem to
be somewhat chaotic, so it's quite possible that the current scanning
rate thing is just fairly unstable) is likely the right thing.

I'd just like to _understand_ why that write/dirty bit makes such a
difference. I thought I understood what was going on, and was happy, and
then Dave came along with his crazy numbers.

Damn you Dave, and damn your numbers and facts and stuff. Sometimes I
much prefer ignorant bliss.

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.

Ahh. The normal page table scanning and page fault handling both clear
and set the dirty bit together with the writable one. But fork() will
clear the writable bit without clearing dirty. For some reason I thought
it moved the dirty bit into the struct page like the VM scanning does,
but that was just me having a brainfart.

So yeah, pte_dirty doesn't have to match pte_write even under perfectly
normal circumstances. Maybe there are other cases.

Not that I see a lot of forking in the xfs repair case either, so..

Dave, mind re-running the plain 3.19 numbers to really verify that the
pte_dirty/pte_write change really made that big of a difference. Maybe
your recollection of ~55,000 migrate_pages events was faulty. If the
pte_write -> pte_dirty change is the *only* difference, it's still very
odd how that one difference would make migrate_rate go from ~55k to
471k. That's an order of magnitude difference, for what really shouldn't
be a big change.

I'm running a kernel right now with a hacky update_mmu_cache() that
warns if pte_dirty is ever different from pte_write():

+void update_mmu_cache(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep)
+{
+	if (!(vma->vm_flags & VM_SHARED)) {
+		pte_t now = READ_ONCE(*ptep);
+		if (!pte_write(now) != !pte_dirty(now)) {
+			static int count = 20;
+			static unsigned int prev = 0;
+			unsigned int val = pte_val(now) & 0xfff;
+			if (prev != val && count) {
+				prev = val;
+				count--;
+				WARN(1, "pte value %x", val);
+			}
+		}
+	}
+}

I haven't seen a single warning so far (and there I wrote all that code
to limit repeated warnings), although admittedly update_mmu_cache()
isn't called for all cases where we change a pte (not for the fork case,
for example). But it *is* called for the page faulting cases.

Maybe a system update has changed libraries and memory allocation
patterns, and there is something bigger than that one-liner
pte_dirty/write change going on?

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 17, 2015 at 3:08 PM, Dave Chinner da...@fromorbit.com wrote:
> > Damn. From a performance number standpoint, it looked like we zoomed
> > in on the right thing. But now it's migrating even more pages than
> > before. Odd.
>
> Throttling problem, like Mel originally suspected?

That doesn't much make sense for the original bisect you did, though.
Although if there are two different issues, maybe that bisect was wrong.
Or rather, incomplete.

> > Can you do a simple stupid test? Apply that commit 53da3bc2ba9e
> > ("mm: fix up numa read-only thread grouping logic") to 3.19, so that
> > it uses the same pte_dirty() logic as 4.0-rc4. That *should* make
> > the 3.19 and 4.0-rc4 numbers comparable.
>
> patched 3.19 numbers on this test are slightly worse than stock 3.19,
> but nowhere near as bad as 4.0-rc4:
>
>     241,718      migrate:mm_migrate_pages      ( +- 5.17% )

Ok, that's still much worse than plain 3.19, which was ~55,000. Assuming
your memory/measurements were the same.

So apparently the pte_write() -> pte_dirty() check isn't equivalent at
all. My thinking was that for the common case (ie private mappings) it
would be *exactly* the same, because all normal COW pages turn dirty at
the same time they turn writable (and, in page_mkclean_one(), turn clean
and read-only again at the same time).

But if the numbers change that much, then clearly my simplistic "they
are the same in practice" is just complete BS.

So why am I wrong? Why is testing for dirty not the same as testing for
writable? I can see a few cases:

 - your load has lots of writable (but not written-to) shared memory,
   and maybe the test should be something like

	pte_dirty(pte) ||
		((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))

   and we really should have some helper function for this logic.

 - something completely different that I am entirely missing

What am I missing?

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
> So why am I wrong? Why is testing for dirty not the same as testing
> for writable? I can see a few cases:
>
>  - your load has lots of writable (but not written-to) shared memory

Hmm. I tried to look at the xfsprogs sources, and I don't see any
MAP_SHARED activity. It looks like it's just using pread64/pwrite64, and
the only MAP_SHARED is for the xfs_io mmap test thing, not for
xfs_repair. So I don't see any shared mappings, but I don't know the
code-base.

>  - something completely different that I am entirely missing

So I think there's something I'm missing. For non-shared mappings, I
still have the idea that pte_dirty should be the same as pte_write. And
yet, your testing of 3.19 shows that it's a big difference. There's
clearly something I'm completely missing.

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
> On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> > So why am I wrong? Why is testing for dirty not the same as testing
> > for writable? I can see a few cases:
> >
> >  - your load has lots of writable (but not written-to) shared memory
>
> Hmm. I tried to look at the xfsprogs sources, and I don't see any
> MAP_SHARED activity. It looks like it's just using pread64/pwrite64,
> and the only MAP_SHARED is for the xfs_io mmap test thing, not for
> xfs_repair. So I don't see any shared mappings, but I don't know the
> code-base.

Right - all the mmap activity in the xfs_repair test is coming from
memory allocation through glibc - we don't use mmap() directly anywhere
in xfs_repair. FWIW, all the IO into these pages that are allocated is
being done via direct IO, if that makes any difference...

> >  - something completely different that I am entirely missing
>
> So I think there's something I'm missing. For non-shared mappings, I
> still have the idea that pte_dirty should be the same as pte_write.
> And yet, your testing of 3.19 shows that it's a big difference.
> There's clearly something I'm completely missing.

This level of pte interactions is beyond my level of knowledge, so I'm
afraid at this point I'm not going to be much help other than to test
patches and report the results.

FWIW, here's the distribution of the hash table we are iterating over.
There are a lot of search misses, which means we are doing a lot of
pointer chasing, but the distribution is centred directly around the
goal of 8 entries per chain and there is no long tail:

libxfs_bcache: 0x67e110
Max supported entries = 808584
Max utilized entries  = 808584
Active entries        = 808583
Hash table size       = 101073
Hits                  = 9789987
Misses                = 8224234
Hit ratio             = 54.35
MRU  0 entries =   4667 (  0%)
MRU  1 entries =      0 (  0%)
MRU  2 entries =      4 (  0%)
MRU  3 entries = 797447 ( 98%)
MRU  4 entries =    653 (  0%)
MRU  5 entries =      0 (  0%)
MRU  6 entries =   2755 (  0%)
MRU  7 entries =   1518 (  0%)
MRU  8 entries =   1518 (  0%)
MRU  9 entries =      0 (  0%)
MRU 10 entries =     21 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Hash buckets with  0 entries     30 (  0%)
Hash buckets with  1 entries    241 (  0%)
Hash buckets with  2 entries   1019 (  0%)
Hash buckets with  3 entries   2787 (  1%)
Hash buckets with  4 entries   5838 (  2%)
Hash buckets with  5 entries   9144 (  5%)
Hash buckets with  6 entries  12165 (  9%)
Hash buckets with  7 entries  14194 ( 12%)
Hash buckets with  8 entries  14387 ( 14%)
Hash buckets with  9 entries  12742 ( 14%)
Hash buckets with 10 entries  10253 ( 12%)
Hash buckets with 11 entries   7308 (  9%)
Hash buckets with 12 entries   4872 (  7%)
Hash buckets with 13 entries   2869 (  4%)
Hash buckets with 14 entries   1578 (  2%)
Hash buckets with 15 entries    894 (  1%)
Hash buckets with 16 entries    430 (  0%)
Hash buckets with 17 entries    188 (  0%)
Hash buckets with 18 entries     88 (  0%)
Hash buckets with 19 entries     24 (  0%)
Hash buckets with 20 entries     11 (  0%)
Hash buckets with 21 entries     10 (  0%)
Hash buckets with 22 entries      1 (  0%)

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 17, 2015 at 02:30:57PM -0700, Linus Torvalds wrote: On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner da...@fromorbit.com wrote: On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I was using during steady state shows: 471,752 migrate:mm_migrate_pages ( +- 7.38% ) The migrate pages rate is even higher than in 4.0-rc1 (~360,000) and 3.19 (~55,000), so that looks like even more of a problem than before. Hmm. How stable are those numbers boot-to-boot? I've run the test several times but only profiles once so far. runtimes were 7m45, 7m50, 7m44s, 8m2s, and the profiles came from the 8m2s run. reboot, run again: $ sudo perf stat -a -r 6 -e migrate:mm_migrate_pages sleep 10 Performance counter stats for 'system wide' (6 runs): 572,839 migrate:mm_migrate_pages( +- 3.15% ) 10.001664694 seconds time elapsed ( +- 0.00% ) $ And just to confirm, a minute later, still in phase 3: 590,974 migrate:mm_migrate_pages ( +- 2.86% ) Reboot, run again: 575,344 migrate:mm_migrate_pages ( +- 0.70% ) So there is boot-to-boot variation, but it doesn't look like it gets any better That kind of extreme spread makes me suspicious. It's also interesting that if the numbers really go up even more (and by that big amount), then why does there seem to be almost no correlation with performance (which apparently went up since rc1, despite migrate_pages getting even _worse_). And the profile looks like: - 43.73% 0.05% [kernel][k] native_flush_tlb_others Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%). And flush_tlb_page() does seem to be called about ten times more (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible at 0.13%) Damn. From a performance number standpoint, it looked like we zoomed in on the right thing. But now it's migrating even more pages than before. Odd. Throttling problem, like Mel originally suspected? 
And the vmstats are:

3.19:    numa_hit 5163221  numa_local 5153127
4.0-rc1: numa_hit 36952043 numa_local 36927384
4.0-rc4: numa_hit 23447345 numa_local 23438564

Page migrations are still up by a factor of ~20 over 3.19.

The thing is, those numa_hit things come from the zone_statistics() call in buffered_rmqueue(), which in turn is simply called from the memory allocator. That has *nothing* to do with virtual memory, and everything to do with actual physical memory allocations. So the load is simply allocating a lot more pages, presumably for those stupid migration events. But then it doesn't correlate with performance anyway.. Can you do a simple stupid test? Apply that commit 53da3bc2ba9e (mm: fix up numa read-only thread grouping logic) to 3.19, so that it uses the same pte_dirty() logic as 4.0-rc4. That *should* make the 3.19 and 4.0-rc4 numbers comparable.

Patched 3.19 numbers on this test are slightly worse than stock 3.19, but nowhere near as bad as 4.0-rc4:

        241,718 migrate:mm_migrate_pages ( +- 5.17% )

So that pte_write to pte_dirty change makes this go from ~55k to 240k, and runtime go from 4m54s to 5m20s. vmstats:

numa_hit 9162476
numa_miss 0
numa_foreign 0
numa_interleave 10685
numa_local 9153740
numa_other 8736
numa_pte_updates 49582103
numa_huge_pte_updates 0
numa_hint_faults 48075098
numa_hint_faults_local 12974704
numa_pages_migrated 5748256
pgmigrate_success 5748256
pgmigrate_fail 0

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner da...@fromorbit.com wrote: On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I was using during steady state shows: 471,752 migrate:mm_migrate_pages ( +- 7.38% ) The migrate pages rate is even higher than in 4.0-rc1 (~360,000) and 3.19 (~55,000), so that looks like even more of a problem than before.

Hmm. How stable are those numbers boot-to-boot? That kind of extreme spread makes me suspicious. It's also interesting that if the numbers really go up even more (and by that big amount), then why does there seem to be almost no correlation with performance (which apparently went up since rc1, despite migrate_pages getting even _worse_). And the profile looks like:

- 43.73% 0.05% [kernel] [k] native_flush_tlb_others

Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%). And flush_tlb_page() does seem to be called about ten times more (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible at 0.13%). Damn. From a performance number standpoint, it looked like we zoomed in on the right thing. But now it's migrating even more pages than before. Odd. And the vmstats are:

3.19:    numa_hit 5163221  numa_local 5153127
4.0-rc1: numa_hit 36952043 numa_local 36927384
4.0-rc4: numa_hit 23447345 numa_local 23438564

Page migrations are still up by a factor of ~20 over 3.19. The thing is, those numa_hit things come from the zone_statistics() call in buffered_rmqueue(), which in turn is simply called from the memory allocator. That has *nothing* to do with virtual memory, and everything to do with actual physical memory allocations. So the load is simply allocating a lot more pages, presumably for those stupid migration events. But then it doesn't correlate with performance anyway.. Can you do a simple stupid test? Apply that commit 53da3bc2ba9e (mm: fix up numa read-only thread grouping logic) to 3.19, so that it uses the same pte_dirty() logic as 4.0-rc4.
That *should* make the 3.19 and 4.0-rc4 numbers comparable. It does make me wonder if your load is chaotic wrt scheduling. The load presumably wants to spread out across all cpus, but then the numa code tries to group things together for numa accesses, but depending on just random allocation patterns and layout in the hash tables, there either are patterns with page access or there aren't. Which is kind of why I wonder how stable those numbers are boot to boot. Maybe this is at least partly about lucky allocation patterns.

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 17, 2015 at 12:06 AM, Dave Chinner da...@fromorbit.com wrote: To close the loop here, now I'm back home and can run tests:

config                        3.19   4.0-rc1  4.0-rc4
defaults                      8m08s  9m34s    9m14s
-o ag_stride=-1               4m04s  4m38s    4m11s
-o bhash=101073               6m04s  17m43s   7m35s
-o ag_stride=-1,bhash=101073  4m54s  9m58s    7m50s

It's better but there are still significant regressions, especially for the large memory footprint cases. I haven't had a chance to look at any stats or profiles yet, so I don't know yet whether this is still page fault related or some other problem.

Ok. I'd love to see some data on what changed between 3.19 and rc4 in the profiles, just to see whether it's more page faults due to extra COW, or whether it's due to more TLB flushes because of the pte_write() vs pte_dirty() differences. I'm *guessing* a lot of the remaining issues are due to extra page fault overhead because I'd expect write/dirty to be fairly 1:1, but there could be differences due to shared memory use and/or just writebacks of dirty pages that become clean. I guess you can also see in vmstat.mm_migrate_pages whether it's because of excessive migration (because of bad grouping) or not. So not just profile data. At the same time, I feel fairly happy about the situation - we at least understand what is going on, and the 3x worse performance case is at least gone. Even if that last case still looks horrible. So it's still a bad performance regression, but at the same time I think your test setup (big 500 TB filesystem, but then a fake-numa thing with just 4GB per node) is specialized and unrealistic enough that I don't feel it's all that relevant from a *real-world* standpoint, and so I wouldn't be uncomfortable saying "ok, the page table handling cleanup caused some issues, but we know about them and how to fix them longer-term". So I don't consider this a 4.0 showstopper or a "we need to revert for now" issue.
If it's a case of "we take a lot more page faults because we handle the NUMA fault and then have a COW fault almost immediately", then the fix is likely to do the same early-cow that the normal non-numa-fault case does. In fact, my gut feel is that we should try to unify that numa/regular fault handling path a bit more, but that would be a pretty invasive patch.

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Tue, Mar 10, 2015 at 04:55:52PM -0700, Linus Torvalds wrote: On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner da...@fromorbit.com wrote: On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote: What's your virtual environment setup? Kernel config, and virtualization environment to actually get that odd fake NUMA thing happening? I don't have the exact .config with me (test machines at home are shut down because I'm half a world away), but it's pretty much this (copied and munged from a similar test vm on my laptop): [ snip snip ] Ok, I hate debugging by symptoms anyway, so I didn't do any of this, and went back to actually *thinking* about the code instead of trying to reproduce this and figure things out by trial and error. And I think I figured it out. SNIP

I believe you're correct and it matches what was observed. I'm still travelling and wireless is dirt but managed to queue a test using pmd_dirty:

                                  3.19.0          4.0.0-rc1       4.0.0-rc1
                                  vanilla         vanilla         ptewrite-v1r20
Time User-NUMA01                25695.96 (  0.00%)  32883.59 (-27.97%)  24012.80 (  6.55%)
Time User-NUMA01_THEADLOCAL     17404.36 (  0.00%)  17453.20 ( -0.28%)  17950.54 ( -3.14%)
Time User-NUMA02                 2037.65 (  0.00%)   2063.70 ( -1.28%)   2046.88 ( -0.45%)
Time User-NUMA02_SMT              981.02 (  0.00%)    983.70 ( -0.27%)    983.68 ( -0.27%)
Time System-NUMA01                194.70 (  0.00%)    602.44 (-209.42%)   158.90 ( 18.39%)
Time System-NUMA01_THEADLOCAL      98.52 (  0.00%)     78.10 ( 20.73%)    107.66 ( -9.28%)
Time System-NUMA02                  9.28 (  0.00%)      6.47 ( 30.28%)      9.25 (  0.32%)
Time System-NUMA02_SMT              3.79 (  0.00%)      5.06 (-33.51%)      3.92 ( -3.43%)
Time Elapsed-NUMA01               558.84 (  0.00%)    755.96 (-35.27%)    532.41 (  4.73%)
Time Elapsed-NUMA01_THEADLOCAL    382.54 (  0.00%)    382.22 (  0.08%)    390.48 ( -2.08%)
Time Elapsed-NUMA02                49.83 (  0.00%)     49.38 (  0.90%)     49.79 (  0.08%)
Time Elapsed-NUMA02_SMT            46.59 (  0.00%)     47.70 ( -2.38%)     47.77 ( -2.53%)
Time CPU-NUMA01                  4632.00 (  0.00%)   4429.00 (  4.38%)   4539.00 (  2.01%)
Time CPU-NUMA01_THEADLOCAL       4575.00 (  0.00%)   4586.00 ( -0.24%)   4624.00 ( -1.07%)
Time CPU-NUMA02                  4107.00 (  0.00%)   4191.00 ( -2.05%)   4129.00 ( -0.54%)
Time CPU-NUMA02_SMT              2113.00 (  0.00%)   2072.00 (  1.94%)   2067.00 (  2.18%)

             3.19.0   4.0.0-rc1  4.0.0-rc1
             vanilla  vanilla    ptewrite-v1r20
User       46119.12   53384.29   44994.10
System       306.41     692.14     279.78
Elapsed     1039.88    1236.87    1022.92

There are still some differences but it's much closer to what it was. The balancing stats are almost looking similar to 3.19:

NUMA base PTE updates    222840103  304513172  230724075
NUMA huge PMD updates       434894     594467     450274
NUMA page range updates  445505831  608880276  461264363
NUMA hint faults            601358     733491     626176
NUMA hint local faults      371571     511530     359215
NUMA hint local percent         61         69         57
NUMA pages migrated        7073177   26366701    6829196

XFS repair on the same machine is not fully restored either, but it is a big enough move in the right direction to indicate this was the relevant change.

xfsrepair                  3.19.0            4.0.0-rc1          4.0.0-rc1
                           vanilla           vanilla            ptewrite-v1r20
Amean real-fsmark        1166.28 (  0.00%)  1166.63 ( -0.03%)  1184.97 ( -1.60%)
Amean syst-fsmark        4025.87 (  0.00%)  4020.94 (  0.12%)  4071.10 ( -1.12%)
Amean real-xfsrepair      447.66 (  0.00%)   507.85 (-13.45%)   460.94 ( -2.97%)
Amean syst-xfsrepair      202.93 (  0.00%)   519.88 (-156.19%)  282.45 (-39.19%)
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman mgor...@suse.de wrote: I believe you're correct and it matches what was observed. I'm still travelling and wireless is dirt but managed to queue a test using pmd_dirty Ok, thanks. I'm not entirely happy with that change, and I suspect the whole heuristic should be looked at much more (maybe it should also look at whether it's executable, for example), but it's a step in the right direction. So I committed it and added a comment, and wrote a commit log about it. I suspect any further work is post-4.0-release, unless somebody comes up with something small and simple and obviously better.

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote: On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman mgor...@suse.de wrote: I believe you're correct and it matches what was observed. I'm still travelling and wireless is dirt but managed to queue a test using pmd_dirty Ok, thanks. I'm not entirely happy with that change, and I suspect the whole heuristic should be looked at much more (maybe it should also look at whether it's executable, for example), but it's a step in the right direction.

I can follow up when I'm back in work properly. As you have already pulled this in directly, can you also consider pulling in "mm: thp: return the correct value for change_huge_pmd" please? The other two patches were very minor and can be resent through the normal paths later.
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote: On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner da...@fromorbit.com wrote: Also, is there some sane way for me to actually see this behavior on a regular machine with just a single socket? Dave is apparently running in some fake-numa setup, I'm wondering if this is easy enough to reproduce that I could see it myself. Should be - I don't actually use 500TB of storage to generate this - 50GB on an SSD is all you need from the storage side. I just use a sparse backing file to make it look like a 500TB device. :P What's your virtual environment setup? Kernel config, and virtualization environment to actually get that odd fake NUMA thing happening?

I don't have the exact .config with me (test machines at home are shut down because I'm half a world away), but it's pretty much this (copied and munged from a similar test vm on my laptop):

$ cat run-vm-4.sh
sudo qemu-system-x86_64 \
    -machine accel=kvm \
    -no-fd-bootchk \
    -localtime \
    -boot c \
    -serial pty \
    -nographic \
    -alt-grab \
    -smp 16 -m 16384 \
    -hda /data/vm-2/root.img \
    -drive file=/vm/vm-4/vm-4-test.img,if=virtio,cache=none \
    -drive file=/vm/vm-4/vm-4-scratch.img,if=virtio,cache=none \
    -drive file=/vm/vm-4/vm-4-500TB.img,if=virtio,cache=none \
    -kernel /vm/vm-4/vmlinuz \
    -append "console=ttyS0,115200 root=/dev/sda1 numa=fake=4"
$

And on the host I have /vm on an ssd that is an XFS filesystem, and I've created /vm/vm-4/vm-4-500TB.img by doing:

$ xfs_io -f -c "truncate 500t" -c "extsize 1m" /vm/vm-4/vm-4-500TB.img

and in the guest the filesystem is created with:

# mkfs.xfs -f -m crc=1,finobt=1 /dev/vdc

And that will create a 500TB filesystem that you can then mount and run fsmark on it, then unmount and run xfs_repair on it. The .config I have on my laptop is from 3.18-rc something, but it should work with just a "make oldconfig" update. It's attached below.
Hopefully this will be sufficient for you, otherwise it'll have to wait until I get home to get the exact configs for you. Cheers, Dave. -- Dave Chinner da...@fromorbit.com # # Automatically generated file; DO NOT EDIT. # Linux/x86 3.18.0-rc1 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT=elf64-x86-64 CONFIG_ARCH_DEFCONFIG=arch/x86/configs/x86_64_defconfig CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_MMU=y CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_X86_64_SMP=y CONFIG_X86_HT=y CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11 CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE= # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION= # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # 
CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME=(none) CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y # CONFIG_FHANDLE is not set CONFIG_USELIB=y CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # #
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Mon, Mar 09, 2015 at 09:02:19PM +, Mel Gorman wrote: On Sun, Mar 08, 2015 at 08:40:25PM +, Mel Gorman wrote: Because if the answer is 'yes', then we can safely say: 'we regressed performance because correctness [not dropping dirty bits] comes before performance'. If the answer is 'no', then we still have a mystery (and a regression) to track down. As a second hack (not to be applied), could we change:

#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

to:

#define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it can be done. More importantly, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just updating the value should crash. For the purposes of testing the idea, I thought the straight-forward option was to break soft dirty page tracking and steal their bit for testing (patch below). Took most of the day to get access to the test machine so tests are not long running and only the autonuma one has completed. And the xfsrepair workload also does not show any benefit from using a different bit either:

                          3.19.0            4.0.0-rc1           4.0.0-rc1           4.0.0-rc1
                          vanilla           vanilla             slowscan-v2r7       protnone-v3r17
Min      real-fsmark     1164.44 (  0.00%)  1157.41 (  0.60%)   1150.38 (  1.21%)   1173.22 ( -0.75%)
Min      syst-fsmark     4016.12 (  0.00%)  3998.06 (  0.45%)   3988.42 (  0.69%)   4037.90 ( -0.54%)
Min      real-xfsrepair   442.64 (  0.00%)   497.64 (-12.43%)    456.87 ( -3.21%)    489.60 (-10.61%)
Min      syst-xfsrepair   194.97 (  0.00%)   500.61 (-156.76%)   263.41 (-35.10%)    544.56 (-179.30%)
Amean    real-fsmark     1166.28 (  0.00%)  1166.63 ( -0.03%)   1155.97 (  0.88%)   1183.19 ( -1.45%)
Amean    syst-fsmark     4025.87 (  0.00%)  4020.94 (  0.12%)   4004.19 (  0.54%)   4061.64 ( -0.89%)
Amean    real-xfsrepair   447.66 (  0.00%)   507.85 (-13.45%)    459.58 ( -2.66%)    498.71 (-11.40%)
Amean    syst-xfsrepair   202.93 (  0.00%)   519.88 (-156.19%)   281.63 (-38.78%)    569.21 (-180.50%)
Stddev   real-fsmark        1.44 (  0.00%)     6.55 (-354.10%)     3.97 (-175.65%)     9.20 (-537.90%)
Stddev   syst-fsmark        9.76 (  0.00%)    16.22 (-66.27%)     15.09 (-54.69%)     17.47 (-79.13%)
Stddev   real-xfsrepair      5.57 (  0.00%)    11.17 (-100.68%)     3.41 ( 38.66%)      6.77 (-21.63%)
Stddev   syst-xfsrepair      5.69 (  0.00%)    13.98 (-145.78%)    19.94 (-250.49%)    20.03 (-252.05%)
CoeffVar real-fsmark        0.12 (  0.00%)     0.56 (-353.96%)     0.34 (-178.11%)     0.78 (-528.79%)
CoeffVar syst-fsmark        0.24 (  0.00%)     0.40 (-66.48%)      0.38 (-55.53%)      0.43 (-77.55%)
CoeffVar real-xfsrepair     1.24 (  0.00%)     2.20 (-76.89%)      0.74 ( 40.25%)      1.36 ( -9.17%)
CoeffVar syst-xfsrepair     2.80 (  0.00%)     2.69 (  4.06%)      7.08 (-152.54%)     3.52 (-25.51%)
Max      real-fsmark     1167.96 (  0.00%)  1171.98 ( -0.34%)   1159.25 (  0.75%)   1195.41 ( -2.35%)
Max      syst-fsmark     4039.20 (  0.00%)  4033.84 (  0.13%)   4024.53 (  0.36%)   4079.45 ( -1.00%)
Max      real-xfsrepair   455.42 (  0.00%)   523.40 (-14.93%)    464.40 ( -1.97%)    505.82 (-11.07%)
Max      syst-xfsrepair   207.94 (  0.00%)   533.37 (-156.50%)   309.38 (-48.78%)    593.62 (-185.48%)
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner da...@fromorbit.com wrote: On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote: What's your virtual environment setup? Kernel config, and virtualization environment to actually get that odd fake NUMA thing happening? I don't have the exact .config with me (test machines at home are shut down because I'm half a world away), but it's pretty much this (copied and munged from a similar test vm on my laptop): [ snip snip ] Ok, I hate debugging by symptoms anyway, so I didn't do any of this, and went back to actually *thinking* about the code instead of trying to reproduce this and figure things out by trial and error. And I think I figured it out. Of course, since I didn't actually test anything, what do I know, but I feel good about it, because I think I can explain why that patch that on the face of it shouldn't change anything actually did. So, the old code just did all those manual page table changes, clearing the present bit and setting the NUMA bit instead. The new code _ostensibly_ does the same, except it clears the present bit and sets the PROTNONE bit instead. However, rather than playing special games with just those two bits, it uses the normal pte accessor functions, and in particular uses vma->vm_page_prot to reset the protections back. Which is a nice cleanup and really makes the code look saner, and does the same thing. Except it really isn't the same thing at all. Why? The protection bits in the page tables are *not* the same as vma->vm_page_prot. Yes, they start out that way, but they don't stay that way. And no, I'm not talking about dirty and accessed bits. The difference? COW. Any private mapping is marked read-only in vma->vm_page_prot, and then the COW (or the initial write) makes it read-write.
And so, when we did - pte = pte_mknonnuma(pte); + /* Make it present again */ + pte = pte_modify(pte, vma-vm_page_prot); + pte = pte_mkyoung(pte); that isn't equivalent at all - it makes the page read-only, because it restores it to its original state. Now, that isn't actually what hurts most, I suspect. Judging by the profiles, we don't suddenly take a lot of new COW faults. No, what hurts most is that the NUMA balancing code does this: /* * Avoid grouping on DSO/COW pages in specific and RO pages * in general, RO pages shouldn't hurt as much anyway since * they can be in shared cache state. */ if (!pte_write(pte)) flags |= TNF_NO_GROUP; and that !pte_write(pte) is basically now *always* true for private mappings (which is 99% of all mappings). In other words, I think the patch unintentionally made the NUMA code basically always do the TNF_NO_GROUP case. I think that a quick hack for testing might be to just replace that !pte_write() with !pte_dirty(), and seeing how that acts. Comments? Linus ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 08, 2015 at 11:35:59AM -0700, Linus Torvalds wrote: On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar mi...@kernel.org wrote: But: As a second hack (not to be applied), could we change: #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL to: #define _PAGE_BIT_PROTNONE (_PAGE_BIT_GLOBAL+1) to double check that the position of the bit does not matter? Agreed. We should definitely try that. Dave?

As Mel has already mentioned, I'm in Boston for LSFMM and don't have access to the test rig I've used to generate this.

Also, is there some sane way for me to actually see this behavior on a regular machine with just a single socket? Dave is apparently running in some fake-numa setup, I'm wondering if this is easy enough to reproduce that I could see it myself.

Should be - I don't actually use 500TB of storage to generate this - 50GB on an SSD is all you need from the storage side. I just use a sparse backing file to make it look like a 500TB device. :P

i.e. create an XFS filesystem on a 500TB sparse file with "mkfs.xfs -d size=500t,file=1 /path/to/file.img", mount it on loopback or as a virtio,cache=none device for the guest vm, and then use fsmark to generate several million files spread across many, many directories, such as:

$ fs_mark -D 1 -S0 -n 10 -s 1 -L 32 \
    -d /mnt/scratch/0 -d /mnt/scratch/1 \
    -d /mnt/scratch/2 -d /mnt/scratch/3 \
    -d /mnt/scratch/4 -d /mnt/scratch/5 \
    -d /mnt/scratch/6 -d /mnt/scratch/7

That should only take a few minutes to run - if you throw 8p at it then it should run at 100k files/s being created. Then unmount and run "xfs_repair -o bhash=101703 /path/to/file.img" on the resultant image file.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner da...@fromorbit.com wrote: Also, is there some sane way for me to actually see this behavior on a regular machine with just a single socket? Dave is apparently running in some fake-numa setup, I'm wondering if this is easy enough to reproduce that I could see it myself. Should be - I don't actually use 500TB of storage to generate this - 50GB on an SSD is all you need from the storage side. I just use a sparse backing file to make it look like a 500TB device. :P What's your virtual environment setup? Kernel config, and virtualization environment to actually get that odd fake NUMA thing happening?

Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 08, 2015 at 08:40:25PM +, Mel Gorman wrote: Because if the answer is 'yes', then we can safely say: 'we regressed performance because correctness [not dropping dirty bits] comes before performance'. If the answer is 'no', then we still have a mystery (and a regression) to track down. As a second hack (not to be applied), could we change:

#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

to:

#define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it can be done. More importantly, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just updating the value should crash. For the purposes of testing the idea, I thought the straight-forward option was to break soft dirty page tracking and steal their bit for testing (patch below). Took most of the day to get access to the test machine so tests are not long running and only the autonuma one has completed:

autonumabench
                                  3.19.0            4.0.0-rc1          4.0.0-rc1          4.0.0-rc1
                                  vanilla           vanilla            slowscan-v2r7      protnone-v3
Time User-NUMA01                25695.96 (  0.00%)  32883.59 (-27.97%)  35288.00 (-37.33%)  35236.21 (-37.13%)
Time User-NUMA01_THEADLOCAL     17404.36 (  0.00%)  17453.20 ( -0.28%)  17765.79 ( -2.08%)  17590.10 ( -1.07%)
Time User-NUMA02                 2037.65 (  0.00%)   2063.70 ( -1.28%)   2063.22 ( -1.25%)   2072.95 ( -1.73%)
Time User-NUMA02_SMT              981.02 (  0.00%)    983.70 ( -0.27%)    976.01 (  0.51%)    983.42 ( -0.24%)
Time System-NUMA01                194.70 (  0.00%)    602.44 (-209.42%)   209.42 ( -7.56%)    737.36 (-278.72%)
Time System-NUMA01_THEADLOCAL      98.52 (  0.00%)     78.10 ( 20.73%)     92.70 (  5.91%)     80.69 ( 18.10%)
Time System-NUMA02                  9.28 (  0.00%)      6.47 ( 30.28%)      6.06 ( 34.70%)      6.63 ( 28.56%)
Time System-NUMA02_SMT              3.79 (  0.00%)      5.06 (-33.51%)      3.39 ( 10.55%)      3.60 (  5.01%)
Time Elapsed-NUMA01               558.84 (  0.00%)    755.96 (-35.27%)    833.63 (-49.17%)    804.50 (-43.96%)
Time Elapsed-NUMA01_THEADLOCAL    382.54 (  0.00%)    382.22 (  0.08%)    395.45 ( -3.37%)    388.12 ( -1.46%)
Time Elapsed-NUMA02                49.83 (  0.00%)     49.38 (  0.90%)     50.21 ( -0.76%)     48.99 (  1.69%)
Time Elapsed-NUMA02_SMT            46.59 (  0.00%)     47.70 ( -2.38%)     48.55 ( -4.21%)     49.50 ( -6.25%)
Time CPU-NUMA01                  4632.00 (  0.00%)   4429.00 (  4.38%)   4258.00 (  8.07%)   4471.00 (  3.48%)
Time CPU-NUMA01_THEADLOCAL       4575.00 (  0.00%)   4586.00 ( -0.24%)   4515.00 (  1.31%)   4552.00 (  0.50%)
Time CPU-NUMA02                  4107.00 (  0.00%)   4191.00 ( -2.05%)   4120.00 ( -0.32%)   4244.00 ( -3.34%)
Time CPU-NUMA02_SMT              2113.00 (  0.00%)   2072.00 (  1.94%)   2017.00 (  4.54%)   1993.00 (  5.68%)

           3.19.0    4.0.0-rc1  4.0.0-rc1      4.0.0-rc1
           vanilla   vanilla    slowscan-v2r7  protnone-v3
User     46119.12   53384.29    56093.11       55882.82
System     306.41     692.14      311.64         828.36
Elapsed   1039.88    1236.87     1328.61        1292.92

So just using a different bit doesn't seem to be it either:

                           3.19.0     4.0.0-rc1  4.0.0-rc1      4.0.0-rc1
                           vanilla    vanilla    slowscan-v2r7  protnone-v3
NUMA alloc hit             1202922    1437560    1472578        1499274
NUMA alloc miss                  0          0          0              0
NUMA interleave hit              0          0          0              0
NUMA alloc local           1200683    1436781    1472226        1498680
NUMA base PTE updates    222840103  304513172  121532313      337431414
NUMA huge PMD updates       434894     594467     237170         658715
NUMA page range updates  445505831  608880276  242963353      674693494
NUMA hint faults            601358     733491     334334         820793
NUMA hint local faults      371571     511530     227171         565003
NUMA hint local percent         61         69         67             68
NUMA pages migrated        7073177   26366701    8607082       31288355

Patch to use a bit other than the global bit for prot none is below.

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 8c7c10802e9c..1f243323693c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -20,16 +20,16 @@
 #define _PAGE_BIT_SOFTW2	10	/* */
 #define _PAGE_BIT_SOFTW3	11	/* */
 #define _PAGE_BIT_PAT_LARGE	12
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 08, 2015 at 11:02:23AM +0100, Ingo Molnar wrote: * Linus Torvalds torva...@linux-foundation.org wrote: On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar mi...@kernel.org wrote: And the patch Dave bisected to is a relatively simple patch. Why not simply revert it to see whether that cures much of the problem? So the problem with that is that pmd_set_numa() and friends simply no longer exist. So we can't just revert that one patch, it's the whole series, and the whole point of the series. Yeah. What confuses me is that the only real change that I can see in that patch is the change to change_huge_pmd(). Everything else is pretty much a 100% equivalent transformation, afaik. Of course, I may be wrong about that, and missing something silly. Well, there's a difference in what we write to the pte:

#define _PAGE_BIT_NUMA      (_PAGE_BIT_GLOBAL+1)
#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

and our expectation was that the two should be equivalent methods from the POV of the NUMA balancing code, right?

Functionally yes but performance-wise no. We are now using the global bit for NUMA faults at the very least. And the changes to change_huge_pmd() were basically re-done differently by subsequent patches anyway. The *only* change I see remaining is that change_huge_pmd() now does

	entry = pmdp_get_and_clear_notify(mm, addr, pmd);
	entry = pmd_modify(entry, newprot);
	set_pmd_at(mm, addr, pmd, entry);

for all changes. It used to do that pmdp_set_numa() for the prot_numa case, which did just

	pmd_t pmd = *pmdp;
	pmd = pmd_mknuma(pmd);
	set_pmd_at(mm, addr, pmdp, pmd);

instead. I don't like the old pmdp_set_numa() because it can drop dirty bits, so I think the old code was actively buggy. Could we, as a silly testing hack not to be applied, write a hack-patch that re-introduces the racy way of setting the NUMA bit, to confirm that it is indeed this difference that changes pte visibility across CPUs enough to create so many more faults?
This was already done and tested by Dave but while it helped, it was not enough. As the approach was inherently unsafe it was dropped and the throttling approach taken. However, the fact it made little difference may indicate that this is somehow related to the global bit being used. Because if the answer is 'yes', then we can safely say: 'we regressed performance because correctness [not dropping dirty bits] comes before performance'. If the answer is 'no', then we still have a mystery (and a regression) to track down. As a second hack (not to be applied), could we change:

#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

to:

#define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it can be done. to double check that the position of the bit does not matter? It's worth checking in case it's a case of how the global bit is treated. However, note that Dave is currently travelling for LSF/MM in Boston and there is a chance he cannot test this week at all. I'm just after landing in the hotel myself. I'll try to find time during one of the breaks tomorrow but if the wireless is too crap then accessing the test machine remotely might be an issue. I don't think we've exhausted all avenues of analysis here. True.
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> Well, there's a difference in what we write to the pte:
>
>  #define _PAGE_BIT_NUMA       (_PAGE_BIT_GLOBAL+1)
>  #define _PAGE_BIT_PROTNONE   _PAGE_BIT_GLOBAL
>
> and our expectation was that the two should be equivalent methods
> from the POV of the NUMA balancing code, right?

Right. But yes, we might have screwed something up. In particular,
there might be something that thinks it cares about the global bit,
but doesn't notice that the present bit isn't set, so it considers the
protnone mappings to be global and causes lots more tlb flushes etc.

> > I don't like the old pmdp_set_numa() because it can drop dirty
> > bits, so I think the old code was actively buggy.
>
> Could we, as a silly testing hack not to be applied, write a
> hack-patch that re-introduces the racy way of setting the NUMA bit,
> to confirm that it is indeed this difference that changes pte
> visibility across CPUs enough to create so many more faults?

So one of Mel's patches did that, but I don't know if Dave tested it.

And thinking about it, it *may* be safe for huge-pages, if they always
already have the dirty bit set to begin with. And I don't see how we
could have a clean hugepage (apart from the special case of the
zeropage, which is read-only, so races on the dirty bit aren't an
issue). So it might actually be that the non-atomic version is safe
for hpages. And we could possibly get rid of the atomic read-and-clear
even for the non-numa case. I'd rather do it for both cases than for
just one of them. But:

> As a second hack (not to be applied), could we change:
>
>  #define _PAGE_BIT_PROTNONE   _PAGE_BIT_GLOBAL
>
> to:
>
>  #define _PAGE_BIT_PROTNONE   (_PAGE_BIT_GLOBAL+1)
>
> to double check that the position of the bit does not matter?

Agreed. We should definitely try that. Dave?

Also, is there some sane way for me to actually see this behavior on a
regular machine with just a single socket? Dave is apparently running
in some fake-numa setup; I'm wondering if this is easy enough to
reproduce that I could see it myself.

                     Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sun, Mar 8, 2015 at 11:35 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
> > As a second hack (not to be applied), could we change:
> >
> >  #define _PAGE_BIT_PROTNONE   _PAGE_BIT_GLOBAL
> >
> > to:
> >
> >  #define _PAGE_BIT_PROTNONE   (_PAGE_BIT_GLOBAL+1)
> >
> > to double check that the position of the bit does not matter?
>
> Agreed. We should definitely try that.

There's a second reason to do that, actually: the __supported_pte_mask
thing, _and_ the pageattr stuff in __split_large_page() etc play games
with _PAGE_GLOBAL. As does drivers/lguest for some reason.

So looking at this all, there's a lot of room for confusion with
_PAGE_GLOBAL. That kind of confusion would certainly explain the whole
"the changes _look_ like they do the same thing, but don't" mystery -
because of silly semantic conflicts with PROTNONE vs GLOBAL.

                     Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
* Mel Gorman <mgor...@suse.de> wrote:

> Elapsed time is primarily worse on one benchmark -- numa01, which is
> an adverse workload. The user time differences are also dominated by
> that benchmark
>
>                                          4.0.0-rc1           4.0.0-rc1              3.19.0
>                                            vanilla       slowscan-v2r7             vanilla
> Time User-NUMA01             32883.59 (  0.00%)  35288.00 ( -7.31%)  25695.96 ( 21.86%)
> Time User-NUMA01_THEADLOCAL  17453.20 (  0.00%)  17765.79 ( -1.79%)  17404.36 (  0.28%)
> Time User-NUMA02              2063.70 (  0.00%)   2063.22 (  0.02%)   2037.65 (  1.26%)
> Time User-NUMA02_SMT           983.70 (  0.00%)    976.01 (  0.78%)    981.02 (  0.27%)

But even for 'numa02', the simplest of the workloads, there appears to
be some of a regression relative to v3.19, which ought to be beyond
the noise of the measurement (which would be below 1% I suspect), and
as such relevant, right?

And the XFS numbers still show significant regression compared to
v3.19 - and that cannot be ignored as an artificial, 'adversarial'
workload, right? For example, from your numbers:

> xfsrepair
>                           4.0.0-rc1            4.0.0-rc1               3.19.0
>                             vanilla          slowscan-v2              vanilla
> ...
> Amean    real-xfsrepair   507.85 (  0.00%)   459.58 (   9.50%)   447.66 (  11.85%)
> Amean    syst-xfsrepair   519.88 (  0.00%)   281.63 (  45.83%)   202.93 (  60.97%)

If I interpret the numbers correctly, it shows that compared to v3.19,
system time increased by 38% - which is rather significant!

> > So what worries me is that Dave bisected the regression to:
> >
> >   4d9424669946 (mm: convert p[te|md]_mknonnuma and remaining page
> >   table manipulations)
> >
> > And clearly your patch #4 just tunes balancing/migration intensity
> > - is that a workaround for the real problem/bug?
>
> The patch makes NUMA hinting faults use standard page table handling
> routines and protections to trap the faults. Fundamentally it's
> safer even though it appears to cause more traps to be handled. I've
> been assuming this is related to the different permissions PTEs get
> and when they are visible on all CPUs. This patch is addressing the
> symptom that more faults are being handled by making scanning less
> aggressive.

But the whole cleanup ought to have been close to an identity
transformation from the CPU's point of view - and your measurements
seem to confirm Dave's findings. And your measurement was on bare
metal, while Dave's is on a VM, and both show a significant slowdown
on the xfs tests even with your slow-tuning patch applied, so it's
unlikely to be a measurement fluke or some weird platform property.

> I've gone through that patch and didn't spot anything else that it
> is doing wrong that is not already handled in this series. Did you
> spot anything obviously wrong in that patch that isn't addressed in
> this series?

I didn't spot anything wrong, but is that a basis to go forward and
work around the regression, in a way that doesn't even recover the
lost performance?

> > And the patch Dave bisected to is a relatively simple patch. Why
> > not simply revert it to see whether that cures much of the
> > problem?
>
> Because it also means reverting all the PROT_NONE handling and going
> back to _PAGE_NUMA tricks which I expect would be NAKed by Linus.

Yeah, I realize that (and obviously I support the PROT_NONE direction
that Peter Zijlstra prototyped with the original sched/numa series),
but can we leave this much of a regression on the table?

I hate to be such a pain in the neck, but especially the 'down tuning'
of the scanning intensity will make an apples to apples comparison
harder! I'd rather not do the slow-tuning part and leave sucky
performance in place for now, and have an easy method plus the
motivation to find and fix the real cause of the regression, than to
partially hide it this way ...

Thanks,

	Ingo
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
* Mel Gorman <mgor...@suse.de> wrote:

> xfsrepair
>                              4.0.0-rc1            4.0.0-rc1               3.19.0
>                                vanilla          slowscan-v2              vanilla
> Min      real-fsmark     1157.41 (  0.00%)  1150.38 (   0.61%)  1164.44 (  -0.61%)
> Min      syst-fsmark     3998.06 (  0.00%)  3988.42 (   0.24%)  4016.12 (  -0.45%)
> Min      real-xfsrepair   497.64 (  0.00%)   456.87 (   8.19%)   442.64 (  11.05%)
> Min      syst-xfsrepair   500.61 (  0.00%)   263.41 (  47.38%)   194.97 (  61.05%)
> Amean    real-fsmark     1166.63 (  0.00%)  1155.97 (   0.91%)  1166.28 (   0.03%)
> Amean    syst-fsmark     4020.94 (  0.00%)  4004.19 (   0.42%)  4025.87 (  -0.12%)
> Amean    real-xfsrepair   507.85 (  0.00%)   459.58 (   9.50%)   447.66 (  11.85%)
> Amean    syst-xfsrepair   519.88 (  0.00%)   281.63 (  45.83%)   202.93 (  60.97%)
> Stddev   real-fsmark        6.55 (  0.00%)     3.97 (  39.30%)     1.44 (  77.98%)
> Stddev   syst-fsmark       16.22 (  0.00%)    15.09 (   6.96%)     9.76 (  39.86%)
> Stddev   real-xfsrepair    11.17 (  0.00%)     3.41 (  69.43%)     5.57 (  50.17%)
> Stddev   syst-xfsrepair    13.98 (  0.00%)    19.94 ( -42.60%)     5.69 (  59.31%)
> CoeffVar real-fsmark        0.56 (  0.00%)     0.34 (  38.74%)     0.12 (  77.97%)
> CoeffVar syst-fsmark        0.40 (  0.00%)     0.38 (   6.57%)     0.24 (  39.93%)
> CoeffVar real-xfsrepair     2.20 (  0.00%)     0.74 (  66.22%)     1.24 (  43.47%)
> CoeffVar syst-xfsrepair     2.69 (  0.00%)     7.08 (-163.23%)     2.80 (  -4.23%)
> Max      real-fsmark     1171.98 (  0.00%)  1159.25 (   1.09%)  1167.96 (   0.34%)
> Max      syst-fsmark     4033.84 (  0.00%)  4024.53 (   0.23%)  4039.20 (  -0.13%)
> Max      real-xfsrepair   523.40 (  0.00%)   464.40 (  11.27%)   455.42 (  12.99%)
> Max      syst-xfsrepair   533.37 (  0.00%)   309.38 (  42.00%)   207.94 (  61.01%)

Btw., I think it would be nice if these numbers listed v3.19
performance in the first column, to make it clear at a glance how much
regression we still have?

Thanks,

	Ingo
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
* Linus Torvalds <torva...@linux-foundation.org> wrote:

> On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mi...@kernel.org> wrote:
> >
> > And the patch Dave bisected to is a relatively simple patch. Why
> > not simply revert it to see whether that cures much of the problem?
>
> So the problem with that is that pmd_set_numa() and friends simply
> no longer exist. So we can't just revert that one patch, it's the
> whole series, and the whole point of the series.

Yeah.

> What confuses me is that the only real change that I can see in that
> patch is the change to change_huge_pmd(). Everything else is pretty
> much a 100% equivalent transformation, afaik. Of course, I may be
> wrong about that, and missing something silly.

Well, there's a difference in what we write to the pte:

 #define _PAGE_BIT_NUMA       (_PAGE_BIT_GLOBAL+1)
 #define _PAGE_BIT_PROTNONE   _PAGE_BIT_GLOBAL

and our expectation was that the two should be equivalent methods from
the POV of the NUMA balancing code, right?

> And the changes to change_huge_pmd() were basically re-done
> differently by subsequent patches anyway. The *only* change I see
> remaining is that change_huge_pmd() now does
>
>    entry = pmdp_get_and_clear_notify(mm, addr, pmd);
>    entry = pmd_modify(entry, newprot);
>    set_pmd_at(mm, addr, pmd, entry);
>
> for all changes. It used to do that pmdp_set_numa() for the
> prot_numa case, which did just
>
>    pmd_t pmd = *pmdp;
>    pmd = pmd_mknuma(pmd);
>    set_pmd_at(mm, addr, pmdp, pmd);
>
> instead. I don't like the old pmdp_set_numa() because it can drop
> dirty bits, so I think the old code was actively buggy.

Could we, as a silly testing hack not to be applied, write a
hack-patch that re-introduces the racy way of setting the NUMA bit, to
confirm that it is indeed this difference that changes pte visibility
across CPUs enough to create so many more faults?

Because if the answer is 'yes', then we can safely say: 'we regressed
performance because correctness [not dropping dirty bits] comes before
performance'.

If the answer is 'no', then we still have a mystery (and a regression)
to track down.

As a second hack (not to be applied), could we change:

 #define _PAGE_BIT_PROTNONE   _PAGE_BIT_GLOBAL

to:

 #define _PAGE_BIT_PROTNONE   (_PAGE_BIT_GLOBAL+1)

to double check that the position of the bit does not matter?

I don't think we've exhausted all avenues of analysis here.

Thanks,

	Ingo
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> And the patch Dave bisected to is a relatively simple patch. Why not
> simply revert it to see whether that cures much of the problem?

So the problem with that is that pmd_set_numa() and friends simply no
longer exist. So we can't just revert that one patch, it's the whole
series, and the whole point of the series.

What confuses me is that the only real change that I can see in that
patch is the change to change_huge_pmd(). Everything else is pretty
much a 100% equivalent transformation, afaik. Of course, I may be
wrong about that, and missing something silly.

And the changes to change_huge_pmd() were basically re-done
differently by subsequent patches anyway. The *only* change I see
remaining is that change_huge_pmd() now does

   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
   entry = pmd_modify(entry, newprot);
   set_pmd_at(mm, addr, pmd, entry);

for all changes. It used to do that pmdp_set_numa() for the prot_numa
case, which did just

   pmd_t pmd = *pmdp;
   pmd = pmd_mknuma(pmd);
   set_pmd_at(mm, addr, pmdp, pmd);

instead. I don't like the old pmdp_set_numa() because it can drop
dirty bits, so I think the old code was actively buggy. But I do *not*
see why the new code would cause more migrations to happen.

There's probably something really stupid I'm missing.

                     Linus
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
On Sat, Mar 07, 2015 at 05:36:58PM +0100, Ingo Molnar wrote:
> * Mel Gorman <mgor...@suse.de> wrote:
>
> > Dave Chinner reported the following on
> > https://lkml.org/lkml/2015/3/1/226
> >
> >   Across the board the 4.0-rc1 numbers are much slower, and the
> >   degradation is far worse when using the large memory footprint
> >   configs. Perf points straight at the cause - this is from 4.0-rc1
> >   on the -o bhash=101073 config:
> >
> > [...]
> >
> >                4.0.0-rc1    4.0.0-rc1      3.19.0
> >                  vanilla  slowscan-v2     vanilla
> > User            53384.29     56093.11    46119.12
> > System            692.14       311.64      306.41
> > Elapsed          1236.87      1328.61     1039.88
> >
> > Note that the system CPU usage is now similar to 3.19-vanilla.
>
> Similar, but still worse, and also the elapsed time is still much
> worse. User time is much higher, although it's the same amount of
> work done on every kernel, right?

Elapsed time is primarily worse on one benchmark -- numa01, which is
an adverse workload. The user time differences are also dominated by
that benchmark

                                         4.0.0-rc1           4.0.0-rc1              3.19.0
                                           vanilla       slowscan-v2r7             vanilla
Time User-NUMA01             32883.59 (  0.00%)  35288.00 ( -7.31%)  25695.96 ( 21.86%)
Time User-NUMA01_THEADLOCAL  17453.20 (  0.00%)  17765.79 ( -1.79%)  17404.36 (  0.28%)
Time User-NUMA02              2063.70 (  0.00%)   2063.22 (  0.02%)   2037.65 (  1.26%)
Time User-NUMA02_SMT           983.70 (  0.00%)    976.01 (  0.78%)    981.02 (  0.27%)

I also tested with a workload very similar to Dave's. The machine
configuration and storage is completely different so it's not an
equivalent test unfortunately. It's reporting the elapsed time and CPU
time while fsmark is running to create the inodes and when running
xfsrepair afterwards

xfsrepair
                             4.0.0-rc1            4.0.0-rc1               3.19.0
                               vanilla          slowscan-v2              vanilla
Min      real-fsmark     1157.41 (  0.00%)  1150.38 (   0.61%)  1164.44 (  -0.61%)
Min      syst-fsmark     3998.06 (  0.00%)  3988.42 (   0.24%)  4016.12 (  -0.45%)
Min      real-xfsrepair   497.64 (  0.00%)   456.87 (   8.19%)   442.64 (  11.05%)
Min      syst-xfsrepair   500.61 (  0.00%)   263.41 (  47.38%)   194.97 (  61.05%)
Amean    real-fsmark     1166.63 (  0.00%)  1155.97 (   0.91%)  1166.28 (   0.03%)
Amean    syst-fsmark     4020.94 (  0.00%)  4004.19 (   0.42%)  4025.87 (  -0.12%)
Amean    real-xfsrepair   507.85 (  0.00%)   459.58 (   9.50%)   447.66 (  11.85%)
Amean    syst-xfsrepair   519.88 (  0.00%)   281.63 (  45.83%)   202.93 (  60.97%)
Stddev   real-fsmark        6.55 (  0.00%)     3.97 (  39.30%)     1.44 (  77.98%)
Stddev   syst-fsmark       16.22 (  0.00%)    15.09 (   6.96%)     9.76 (  39.86%)
Stddev   real-xfsrepair    11.17 (  0.00%)     3.41 (  69.43%)     5.57 (  50.17%)
Stddev   syst-xfsrepair    13.98 (  0.00%)    19.94 ( -42.60%)     5.69 (  59.31%)
CoeffVar real-fsmark        0.56 (  0.00%)     0.34 (  38.74%)     0.12 (  77.97%)
CoeffVar syst-fsmark        0.40 (  0.00%)     0.38 (   6.57%)     0.24 (  39.93%)
CoeffVar real-xfsrepair     2.20 (  0.00%)     0.74 (  66.22%)     1.24 (  43.47%)
CoeffVar syst-xfsrepair     2.69 (  0.00%)     7.08 (-163.23%)     2.80 (  -4.23%)
Max      real-fsmark     1171.98 (  0.00%)  1159.25 (   1.09%)  1167.96 (   0.34%)
Max      syst-fsmark     4033.84 (  0.00%)  4024.53 (   0.23%)  4039.20 (  -0.13%)
Max      real-xfsrepair   523.40 (  0.00%)   464.40 (  11.27%)   455.42 (  12.99%)
Max      syst-xfsrepair   533.37 (  0.00%)   309.38 (  42.00%)   207.94 (  61.01%)

The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
is almost cut in half. It's still not as low as 3.19-vanilla but it's
much closer

                           4.0.0-rc1    4.0.0-rc1       3.19.0
                             vanilla  slowscan-v2      vanilla
NUMA alloc hit             146138883    121929782    104019526
NUMA alloc miss             13146328     11456356      7806370
NUMA interleave hit                0            0            0
NUMA alloc local           146060848    121865921    103953085
NUMA base PTE updates      242201535    117237258    216624143
NUMA huge PMD updates         113270        52121       127782
NUMA page range updates    300195775    143923210    282048527
NUMA hint faults           180388025     87299060    147235021
NUMA hint local faults      72784532     32939258     61866265
NUMA hint local percent           40           37           42
NUMA pages migrated         71175262     41395302     23237799

Note the big differences in faults trapped and pages migrated.
3.19-vanilla still migrated fewer pages but if necessary the threshold
at which we start throttling migrations can be lowered.
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
* Mel Gorman <mgor...@suse.de> wrote:

> Dave Chinner reported the following on
> https://lkml.org/lkml/2015/3/1/226
>
>   Across the board the 4.0-rc1 numbers are much slower, and the
>   degradation is far worse when using the large memory footprint
>   configs. Perf points straight at the cause - this is from 4.0-rc1
>   on the -o bhash=101073 config:
>
> [...]
>
>                4.0.0-rc1    4.0.0-rc1      3.19.0
>                  vanilla  slowscan-v2     vanilla
> User            53384.29     56093.11    46119.12
> System            692.14       311.64      306.41
> Elapsed          1236.87      1328.61     1039.88
>
> Note that the system CPU usage is now similar to 3.19-vanilla.

Similar, but still worse, and also the elapsed time is still much
worse. User time is much higher, although it's the same amount of work
done on every kernel, right?

> I also tested with a workload very similar to Dave's. The machine
> configuration and storage is completely different so it's not an
> equivalent test unfortunately. It's reporting the elapsed time and
> CPU time while fsmark is running to create the inodes and when
> running xfsrepair afterwards
>
> xfsrepair
>                              4.0.0-rc1            4.0.0-rc1               3.19.0
>                                vanilla          slowscan-v2              vanilla
> Min      real-fsmark     1157.41 (  0.00%)  1150.38 (   0.61%)  1164.44 (  -0.61%)
> Min      syst-fsmark     3998.06 (  0.00%)  3988.42 (   0.24%)  4016.12 (  -0.45%)
> Min      real-xfsrepair   497.64 (  0.00%)   456.87 (   8.19%)   442.64 (  11.05%)
> Min      syst-xfsrepair   500.61 (  0.00%)   263.41 (  47.38%)   194.97 (  61.05%)
> Amean    real-fsmark     1166.63 (  0.00%)  1155.97 (   0.91%)  1166.28 (   0.03%)
> Amean    syst-fsmark     4020.94 (  0.00%)  4004.19 (   0.42%)  4025.87 (  -0.12%)
> Amean    real-xfsrepair   507.85 (  0.00%)   459.58 (   9.50%)   447.66 (  11.85%)
> Amean    syst-xfsrepair   519.88 (  0.00%)   281.63 (  45.83%)   202.93 (  60.97%)
> Stddev   real-fsmark        6.55 (  0.00%)     3.97 (  39.30%)     1.44 (  77.98%)
> Stddev   syst-fsmark       16.22 (  0.00%)    15.09 (   6.96%)     9.76 (  39.86%)
> Stddev   real-xfsrepair    11.17 (  0.00%)     3.41 (  69.43%)     5.57 (  50.17%)
> Stddev   syst-xfsrepair    13.98 (  0.00%)    19.94 ( -42.60%)     5.69 (  59.31%)
> CoeffVar real-fsmark        0.56 (  0.00%)     0.34 (  38.74%)     0.12 (  77.97%)
> CoeffVar syst-fsmark        0.40 (  0.00%)     0.38 (   6.57%)     0.24 (  39.93%)
> CoeffVar real-xfsrepair     2.20 (  0.00%)     0.74 (  66.22%)     1.24 (  43.47%)
> CoeffVar syst-xfsrepair     2.69 (  0.00%)     7.08 (-163.23%)     2.80 (  -4.23%)
> Max      real-fsmark     1171.98 (  0.00%)  1159.25 (   1.09%)  1167.96 (   0.34%)
> Max      syst-fsmark     4033.84 (  0.00%)  4024.53 (   0.23%)  4039.20 (  -0.13%)
> Max      real-xfsrepair   523.40 (  0.00%)   464.40 (  11.27%)   455.42 (  12.99%)
> Max      syst-xfsrepair   533.37 (  0.00%)   309.38 (  42.00%)   207.94 (  61.01%)
>
> The key point is that system CPU usage for xfsrepair
> (syst-xfsrepair) is almost cut in half. It's still not as low as
> 3.19-vanilla but it's much closer
>
>                            4.0.0-rc1    4.0.0-rc1       3.19.0
>                              vanilla  slowscan-v2      vanilla
> NUMA alloc hit             146138883    121929782    104019526
> NUMA alloc miss             13146328     11456356      7806370
> NUMA interleave hit                0            0            0
> NUMA alloc local           146060848    121865921    103953085
> NUMA base PTE updates      242201535    117237258    216624143
> NUMA huge PMD updates         113270        52121       127782
> NUMA page range updates    300195775    143923210    282048527
> NUMA hint faults           180388025     87299060    147235021
> NUMA hint local faults      72784532     32939258     61866265
> NUMA hint local percent           40           37           42
> NUMA pages migrated         71175262     41395302     23237799
>
> Note the big differences in faults trapped and pages migrated.
> 3.19-vanilla still migrated fewer pages but if necessary the
> threshold at which we start throttling migrations can be lowered.

This too is still worse than what v3.19 had.

So what worries me is that Dave bisected the regression to:

  4d9424669946 (mm: convert p[te|md]_mknonnuma and remaining page
  table manipulations)

And clearly your patch #4 just tunes balancing/migration intensity -
is that a workaround for the real problem/bug?

And the patch Dave bisected to is a relatively simple patch. Why not
simply revert it to see whether that cures much of the problem?

Am I missing something fundamental?

Thanks,

	Ingo