Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-23 Thread Mel Gorman
On Fri, Mar 20, 2015 at 10:02:23AM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner da...@fromorbit.com wrote:
 
  Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
  7m36s. IOWs, it's a bit better, but not significantly. Page migrations
  are pretty much unchanged, too:
 
 558,632  migrate:mm_migrate_pages ( +-  6.38% )
 
 Ok. That was kind of the expected thing.
 
 I don't really know the NUMA fault rate limiting code, but one thing
 that strikes me is that if it tries to balance the NUMA faults against
 the *regular* faults, then maybe just the fact that we end up taking
 more COW faults after a NUMA fault then means that the NUMA rate
 limiting code now gets over-eager (because it sees all those extra
 non-numa faults).
 
 Mel, does that sound at all possible? I really have never looked at
 the magic automatic rate handling..
 

It should not be trying to balance against regular faults as it has no
information on them. The trapping of additional faults to mark the PTE
writable will alter timing so it indirectly affects how many migration
faults there are, but this is only a side-effect IMO.

There is more overhead now due to losing the writable information and
that should be reduced so I tried a few approaches.  Ultimately, the one
that performed the best and was easiest to understand simply preserved
the writable bit across the protection update and page fault. I'll post
it later when I stick a changelog on it.

-- 
Mel Gorman
SUSE Labs
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-20 Thread Mel Gorman
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
 And the VM_WRITE test should be stable and not have any subtle
 interaction with the other changes that the numa pte things
 introduced. It would be good to see if the profiles then pop something
 *else* up as the performance difference (which I'm sure will remain,
 since the 7m50s was so far off).
 

As a side-note, I did test a patch that checked pte_write and preserved
it across both faults and setting the protections. It did not alter
migration activity much, but there was a drop in minor faults - a 20% drop
in autonumabench and a 58% drop in the xfsrepair workload. I'm assuming this
is due to refaults to mark pages writable. The patch is hacky so I won't
post it to save people bleaching their eyes. I'll spend some time soon
(hopefully today) on a smooth way of falling through to WP checks after
trapping a NUMA fault.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-20 Thread Mel Gorman
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
 
  My recollection wasn't faulty - I pulled it from an earlier email.
  That said, the original measurement might have been faulty. I ran
  the numbers again on the 3.19 kernel I saved away from the original
  testing. That came up at 235k, which is pretty much the same as
  yesterday's test. The runtime, however, is unchanged from my original
  measurements of 4m54s (pte_hack came in at 5m20s).
 
 Ok. Good. So the more than an order of magnitude difference was
 really about measurement differences, not quite as real. Looks like
 more a factor of two than a factor of 20.
 
 Did you do the profiles the same way? Because that would explain the
 differences in the TLB flush percentages too (the 1.4% from
 tlb_invalidate_range() vs pretty much everything from migration).
 
 The runtime variation does show that there's some *big* subtle
 difference for the numa balancing in the exact TNF_NO_GROUP details.

TNF_NO_GROUP affects whether the scheduler tries to group related processes
together. Whether migration occurs depends on what node a process is
scheduled on. If processes are aggressively grouped inappropriately then it
is possible there is a bug that causes the load balancer to move processes
off a node (possible migration) with NUMA balancing trying to pull them back
(another possible migration). Small bugs there can result in excessive
migration.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-20 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 9:13 PM, Dave Chinner da...@fromorbit.com wrote:

 Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
 7m36s. IOWs, it's a bit better, but not significantly. Page migrations
 are pretty much unchanged, too:

558,632  migrate:mm_migrate_pages ( +-  6.38% )

Ok. That was kind of the expected thing.

I don't really know the NUMA fault rate limiting code, but one thing
that strikes me is that if it tries to balance the NUMA faults against
the *regular* faults, then maybe just the fact that we end up taking
more COW faults after a NUMA fault then means that the NUMA rate
limiting code now gets over-eager (because it sees all those extra
non-numa faults).

Mel, does that sound at all possible? I really have never looked at
the magic automatic rate handling..

 Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 02:41:48PM -0700, Linus Torvalds wrote:
 On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  So I think there's something I'm missing. For non-shared mappings, I
  still have the idea that pte_dirty should be the same as pte_write.
  And yet, your testing of 3.19 shows that it's a big difference.
  There's clearly something I'm completely missing.
 
 Ahh. The normal page table scanning and page fault handling both clear
 and set the dirty bit together with the writable one. But fork()
 will clear the writable bit without clearing dirty. For some reason I
 thought it moved the dirty bit into the struct page like the VM
 scanning does, but that was just me having a brainfart. So yeah,
 pte_dirty doesn't have to match pte_write even under perfectly normal
 circumstances. Maybe there are other cases.
 
 Not that I see a lot of forking in the xfs repair case either, so..
 
 Dave, mind re-running the plain 3.19 numbers to really verify that the
 pte_dirty/pte_write change really made that big of a difference. Maybe
 your recollection of ~55,000 migrate_pages events was faulty. If the
 pte_write -> pte_dirty change is the *only* difference, it's still very
 odd how that one difference would make the migrate rate go from ~55k to
 471k. That's an order of magnitude difference, for what really
 shouldn't be a big change.

My recollection wasn't faulty - I pulled it from an earlier email.
That said, the original measurement might have been faulty. I ran
the numbers again on the 3.19 kernel I saved away from the original
testing. That came up at 235k, which is pretty much the same as
yesterday's test. The runtime, however, is unchanged from my original
measurements of 4m54s (pte_hack came in at 5m20s).

Wondering where the 55k number came from, I played around with when
I started the measurement - all the numbers since I did the bisect
have come from starting it at roughly 130AGs into phase 3 where the
memory footprint stabilises and the tlb flush overhead kicks in.

However, if I start the measurement at the same time as the repair
test, I get something much closer to the 55k number. I also note
that my original 4.0-rc1 numbers were much lower than the more
recent steady state measurements (360k vs 470k), so I'd say the
original numbers weren't representative of the steady state
behaviour and so can be ignored...

 Maybe a system update has changed libraries and memory allocation
 patterns, and there is something bigger than that one-liner
 pte_dirty/write change going on?

Possibly. The xfs_repair binary has definitely been rebuilt (testing
unrelated bug fixes that only affect phase 6/7 behaviour), but
otherwise the system libraries are unchanged.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:
 
  My recollection wasn't faulty - I pulled it from an earlier email.
  That said, the original measurement might have been faulty. I ran
  the numbers again on the 3.19 kernel I saved away from the original
  testing. That came up at 235k, which is pretty much the same as
  yesterday's test. The runtime, however, is unchanged from my original
  measurements of 4m54s (pte_hack came in at 5m20s).
 
 Ok. Good. So the more than an order of magnitude difference was
 really about measurement differences, not quite as real. Looks like
 more a factor of two than a factor of 20.
 
 Did you do the profiles the same way? Because that would explain the
 differences in the TLB flush percentages too (the 1.4% from
 tlb_invalidate_range() vs pretty much everything from migration).

No, the profiles all came from steady state. The profiles from the
initial startup phase hammer the mmap_sem because of page fault vs
mprotect contention (glibc runs mprotect() on every chunk of
memory it allocates). It's not until the cache is full and it
starts recycling old buffers rather than allocating new ones that
the tlb flush problem dominates the profiles.

 The runtime variation does show that there's some *big* subtle
 difference for the numa balancing in the exact TNF_NO_GROUP details.
 It must be *very* unstable for it to make that big of a difference.
 But I feel at least a *bit* better about an unstable algorithm changing a
 small variation into a factor-of-two vs that crazy factor-of-20.
 
 Can you try Mel's change to make it use
 
 if (!(vma->vm_flags & VM_WRITE))
 
 instead of the pte details? Again, on otherwise plain 3.19, just so
 that we have a baseline. I'd be *so* much happier with checking the vma
 details over per-pte details, especially ones that change over the
 lifetime of the pte entry, and the NUMA code explicitly mucks with.

Yup, will do. might take an hour or two before I get to it, though...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 04:05:46PM -0700, Linus Torvalds wrote:
 Can you try Mel's change to make it use
 
 if (!(vma->vm_flags & VM_WRITE))
 
 instead of the pte details? Again, on otherwise plain 3.19, just so
 that we have a baseline. I'd be *so* much happer with checking the vma
 details over per-pte details, especially ones that change over the
 lifetime of the pte entry, and the NUMA code explicitly mucks with.

$ sudo perf_3.18 stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

266,750  migrate:mm_migrate_pages ( +-  7.43% )

  10.002032292 seconds time elapsed ( +-  0.00% )

Bit more variance there than the pte checking, but runtime
difference is in the noise - 5m4s vs 4m54s - and profiles are
identical to the pte checking version.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:

 Bit more variance there than the pte checking, but runtime
 difference is in the noise - 5m4s vs 4m54s - and profiles are
 identical to the pte checking version.

Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
as the original !pte_write() test.

Now, can you check that on top of rc4? If I've gotten everything
right, we now have:

 - plain 3.19 (pte_write): 4m54s
 - 3.19 with vm_flags  VM_WRITE: 5m4s
 - 3.19 with pte_dirty: 5m20s

so the pte_dirty version seems to have been a bad choice indeed.

For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
_much_ worse, but I'm wondering whether that VM_WRITE test will at
least shrink the difference like it does for 3.19.

And the VM_WRITE test should be stable and not have any subtle
interaction with the other changes that the numa pte things
introduced. It would be good to see if the profiles then pop something
*else* up as the performance difference (which I'm sure will remain,
since the 7m50s was so far off).

Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Dave Chinner
On Thu, Mar 19, 2015 at 06:29:47PM -0700, Linus Torvalds wrote:
 On Thu, Mar 19, 2015 at 5:23 PM, Dave Chinner da...@fromorbit.com wrote:
 
  Bit more variance there than the pte checking, but runtime
  difference is in the noise - 5m4s vs 4m54s - and profiles are
  identical to the pte checking version.
 
 Ahh, so that !(vma->vm_flags & VM_WRITE) test works _almost_ as well
 as the original !pte_write() test.
 
 Now, can you check that on top of rc4? If I've gotten everything
 right, we now have:
 
  - plain 3.19 (pte_write): 4m54s
  - 3.19 with vm_flags  VM_WRITE: 5m4s
  - 3.19 with pte_dirty: 5m20s

*nod*

 so the pte_dirty version seems to have been a bad choice indeed.
 
 For 4.0-rc4, (which uses pte_dirty) you had 7m50s, so it's still
 _much_ worse, but I'm wondering whether that VM_WRITE test will at
 least shrink the difference like it does for 3.19.

Testing now. It's a bit faster - three runs gave 7m35s, 7m20s and
7m36s. IOWs, it's a bit better, but not significantly. Page migrations
are pretty much unchanged, too:

   558,632  migrate:mm_migrate_pages ( +-  6.38% )

 And the VM_WRITE test should be stable and not have any subtle
 interaction with the other changes that the numa pte things
 introduced. It would be good to see if the profiles then pop something
 *else* up as the performance difference (which I'm sure will remain,
 since the 7m50s was so far off).

No, nothing new pops up in the kernel profiles. All the system CPU
time is still being spent sending IPIs on the tlb flush path.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 3:41 PM, Dave Chinner da...@fromorbit.com wrote:

 My recollection wasn't faulty - I pulled it from an earlier email.
 That said, the original measurement might have been faulty. I ran
 the numbers again on the 3.19 kernel I saved away from the original
 testing. That came up at 235k, which is pretty much the same as
 yesterday's test. The runtime, however, is unchanged from my original
 measurements of 4m54s (pte_hack came in at 5m20s).

Ok. Good. So the more than an order of magnitude difference was
really about measurement differences, not quite as real. Looks like
more a factor of two than a factor of 20.

Did you do the profiles the same way? Because that would explain the
differences in the TLB flush percentages too (the 1.4% from
tlb_invalidate_range() vs pretty much everything from migration).

The runtime variation does show that there's some *big* subtle
difference for the numa balancing in the exact TNF_NO_GROUP details.
It must be *very* unstable for it to make that big of a difference.
But I feel at least a *bit* better about an unstable algorithm changing a
small variation into a factor-of-two vs that crazy factor-of-20.

Can you try Mel's change to make it use

if (!(vma->vm_flags & VM_WRITE))

instead of the pte details? Again, on otherwise plain 3.19, just so
that we have a baseline. I'd be *so* much happier with checking the vma
details over per-pte details, especially ones that change over the
lifetime of the pte entry, and the NUMA code explicitly mucks with.

   Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Mel Gorman
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
   - something completely different that I am entirely missing
 
 So I think there's something I'm missing. For non-shared mappings, I
 still have the idea that pte_dirty should be the same as pte_write.
 And yet, your testing of 3.19 shows that it's a big difference.
 There's clearly something I'm completely missing.
 

Minimally, there is still the window where we clear the PTE to set the
protections. During that window, a fault can occur. In the old code which
was inherently racy and unsafe, the fault might still go ahead deferring
a potential migration for a short period. In the current code, it'll stall
on the lock, notice the PTE has changed and refault, so the overhead is very
different but the behaviour is functionally correct.

In the old code, pte_write had complex interactions with background
cleaning and sync in the case of file mappings (not applicable to Dave's
case, but it's still unpredictable behaviour). pte_dirty is close but there
are interactions with the application, as the timing of writes vs the PTE
scanner matters.

Even if we restored the original behaviour, it would still be very difficult
to understand all the interactions between userspace and kernel.  The patch
below should be tested because it's clearer what the intent is. Using
the VMA flags is coarse but it's not vulnerable to timing artifacts that
behave differently depending on the machine. My preliminary testing shows
it helps, but not by much. It does not restore performance to where it was,
but it's easier to understand, which is important if there are changes in
the scheduler later.

In combination, I also think that slowing PTE scanning when migration fails
is the correct action even if it is unrelated to the patch Dave bisected
to. It's stupid to increase scanning rates and incur more faults when
migrations are failing, so I'll be testing that next.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
flags |= TNF_FAULT_LOCAL;
}
 
-   /*
-* Avoid grouping on DSO/COW pages in specific and RO pages
-* in general, RO pages shouldn't hurt as much anyway since
-* they can be in shared cache state.
-*
-* FIXME! This checks pmd_dirty() as an approximation of
-* is this a read-only page, since checking pmd_write()
-* is even more broken. We haven't actually turned this into
-* a writable page, so pmd_write() will always be false.
-*/
-   if (!pmd_dirty(pmd))
+   /* See similar comment in do_numa_page for explanation */
+   if (!(vma->vm_flags & VM_WRITE))
flags |= TNF_NO_GROUP;
 
/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
 
/*
-* Avoid grouping on DSO/COW pages in specific and RO pages
-* in general, RO pages shouldn't hurt as much anyway since
-* they can be in shared cache state.
+* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+* much anyway since they can be in shared cache state. This misses
+* the case where a mapping is writable but the process never writes
+* to it but pte_write gets cleared during protection updates and
+* pte_dirty has unpredictable behaviour between PTE scan updates,
+* background writeback, dirty balancing and application behaviour.
 *
-* FIXME! This checks pmd_dirty() as an approximation of
-* is this a read-only page, since checking pmd_write()
-* is even more broken. We haven't actually turned this into
-* a writable page, so pmd_write() will always be false.
+* TODO: Note that the ideal here would be to avoid a situation where a
+* NUMA fault is taken immediately followed by a write fault in
+* some cases which would have lower overhead overall but would be
+* invasive as the fault paths would need to be unified.
 */
-   if (!pte_dirty(pte))
+   if (!(vma->vm_flags & VM_WRITE))
flags |= TNF_NO_GROUP;
 
/*

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Thu, Mar 19, 2015 at 7:10 AM, Mel Gorman mgor...@suse.de wrote:
 -   if (!pmd_dirty(pmd))
 +   /* See similar comment in do_numa_page for explanation */
 +   if (!(vma->vm_flags & VM_WRITE))

Yeah, that would certainly be a whole lot more obvious than all the
"if this particular pte/pmd looks like X" tests.

So that, together with scanning rate improvements (this *does* seem to
be somewhat chaotic, so it's quite possible that the current scanning
rate thing is just fairly unstable) is likely the right thing. I'd
just like to _understand_ why that write/dirty bit makes such a
difference. I thought I understood what was going on, and was happy,
and then Dave come with his crazy numbers.

Damn you Dave, and damn your numbers and facts and stuff. Sometimes
I much prefer ignorant bliss.

   Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-19 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 10:31 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 So I think there's something I'm missing. For non-shared mappings, I
 still have the idea that pte_dirty should be the same as pte_write.
 And yet, your testing of 3.19 shows that it's a big difference.
 There's clearly something I'm completely missing.

Ahh. The normal page table scanning and page fault handling both clear
and set the dirty bit together with the writable one. But fork()
will clear the writable bit without clearing dirty. For some reason I
thought it moved the dirty bit into the struct page like the VM
scanning does, but that was just me having a brainfart. So yeah,
pte_dirty doesn't have to match pte_write even under perfectly normal
circumstances. Maybe there are other cases.

Not that I see a lot of forking in the xfs repair case either, so..

Dave, mind re-running the plain 3.19 numbers to really verify that the
pte_dirty/pte_write change really made that big of a difference. Maybe
your recollection of ~55,000 migrate_pages events was faulty. If the
pte_write -pte_dirty change is the *only* difference, it's still very
odd how that one difference would make migrate_rate go from ~55k to
471k. That's an order of magnitude difference, for what really
shouldn't be a big change.

I'm running a kernel right now with a hacky update_mmu_cache() that
warns if pte_dirty is ever different from pte_write().

+void update_mmu_cache(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep)
+{
+   if (!(vma->vm_flags & VM_SHARED)) {
+   pte_t now = READ_ONCE(*ptep);
+   if (!pte_write(now) != !pte_dirty(now)) {
+   static int count = 20;
+   static unsigned int prev = 0;
+   unsigned int val = pte_val(now) & 0xfff;
+   if (prev != val && count) {
+   prev = val;
+   count--;
+   WARN(1, "pte value %x", val);
+   }
+   }
+   }
+}

I haven't seen a single warning so far (and there I wrote all that
code to limit repeated warnings), although admittedly
update_mmu_cache() isn't called for all cases where we change a pte
(not for the fork case, for example). But it *is* called for the page
faulting cases

Maybe a system update has changed libraries and memory allocation
patterns, and there is something bigger than that one-liner
pte_dirty/write change going on?

 Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-18 Thread Linus Torvalds
On Tue, Mar 17, 2015 at 3:08 PM, Dave Chinner da...@fromorbit.com wrote:

 Damn. From a performance number standpoint, it looked like we zoomed
 in on the right thing. But now it's migrating even more pages than
 before. Odd.

 Throttling problem, like Mel originally suspected?

That doesn't much make sense for the original bisect you did, though.

Although if there are two different issues, maybe that bisect was
wrong. Or rather, incomplete.

 Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
 fix up numa read-only thread grouping logic") to 3.19, so that it uses
 the same pte_dirty() logic as 4.0-rc4. That *should* make the 3.19
 and 4.0-rc4 numbers comparable.

 patched 3.19 numbers on this test are slightly worse than stock
 3.19, but nowhere near as bad as 4.0-rc4:

 241,718  migrate:mm_migrate_pages   ( +-  5.17% )

Ok, that's still much worse than plain 3.19, which was ~55,000.
Assuming your memory/measurements were the same.

So apparently the pte_write() -> pte_dirty() check isn't equivalent at
all. My thinking was that for the common case (ie private mappings) it
would be *exactly* the same, because all normal COW pages turn dirty
at the same time they turn writable (and, in page_mkclean_one(), turn
clean and read-only again at the same time). But if the numbers change
that much, then clearly my simplistic "they are the same in practice"
is just complete BS.

So why am I wrong? Why is testing for dirty not the same as testing
for writable?

I can see a few cases:

 - your load has lots of writable (but not written-to) shared memory,
and maybe the test should be something like

  pte_dirty(pte) || ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                     (VM_WRITE|VM_SHARED))

   and we really should have some helper function for this logic.

 - something completely different that I am entirely missing

What am I missing?

  Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-18 Thread Linus Torvalds
On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 So why am I wrong? Why is testing for dirty not the same as testing
 for writable?

 I can see a few cases:

  - your load has lots of writable (but not written-to) shared memory

Hmm. I tried to look at the xfsprog sources, and I don't see any
MAP_SHARED activity.  It looks like it's just using pread64/pwrite64,
and the only MAP_SHARED is for the xfsio mmap test thing, not for
xfsrepair.

So I don't see any shared mappings, but I don't know the code-base.

  - something completely different that I am entirely missing

So I think there's something I'm missing. For non-shared mappings, I
still have the idea that pte_dirty should be the same as pte_write.
And yet, your testing of 3.19 shows that it's a big difference.
There's clearly something I'm completely missing.

  Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-18 Thread Dave Chinner
On Wed, Mar 18, 2015 at 10:31:28AM -0700, Linus Torvalds wrote:
 On Wed, Mar 18, 2015 at 9:08 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  So why am I wrong? Why is testing for dirty not the same as testing
  for writable?
 
  I can see a few cases:
 
   - your load has lots of writable (but not written-to) shared memory
 
 Hmm. I tried to look at the xfsprog sources, and I don't see any
 MAP_SHARED activity.  It looks like it's just using pread64/pwrite64,
 and the only MAP_SHARED is for the xfsio mmap test thing, not for
 xfsrepair.
 
 So I don't see any shared mappings, but I don't know the code-base.

Right - all the mmap activity in the xfs_repair test is coming from
memory allocation through glibc - we don't use mmap() directly
anywhere in xfs_repair. FWIW, all the IO into these pages that are
allocated is being done via direct IO, if that makes any
difference...

   - something completely different that I am entirely missing
 
 So I think there's something I'm missing. For non-shared mappings, I
 still have the idea that pte_dirty should be the same as pte_write.
 And yet, your testing of 3.19 shows that it's a big difference.
 There's clearly something I'm completely missing.

This level of pte interactions is beyond my level of knowledge, so
I'm afraid at this point I'm not going to be much help other than to
test patches and report the result.

FWIW, here's the distribution of the hash table we are iterating
over. There are a lot of search misses, which means we are doing a
lot of pointer chasing, but the distribution is centred directly
around the goal of 8 entries per chain and there is no long tail:

libxfs_bcache: 0x67e110
Max supported entries = 808584
Max utilized entries = 808584
Active entries = 808583
Hash table size = 101073
Hits = 9789987
Misses = 8224234
Hit ratio = 54.35
MRU  0 entries =   4667 (  0%)
MRU  1 entries =      0 (  0%)
MRU  2 entries =      4 (  0%)
MRU  3 entries = 797447 ( 98%)
MRU  4 entries =    653 (  0%)
MRU  5 entries =      0 (  0%)
MRU  6 entries =   2755 (  0%)
MRU  7 entries =   1518 (  0%)
MRU  8 entries =   1518 (  0%)
MRU  9 entries =      0 (  0%)
MRU 10 entries =     21 (  0%)
MRU 11 entries =      0 (  0%)
MRU 12 entries =      0 (  0%)
MRU 13 entries =      0 (  0%)
MRU 14 entries =      0 (  0%)
MRU 15 entries =      0 (  0%)
Hash buckets with  0 entries     30 (  0%)
Hash buckets with  1 entries    241 (  0%)
Hash buckets with  2 entries   1019 (  0%)
Hash buckets with  3 entries   2787 (  1%)
Hash buckets with  4 entries   5838 (  2%)
Hash buckets with  5 entries   9144 (  5%)
Hash buckets with  6 entries  12165 (  9%)
Hash buckets with  7 entries  14194 ( 12%)
Hash buckets with  8 entries  14387 ( 14%)
Hash buckets with  9 entries  12742 ( 14%)
Hash buckets with 10 entries  10253 ( 12%)
Hash buckets with 11 entries   7308 (  9%)
Hash buckets with 12 entries   4872 (  7%)
Hash buckets with 13 entries   2869 (  4%)
Hash buckets with 14 entries   1578 (  2%)
Hash buckets with 15 entries    894 (  1%)
Hash buckets with 16 entries    430 (  0%)
Hash buckets with 17 entries    188 (  0%)
Hash buckets with 18 entries     88 (  0%)
Hash buckets with 19 entries     24 (  0%)
Hash buckets with 20 entries     11 (  0%)
Hash buckets with 21 entries     10 (  0%)
Hash buckets with 22 entries      1 (  0%)


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-17 Thread Dave Chinner
On Tue, Mar 17, 2015 at 02:30:57PM -0700, Linus Torvalds wrote:
 On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner da...@fromorbit.com wrote:
 
  On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
  was using during steady state shows:
 
   471,752  migrate:mm_migrate_pages ( +-  7.38% )
 
  The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
  and 3.19 (~55,000), so that looks like even more of a problem than
  before.
 
 Hmm. How stable are those numbers boot-to-boot?

I've run the test several times but only profiled once so far.
Runtimes were 7m45s, 7m50s, 7m44s and 8m2s, and the profiles came from
the 8m2s run.

reboot, run again:

$ sudo perf stat -a -r 6 -e migrate:mm_migrate_pages sleep 10

 Performance counter stats for 'system wide' (6 runs):

   572,839  migrate:mm_migrate_pages( +-  3.15% )

  10.001664694 seconds time elapsed ( +-  0.00% )
$

And just to confirm, a minute later, still in phase 3:

590,974  migrate:mm_migrate_pages   ( +-  2.86% )

Reboot, run again:

575,344  migrate:mm_migrate_pages   ( +-  0.70% )

So there is boot-to-boot variation, but it doesn't look like it
gets any better.

 That kind of extreme spread makes me suspicious. It's also interesting
 that if the numbers really go up even more (and by that big amount),
 then why does there seem to be almost no correlation with performance
 (which apparently went up since rc1, despite migrate_pages getting
 even _worse_).
 
  And the profile looks like:
 
  -   43.73% 0.05%  [kernel][k] native_flush_tlb_others
 
 Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
 And flush_tlb_page() does seem to be called about ten times more
 (flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
 at 0.13%)
 
 Damn. From a performance number standpoint, it looked like we zoomed
 in on the right thing. But now it's migrating even more pages than
 before. Odd.

Throttling problem, like Mel originally suspected?

  And the vmstats are:
 
  3.19:
 
  numa_hit 5163221
  numa_local 5153127
 
  4.0-rc1:
 
  numa_hit 36952043
  numa_local 36927384
 
  4.0-rc4:
 
  numa_hit 23447345
  numa_local 23438564
 
  Page migrations are still up by a factor of ~20 on 3.19.
 
 The thing is, those numa_hit things come from the zone_statistics()
 call in buffered_rmqueue(), which in turn is simple from the memory
 allocator. That has *nothing* to do with virtual memory, and
 everything to do with actual physical memory allocations.  So the load
 is simply allocating a lot more pages, presumably for those stupid
 migration events.
 
 But then it doesn't correlate with performance anyway..

 Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
 fix up numa read-only thread grouping logic") to 3.19, so that it uses
 the same pte_dirty() logic as 4.0-rc4. That *should* make the 3.19
 and 4.0-rc4 numbers comparable.

patched 3.19 numbers on this test are slightly worse than stock
3.19, but nowhere near as bad as 4.0-rc4:

241,718  migrate:mm_migrate_pages   ( +-  5.17% )

So that pte_write->pte_dirty change makes this go from ~55k to 240k,
and runtime goes from 4m54s to 5m20s. vmstats:

numa_hit 9162476
numa_miss 0
numa_foreign 0
numa_interleave 10685
numa_local 9153740
numa_other 8736
numa_pte_updates 49582103
numa_huge_pte_updates 0
numa_hint_faults 48075098
numa_hint_faults_local 12974704
numa_pages_migrated 5748256
pgmigrate_success 5748256
pgmigrate_fail 0

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-17 Thread Linus Torvalds
On Tue, Mar 17, 2015 at 1:51 PM, Dave Chinner da...@fromorbit.com wrote:

 On the -o ag_stride=-1 -o bhash=101073 config, the 60s perf stat I
 was using during steady state shows:

  471,752  migrate:mm_migrate_pages ( +-  7.38% )

 The migrate pages rate is even higher than in 4.0-rc1 (~360,000)
 and 3.19 (~55,000), so that looks like even more of a problem than
 before.

Hmm. How stable are those numbers boot-to-boot?

That kind of extreme spread makes me suspicious. It's also interesting
that if the numbers really go up even more (and by that big amount),
then why does there seem to be almost no correlation with performance
(which apparently went up since rc1, despite migrate_pages getting
even _worse_).

 And the profile looks like:

 -   43.73% 0.05%  [kernel][k] native_flush_tlb_others

Ok, that's down from rc1 (67%), but still hugely up from 3.19 (13.7%).
And flush_tlb_page() does seem to be called about ten times more
(flush_tlb_mm_range used to be 1.4% of the callers, now it's invisible
at 0.13%)

Damn. From a performance number standpoint, it looked like we zoomed
in on the right thing. But now it's migrating even more pages than
before. Odd.

 And the vmstats are:

 3.19:

 numa_hit 5163221
 numa_local 5153127

 4.0-rc1:

 numa_hit 36952043
 numa_local 36927384

 4.0-rc4:

 numa_hit 23447345
 numa_local 23438564

 Page migrations are still up by a factor of ~20 on 3.19.

The thing is, those numa_hit things come from the zone_statistics()
call in buffered_rmqueue(), which in turn is simply from the memory
allocator. That has *nothing* to do with virtual memory, and
everything to do with actual physical memory allocations.  So the load
is simply allocating a lot more pages, presumably for those stupid
migration events.

But then it doesn't correlate with performance anyway..

Can you do a simple stupid test? Apply that commit 53da3bc2ba9e ("mm:
fix up numa read-only thread grouping logic") to 3.19, so that it uses
the same pte_dirty() logic as 4.0-rc4. That *should* make the 3.19
and 4.0-rc4 numbers comparable.

It does make me wonder if your load is chaotic wrt scheduling. The
load presumably wants to spread out across all cpu's, but then the
numa code tries to group things together for numa accesses, but
depending on just random allocation patterns and layout in the hash
tables, there either are patterns with page access or there aren't.

Which is kind of why I wonder how stable those numbers are boot to
boot. Maybe this is at least partly about lucky allocation patterns.

  Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-17 Thread Linus Torvalds
On Tue, Mar 17, 2015 at 12:06 AM, Dave Chinner da...@fromorbit.com wrote:

 To close the loop here, now I'm back home and can run tests:

 config                        3.19  4.0-rc1  4.0-rc4
 defaults                     8m08s    9m34s    9m14s
 -o ag_stride=-1              4m04s    4m38s    4m11s
 -o bhash=101073              6m04s   17m43s    7m35s
 -o ag_stride=-1,bhash=101073 4m54s    9m58s    7m50s

 It's better but there are still significant regressions, especially
 for the large memory footprint cases. I haven't had a chance to look
 at any stats or profiles yet, so I don't know yet whether this is
 still page fault related or some other problem.

Ok. I'd love to see some data on what changed between 3.19 and rc4 in
the profiles, just to see whether it's more page faults due to extra
COW, or whether it's due to more TLB flushes because of the
pte_write() vs pte_dirty() differences. I'm *guessing* a lot of the
remaining issues are due to extra page fault overhead because I'd
expect write/dirty to be fairly 1:1, but there could be differences
due to shared memory use and/or just writebacks of dirty pages that
become clean.

I guess you can also see in vmstat.mm_migrate_pages whether it's
because of excessive migration (because of bad grouping) or not. So
not just profiles data.

At the same time, I feel fairly happy about the situation - we at
least understand what is going on, and the 3x worse performance case
is at least gone.  Even if that last case still looks horrible.

So it's still a bad performance regression, but at the same time I
think your test setup (big 500 TB filesystem, but then a fake-numa
thing with just 4GB per node) is specialized and unrealistic enough
that I don't feel it's all that relevant from a *real-world*
standpoint, and so I wouldn't be uncomfortable saying "ok, the page
table handling cleanup caused some issues, but we know about them and
how to fix them longer-term".  So I don't consider this a 4.0
showstopper or a "we need to revert for now" issue.

If it's a case of we take a lot more page faults because we handle
the NUMA fault and then have a COW fault almost immediately, then the
fix is likely to do the same early-cow that the normal non-numa-fault
case does. In fact, my gut feel is that we should try to unify that
numa/regular fault handling path a bit more, but that would be a pretty
invasive patch.

 Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-12 Thread Mel Gorman
On Tue, Mar 10, 2015 at 04:55:52PM -0700, Linus Torvalds wrote:
 On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner da...@fromorbit.com wrote:
  On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
 
  What's your virtual environment setup? Kernel config, and
  virtualization environment to actually get that odd fake NUMA thing
  happening?
 
  I don't have the exact .config with me (test machines at home
  are shut down because I'm half a world away), but it's pretty much
  this (copied and munged from a similar test vm on my laptop):
 
 [ snip snip ]
 
 Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
 and went back to actually *thinking* about the code instead of trying
 to reproduce this and figure things out by trial and error.
 
 And I think I figured it out.
 SNIP

I believe you're correct and it matches what was observed. I'm still
travelling and wireless is dirt but managed to queue a test using pmd_dirty

                                            3.19.0             4.0.0-rc1             4.0.0-rc1
                                           vanilla               vanilla        ptewrite-v1r20
Time User-NUMA01              25695.96 (  0.00%)    32883.59 (-27.97%)    24012.80 (  6.55%)
Time User-NUMA01_THEADLOCAL   17404.36 (  0.00%)    17453.20 ( -0.28%)    17950.54 ( -3.14%)
Time User-NUMA02               2037.65 (  0.00%)     2063.70 ( -1.28%)     2046.88 ( -0.45%)
Time User-NUMA02_SMT            981.02 (  0.00%)      983.70 ( -0.27%)      983.68 ( -0.27%)
Time System-NUMA01              194.70 (  0.00%)      602.44 (-209.42%)     158.90 ( 18.39%)
Time System-NUMA01_THEADLOCAL    98.52 (  0.00%)       78.10 ( 20.73%)      107.66 ( -9.28%)
Time System-NUMA02                9.28 (  0.00%)        6.47 ( 30.28%)        9.25 (  0.32%)
Time System-NUMA02_SMT            3.79 (  0.00%)        5.06 (-33.51%)        3.92 ( -3.43%)
Time Elapsed-NUMA01             558.84 (  0.00%)      755.96 (-35.27%)      532.41 (  4.73%)
Time Elapsed-NUMA01_THEADLOCAL  382.54 (  0.00%)      382.22 (  0.08%)      390.48 ( -2.08%)
Time Elapsed-NUMA02              49.83 (  0.00%)       49.38 (  0.90%)       49.79 (  0.08%)
Time Elapsed-NUMA02_SMT          46.59 (  0.00%)       47.70 ( -2.38%)       47.77 ( -2.53%)
Time CPU-NUMA01                4632.00 (  0.00%)     4429.00 (  4.38%)     4539.00 (  2.01%)
Time CPU-NUMA01_THEADLOCAL     4575.00 (  0.00%)     4586.00 ( -0.24%)     4624.00 ( -1.07%)
Time CPU-NUMA02                4107.00 (  0.00%)     4191.00 ( -2.05%)     4129.00 ( -0.54%)
Time CPU-NUMA02_SMT            2113.00 (  0.00%)     2072.00 (  1.94%)     2067.00 (  2.18%)

                  3.19.0     4.0.0-rc1      4.0.0-rc1
                 vanilla       vanilla ptewrite-v1r20
User            46119.12      53384.29       44994.10
System            306.41        692.14         279.78
Elapsed          1039.88       1236.87        1022.92

There are still some differences but it's much closer to what it was.
The balancing stats are looking almost similar to 3.19.

NUMA base PTE updates    222840103   304513172   230724075
NUMA huge PMD updates       434894      594467      450274
NUMA page range updates  445505831   608880276   461264363
NUMA hint faults            601358      733491      626176
NUMA hint local faults      371571      511530      359215
NUMA hint local percent         61          69          57
NUMA pages migrated        7073177    26366701     6829196

XFS repair on the same machine is not fully restored either but a big
enough move in the right direction to indicate this was the relevant
change.

xfsrepair
                                         3.19.0            4.0.0-rc1            4.0.0-rc1
                                        vanilla              vanilla       ptewrite-v1r20
Amean    real-fsmark    1166.28 (  0.00%)    1166.63 ( -0.03%)    1184.97 ( -1.60%)
Amean    syst-fsmark    4025.87 (  0.00%)    4020.94 (  0.12%)    4071.10 ( -1.12%)
Amean    real-xfsrepair  447.66 (  0.00%)     507.85 (-13.45%)     460.94 ( -2.97%)
Amean    syst-xfsrepair  202.93 (  0.00%)     519.88 (-156.19%)    282.45 (-39.19%)

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-12 Thread Linus Torvalds
On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman mgor...@suse.de wrote:

 I believe you're correct and it matches what was observed. I'm still
 travelling and wireless is dirt but managed to queue a test using pmd_dirty

Ok, thanks.

I'm not entirely happy with that change, and I suspect the whole
heuristic should be looked at much more (maybe it should also look at
whether it's executable, for example), but it's a step in the right
direction.

So I committed it and added a comment, and wrote a commit log about
it. I suspect any further work is post-4.0-release, unless somebody
comes up with something small and simple and obviously better.

 Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-12 Thread Mel Gorman
On Thu, Mar 12, 2015 at 09:20:36AM -0700, Linus Torvalds wrote:
 On Thu, Mar 12, 2015 at 6:10 AM, Mel Gorman mgor...@suse.de wrote:
 
  I believe you're correct and it matches what was observed. I'm still
  travelling and wireless is dirt but managed to queue a test using pmd_dirty
 
 Ok, thanks.
 
 I'm not entirely happy with that change, and I suspect the whole
 heuristic should be looked at much more (maybe it should also look at
 whether it's executable, for example), but it's a step in the right
 direction.
 

I can follow up when I'm back in work properly. As you have already pulled
this in directly, can you also consider pulling in "mm: thp: return the
correct value for change_huge_pmd" please? The other two patches were very
minor and can be resent through the normal paths later.

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-10 Thread Dave Chinner
On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
 On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner da...@fromorbit.com wrote:
 
  Also, is there some sane way for me to actually see this behavior on a
  regular machine with just a single socket? Dave is apparently running
  in some fake-numa setup, I'm wondering if this is easy enough to
  reproduce that I could see it myself.
 
  Should be - I don't actually use 500TB of storage to generate this -
  50GB on an SSD is all you need from the storage side. I just use a
  sparse backing file to make it look like a 500TB device. :P
 
 What's your virtual environment setup? Kernel config, and
 virtualization environment to actually get that odd fake NUMA thing
 happening?

I don't have the exact .config with me (test machines at home
are shut down because I'm half a world away), but it's pretty much
this (copied and munged from a similar test vm on my laptop):

$ cat run-vm-4.sh
sudo qemu-system-x86_64 \
-machine accel=kvm \
-no-fd-bootchk \
-localtime \
-boot c \
-serial pty \
-nographic \
-alt-grab \
-smp 16 -m 16384 \
-hda /data/vm-2/root.img \
-drive file=/vm/vm-4/vm-4-test.img,if=virtio,cache=none \
-drive file=/vm/vm-4/vm-4-scratch.img,if=virtio,cache=none \
-drive file=/vm/vm-4/vm-4-500TB.img,if=virtio,cache=none \
-kernel /vm/vm-4/vmlinuz \
-append "console=ttyS0,115200 root=/dev/sda1 numa=fake=4"
$

And on the host I have /vm on a ssd that is an XFS filesystem, and
I've created /vm/vm-4/vm-4-500TB.img by doing:

$ xfs_io -f -c truncate 500t -c extsize 1m /vm/vm-4/vm-4-500TB.img

and in the guest the filesystem is created with:

# mkfs.xfs -f -mcrc=1,finobt=1 /dev/vdc

And that will create a 500TB filesystem that you can then mount and
run fsmark on it, then unmount and run xfs_repair on it.

the .config I have on my laptop is from 3.18-rc something, but it
should work just with a make oldconfig update. It's attached below.

Hopefully this will be sufficient for you, otherwise it'll have to
wait until I get home to get the exact configs for you.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.18.0-rc1 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT=elf64-x86-64
CONFIG_ARCH_DEFCONFIG=arch/x86/configs/x86_64_defconfig
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS=-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME=(none)
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_FHANDLE is not set
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# 

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-10 Thread Mel Gorman
On Mon, Mar 09, 2015 at 09:02:19PM +, Mel Gorman wrote:
 On Sun, Mar 08, 2015 at 08:40:25PM +, Mel Gorman wrote:
   Because if the answer is 'yes', then we can safely say: 'we regressed 
   performance because correctness [not dropping dirty bits] comes before 
   performance'.
   
   If the answer is 'no', then we still have a mystery (and a regression) 
   to track down.
   
   As a second hack (not to be applied), could we change:
   
#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
   
   to:
   
#define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
   
  
  In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
  as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
  can be done.
  
 
 More importantly, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just
 updating the value should crash. For the purposes of testing the idea, I
 thought the straight-forward option was to break soft dirty page tracking
 and steal their bit for testing (patch below). Took most of the day to
 get access to the test machine so tests are not long running and only
 the autonuma one has completed;
 

And the xfsrepair workload also does not show any benefit from using a
different bit either

                                     3.19.0            4.0.0-rc1            4.0.0-rc1            4.0.0-rc1
                                    vanilla              vanilla        slowscan-v2r7       protnone-v3r17
Min      real-fsmark    1164.44 (  0.00%)    1157.41 (  0.60%)    1150.38 (  1.21%)    1173.22 ( -0.75%)
Min      syst-fsmark    4016.12 (  0.00%)    3998.06 (  0.45%)    3988.42 (  0.69%)    4037.90 ( -0.54%)
Min      real-xfsrepair  442.64 (  0.00%)     497.64 (-12.43%)     456.87 ( -3.21%)     489.60 (-10.61%)
Min      syst-xfsrepair  194.97 (  0.00%)     500.61 (-156.76%)    263.41 (-35.10%)     544.56 (-179.30%)
Amean    real-fsmark    1166.28 (  0.00%)    1166.63 ( -0.03%)    1155.97 (  0.88%)    1183.19 ( -1.45%)
Amean    syst-fsmark    4025.87 (  0.00%)    4020.94 (  0.12%)    4004.19 (  0.54%)    4061.64 ( -0.89%)
Amean    real-xfsrepair  447.66 (  0.00%)     507.85 (-13.45%)     459.58 ( -2.66%)     498.71 (-11.40%)
Amean    syst-xfsrepair  202.93 (  0.00%)     519.88 (-156.19%)    281.63 (-38.78%)     569.21 (-180.50%)
Stddev   real-fsmark       1.44 (  0.00%)       6.55 (-354.10%)      3.97 (-175.65%)      9.20 (-537.90%)
Stddev   syst-fsmark       9.76 (  0.00%)      16.22 (-66.27%)      15.09 (-54.69%)      17.47 (-79.13%)
Stddev   real-xfsrepair     5.57 (  0.00%)     11.17 (-100.68%)      3.41 ( 38.66%)       6.77 (-21.63%)
Stddev   syst-xfsrepair     5.69 (  0.00%)     13.98 (-145.78%)     19.94 (-250.49%)     20.03 (-252.05%)
CoeffVar real-fsmark       0.12 (  0.00%)       0.56 (-353.96%)      0.34 (-178.11%)      0.78 (-528.79%)
CoeffVar syst-fsmark       0.24 (  0.00%)       0.40 (-66.48%)       0.38 (-55.53%)       0.43 (-77.55%)
CoeffVar real-xfsrepair    1.24 (  0.00%)       2.20 (-76.89%)       0.74 ( 40.25%)       1.36 ( -9.17%)
CoeffVar syst-xfsrepair    2.80 (  0.00%)       2.69 (  4.06%)       7.08 (-152.54%)      3.52 (-25.51%)
Max      real-fsmark    1167.96 (  0.00%)    1171.98 ( -0.34%)    1159.25 (  0.75%)    1195.41 ( -2.35%)
Max      syst-fsmark    4039.20 (  0.00%)    4033.84 (  0.13%)    4024.53 (  0.36%)    4079.45 ( -1.00%)
Max      real-xfsrepair  455.42 (  0.00%)     523.40 (-14.93%)     464.40 ( -1.97%)     505.82 (-11.07%)
Max      syst-xfsrepair  207.94 (  0.00%)     533.37 (-156.50%)    309.38 (-48.78%)     593.62 (-185.48%)


Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-10 Thread Linus Torvalds
On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner da...@fromorbit.com wrote:
 On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:

 What's your virtual environment setup? Kernel config, and
 virtualization environment to actually get that odd fake NUMA thing
 happening?

 I don't have the exact .config with me (test machines at home
 are shut down because I'm half a world away), but it's pretty much
 this (copied and munged from a similar test vm on my laptop):

[ snip snip ]

Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
and went back to actually *thinking* about the code instead of trying
to reproduce this and figure things out by trial and error.

And I think I figured it out. Of course, since I didn't actually test
anything, what do I know, but I feel good about it, because I think I
can explain why that patch that on the face of it shouldn't change
anything actually did.

So, the old code just did all those manual page table changes,
clearing the present bit and setting the NUMA bit instead.

The new code _ostensibly_ does the same, except it clears the present
bit and sets the PROTNONE bit instead.

However, rather than playing special games with just those two bits,
it uses the normal pte accessor functions, and in particular uses
vma->vm_page_prot to reset the protections back. Which is a nice
cleanup and really makes the code look saner, and does the same thing.

Except it really isn't the same thing at all.

Why?

The protection bits in the page tables are *not* the same as
vma->vm_page_prot. Yes, they start out that way, but they don't stay
that way. And no, I'm not talking about dirty and accessed bits.

The difference? COW. Any private mapping is marked read-only in
vma->vm_page_prot, and then the COW (or the initial write) makes it
read-write.

And so, when we did

-   pte = pte_mknonnuma(pte);
+   /* Make it present again */
+   pte = pte_modify(pte, vma->vm_page_prot);
+   pte = pte_mkyoung(pte);

that isn't equivalent at all - it makes the page read-only, because it
restores it to its original state.

Now, that isn't actually what hurts most, I suspect. Judging by the
profiles, we don't suddenly take a lot of new COW faults. No, what
hurts most is that the NUMA balancing code does this:

/*
 * Avoid grouping on DSO/COW pages in specific and RO pages
 * in general, RO pages shouldn't hurt as much anyway since
 * they can be in shared cache state.
 */
if (!pte_write(pte))
flags |= TNF_NO_GROUP;

and that !pte_write(pte) is basically now *always* true for private
mappings (which is 99% of all mappings).

In other words, I think the patch unintentionally made the NUMA code
basically always do the TNF_NO_GROUP case.

I think that a quick hack for testing might be to just replace that
!pte_write() with !pte_dirty(), and seeing how that acts.

Comments?

  Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-09 Thread Dave Chinner
On Sun, Mar 08, 2015 at 11:35:59AM -0700, Linus Torvalds wrote:
 On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar mi...@kernel.org wrote:
 But:
 
  As a second hack (not to be applied), could we change:
 
   #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
 
  to:
 
   #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
 
  to double check that the position of the bit does not matter?
 
 Agreed. We should definitely try that.
 
 Dave?

As Mel has already mentioned, I'm in Boston for LSFMM and don't have
access to the test rig I've used to generate this.

 Also, is there some sane way for me to actually see this behavior on a
 regular machine with just a single socket? Dave is apparently running
 in some fake-numa setup, I'm wondering if this is easy enough to
 reproduce that I could see it myself.

Should be - I don't actually use 500TB of storage to generate this -
50GB on an SSD is all you need from the storage side. I just use a
sparse backing file to make it look like a 500TB device. :P

i.e. create an XFS filesystem on a 500TB sparse file with mkfs.xfs
-d size=500t,file=1 /path/to/file.img, mount it on loopback or as a
virtio,cache=none device for the guest vm and then use fsmark to
generate several million files spread across many, many directories
such as:

$  fs_mark -D 1 -S0 -n 10 -s 1 -L 32 -d \
/mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \
/mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \
/mnt/scratch/6 -d /mnt/scratch/7

That should only take a few minutes to run - if you throw 8p at it
then it should run at 100k files/s being created.

Then unmount and run xfs_repair -o bhash=101073 /path/to/file.img
on the resultant image file.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-09 Thread Linus Torvalds
On Mon, Mar 9, 2015 at 4:29 AM, Dave Chinner da...@fromorbit.com wrote:

 Also, is there some sane way for me to actually see this behavior on a
 regular machine with just a single socket? Dave is apparently running
 in some fake-numa setup, I'm wondering if this is easy enough to
 reproduce that I could see it myself.

 Should be - I don't actually use 500TB of storage to generate this -
 50GB on an SSD is all you need from the storage side. I just use a
 sparse backing file to make it look like a 500TB device. :P

What's your virtual environment setup? Kernel config, and
virtualization environment to actually get that odd fake NUMA thing
happening?

  Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-09 Thread Mel Gorman
On Sun, Mar 08, 2015 at 08:40:25PM +, Mel Gorman wrote:
  Because if the answer is 'yes', then we can safely say: 'we regressed 
  performance because correctness [not dropping dirty bits] comes before 
  performance'.
  
  If the answer is 'no', then we still have a mystery (and a regression) 
  to track down.
  
  As a second hack (not to be applied), could we change:
  
   #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
  
  to:
  
   #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
  
 
 In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
 as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
 can be done.
 

More importantly, _PAGE_BIT_GLOBAL+1 == the special PTE bit so just
updating the value should crash. For the purposes of testing the idea, I
thought the straight-forward option was to break soft dirty page tracking
and steal their bit for testing (patch below). Took most of the day to
get access to the test machine so tests are not long running and only
the autonuma one has completed;

autonumabench
                                            3.19.0             4.0.0-rc1             4.0.0-rc1             4.0.0-rc1
                                           vanilla               vanilla         slowscan-v2r7           protnone-v3
Time User-NUMA01              25695.96 (  0.00%)    32883.59 (-27.97%)    35288.00 (-37.33%)    35236.21 (-37.13%)
Time User-NUMA01_THEADLOCAL   17404.36 (  0.00%)    17453.20 ( -0.28%)    17765.79 ( -2.08%)    17590.10 ( -1.07%)
Time User-NUMA02               2037.65 (  0.00%)     2063.70 ( -1.28%)     2063.22 ( -1.25%)     2072.95 ( -1.73%)
Time User-NUMA02_SMT            981.02 (  0.00%)      983.70 ( -0.27%)      976.01 (  0.51%)      983.42 ( -0.24%)
Time System-NUMA01              194.70 (  0.00%)      602.44 (-209.42%)     209.42 ( -7.56%)      737.36 (-278.72%)
Time System-NUMA01_THEADLOCAL    98.52 (  0.00%)       78.10 ( 20.73%)       92.70 (  5.91%)       80.69 ( 18.10%)
Time System-NUMA02                9.28 (  0.00%)        6.47 ( 30.28%)        6.06 ( 34.70%)        6.63 ( 28.56%)
Time System-NUMA02_SMT            3.79 (  0.00%)        5.06 (-33.51%)        3.39 ( 10.55%)        3.60 (  5.01%)
Time Elapsed-NUMA01             558.84 (  0.00%)      755.96 (-35.27%)      833.63 (-49.17%)      804.50 (-43.96%)
Time Elapsed-NUMA01_THEADLOCAL  382.54 (  0.00%)      382.22 (  0.08%)      395.45 ( -3.37%)      388.12 ( -1.46%)
Time Elapsed-NUMA02              49.83 (  0.00%)       49.38 (  0.90%)       50.21 ( -0.76%)       48.99 (  1.69%)
Time Elapsed-NUMA02_SMT          46.59 (  0.00%)       47.70 ( -2.38%)       48.55 ( -4.21%)       49.50 ( -6.25%)
Time CPU-NUMA01                4632.00 (  0.00%)     4429.00 (  4.38%)     4258.00 (  8.07%)     4471.00 (  3.48%)
Time CPU-NUMA01_THEADLOCAL     4575.00 (  0.00%)     4586.00 ( -0.24%)     4515.00 (  1.31%)     4552.00 (  0.50%)
Time CPU-NUMA02                4107.00 (  0.00%)     4191.00 ( -2.05%)     4120.00 ( -0.32%)     4244.00 ( -3.34%)
Time CPU-NUMA02_SMT            2113.00 (  0.00%)     2072.00 (  1.94%)     2017.00 (  4.54%)     1993.00 (  5.68%)

                  3.19.0     4.0.0-rc1     4.0.0-rc1     4.0.0-rc1
                 vanilla       vanilla slowscan-v2r7   protnone-v3
User            46119.12      53384.29      56093.11      55882.82
System            306.41        692.14        311.64        828.36
Elapsed          1039.88       1236.87       1328.61       1292.92

So just using a different bit doesn't seem to be it either.

                              3.19.0   4.0.0-rc1   4.0.0-rc1   4.0.0-rc1
                             vanilla     vanilla  slowscan-v2r7  protnone-v3
NUMA alloc hit               1202922     1437560     1472578     1499274
NUMA alloc miss                    0           0           0           0
NUMA interleave hit                0           0           0           0
NUMA alloc local             1200683     1436781     1472226     1498680
NUMA base PTE updates      222840103   304513172   121532313   337431414
NUMA huge PMD updates         434894      594467      237170      658715
NUMA page range updates    445505831   608880276   242963353   674693494
NUMA hint faults              601358      733491      334334      820793
NUMA hint local faults        371571      511530      227171      565003
NUMA hint local percent           61          69          67          68
NUMA pages migrated          7073177    26366701     8607082    31288355

Patch to use a bit other than the global bit for prot none is below.

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 8c7c10802e9c..1f243323693c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -20,16 +20,16 @@
 #define _PAGE_BIT_SOFTW2   10  /*  */
 #define _PAGE_BIT_SOFTW3   11  /*  */
 #define _PAGE_BIT_PAT_LARGE12  

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Mel Gorman
On Sun, Mar 08, 2015 at 11:02:23AM +0100, Ingo Molnar wrote:
 
 * Linus Torvalds torva...@linux-foundation.org wrote:
 
  On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar mi...@kernel.org wrote:
  
   And the patch Dave bisected to is a relatively simple patch. Why 
   not simply revert it to see whether that cures much of the 
   problem?
  
  So the problem with that is that pmd_set_numa() and friends simply 
  no longer exist. So we can't just revert that one patch, it's the 
  whole series, and the whole point of the series.
 
 Yeah.
 
  What confuses me is that the only real change that I can see in that 
  patch is the change to change_huge_pmd(). Everything else is 
  pretty much a 100% equivalent transformation, afaik. Of course, I 
  may be wrong about that, and missing something silly.
 
 Well, there's a difference in what we write to the pte:
 
  #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL+1)
  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
 
 and our expectation was that the two should be equivalent methods from 
 the POV of the NUMA balancing code, right?
 

Functionally yes, but performance-wise no. We are now using the global bit
for NUMA faults at the very least.

  And the changes to change_huge_pmd() were basically re-done
  differently by subsequent patches anyway.
  
  The *only* change I see remaining is that change_huge_pmd() now does
  
 entry = pmdp_get_and_clear_notify(mm, addr, pmd);
 entry = pmd_modify(entry, newprot);
 set_pmd_at(mm, addr, pmd, entry);
  
  for all changes. It used to do that pmdp_set_numa() for the
  prot_numa case, which did just
  
 pmd_t pmd = *pmdp;
 pmd = pmd_mknuma(pmd);
 set_pmd_at(mm, addr, pmdp, pmd);
  
  instead.
  
  I don't like the old pmdp_set_numa() because it can drop dirty bits,
  so I think the old code was actively buggy.
 
 Could we, as a silly testing hack not to be applied, write a 
 hack-patch that re-introduces the racy way of setting the NUMA bit, to 
 confirm that it is indeed this difference that changes pte visibility 
 across CPUs enough to create so many more faults?
 

This was already done and tested by Dave but while it helped, it was
not enough.  As the approach was inherently unsafe it was dropped and the
throttling approach taken. However, the fact it made little difference
may indicate that this is somehow related to the global bit being used.

 Because if the answer is 'yes', then we can safely say: 'we regressed 
 performance because correctness [not dropping dirty bits] comes before 
 performance'.
 
 If the answer is 'no', then we still have a mystery (and a regression) 
 to track down.
 
 As a second hack (not to be applied), could we change:
 
  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
 
 to:
 
  #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)
 

In itself, that's not enough. The SWP_OFFSET_SHIFT would also need updating
as a partial revert of 21d9ee3eda7792c45880b2f11bff8e95c9a061fb but it
can be done.

 to double check that the position of the bit does not matter?
 

It's worth checking in case it's a case of how the global bit is
treated. However, note that Dave is currently travelling for LSF/MM in
Boston and there is a chance he cannot test this week at all. I'm just
after landing in the hotel myself. I'll try to find time during one
of the breaks tomorrow but if the wireless is too crap then accessing the
test machine remotely might be an issue.

 I don't think we've exhausted all avenues of analysis here.
 

True.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Linus Torvalds
On Sun, Mar 8, 2015 at 3:02 AM, Ingo Molnar mi...@kernel.org wrote:

 Well, there's a difference in what we write to the pte:

  #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL+1)
  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

 and our expectation was that the two should be equivalent methods from
 the POV of the NUMA balancing code, right?

Right.

But yes, we might have screwed something up. In particular, there
might be something that thinks it cares about the global bit, but
doesn't notice that the present bit isn't set, so it considers the
protnone mappings to be global and causes lots more tlb flushes etc.

 I don't like the old pmdp_set_numa() because it can drop dirty bits,
 so I think the old code was actively buggy.

 Could we, as a silly testing hack not to be applied, write a
 hack-patch that re-introduces the racy way of setting the NUMA bit, to
 confirm that it is indeed this difference that changes pte visibility
 across CPUs enough to create so many more faults?

So one of Mel's patches did that, but I don't know if Dave tested it.

And thinking about it, it *may* be safe for huge-pages, if they always
already have the dirty bit set to begin with. And I don't see how we
could have a clean hugepage (apart from the special case of the
zeropage, which is read-only, so races on the dirty bit aren't an
issue).

So it might actually be that the non-atomic version is safe for
hpages. And we could possibly get rid of the atomic read-and-clear
even for the non-numa case.

I'd rather do it for both cases than for just one of them.

But:

 As a second hack (not to be applied), could we change:

  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

 to:

  #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

 to double check that the position of the bit does not matter?

Agreed. We should definitely try that.

Dave?

Also, is there some sane way for me to actually see this behavior on a
regular machine with just a single socket? Dave is apparently running
in some fake-numa setup, I'm wondering if this is easy enough to
reproduce that I could see it myself.

  Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Linus Torvalds
On Sun, Mar 8, 2015 at 11:35 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 As a second hack (not to be applied), could we change:

  #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

 to:

  #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

 to double check that the position of the bit does not matter?

 Agreed. We should definitely try that.

There's a second reason to do that, actually: the __supported_pte_mask
thing, _and_ the pageattr stuff in __split_large_page() etc play games
with _PAGE_GLOBAL. As does drivers/lguest for some reason.

So looking at this all, there's a lot of room for confusion with _PAGE_GLOBAL.

That kind of confusion would certainly explain the whole the changes
_look_ like they do the same thing, but don't - because of silly
semantic conflicts with PROTNONE vs GLOBAL.

Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Ingo Molnar

* Mel Gorman mgor...@suse.de wrote:

 Elapsed time is primarily worse on one benchmark -- numa01 which is 
 an adverse workload. The user time differences are also dominated by 
 that benchmark
 
                                4.0.0-rc1        4.0.0-rc1           3.19.0
                                  vanilla    slowscan-v2r7          vanilla
 Time User-NUMA01              32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
 Time User-NUMA01_THEADLOCAL   17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
 Time User-NUMA02               2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
 Time User-NUMA02_SMT            983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)

But even for 'numa02', the simplest of the workloads, there appears to 
be somewhat of a regression relative to v3.19, which ought to be beyond 
the noise of the measurement (which would be below 1% I suspect), and 
as such relevant, right?

And the XFS numbers still show significant regression compared to 
v3.19 - and that cannot be ignored as artificial, 'adversarial' 
workload, right?

For example, from your numbers:

xfsrepair
                                 4.0.0-rc1          4.0.0-rc1             3.19.0
                                   vanilla        slowscan-v2            vanilla
...
Amean    real-xfsrepair     507.85 (  0.00%)     459.58 (  9.50%)     447.66 ( 11.85%)
Amean    syst-xfsrepair     519.88 (  0.00%)     281.63 ( 45.83%)     202.93 ( 60.97%)

if I interpret the numbers correctly, it shows that compared to v3.19, 
system time increased by 38% - which is rather significant!

  So what worries me is that Dave bisected the regression to:
  
  4d9424669946 (mm: convert p[te|md]_mknonnuma and remaining page table manipulations)
  
  And clearly your patch #4 just tunes balancing/migration intensity 
  - is that a workaround for the real problem/bug?
 
 The patch makes NUMA hinting faults use standard page table handling 
 routines and protections to trap the faults. Fundamentally it's 
 safer even though it appears to cause more traps to be handled. I've 
 been assuming this is related to the different permissions PTEs get 
 and when they are visible on all CPUs. This patch is addressing the 
 symptom that more faults are being handled and that it needs to be 
 less aggressive.

But the whole cleanup ought to have been close to an identity 
transformation from the CPU's point of view - and your measurements 
seem to confirm Dave's findings.

And your measurement was on bare metal, while Dave's is on a VM, and 
both show a significant slowdown on the xfs tests even with your 
slow-tuning patch applied, so it's unlikely to be a measurement fluke 
or some weird platform property.

 I've gone through that patch and didn't spot anything else that is 
 doing wrong that is not already handled in this series. Did you spot 
 anything obviously wrong in that patch that isn't addressed in this 
 series?

I didn't spot anything wrong, but is that a basis to go forward and 
work around the regression, in a way that doesn't even recover lost 
performance?

  And the patch Dave bisected to is a relatively simple patch. Why 
  not simply revert it to see whether that cures much of the 
  problem?
 
 Because it also means reverting all the PROT_NONE handling and going 
 back to _PAGE_NUMA tricks which I expect would be NAKed by Linus.

Yeah, I realize that (and obviously I support the PROT_NONE direction 
that Peter Zijlstra prototyped with the original sched/numa series), 
but can we leave this much of a regression on the table?

I hate to be such a pain in the neck, but especially the 'down tuning' 
of the scanning intensity will make an apples to apples comparison 
harder!

I'd rather not do the slow-tuning part and leave sucky performance in 
place for now and have an easy method plus the motivation to find and 
fix the real cause of the regression, than to partially hide it this 
way ...

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Ingo Molnar

* Mel Gorman mgor...@suse.de wrote:

 xfsrepair
                                  4.0.0-rc1          4.0.0-rc1             3.19.0
                                    vanilla        slowscan-v2            vanilla
 Min      real-fsmark       1157.41 (  0.00%)    1150.38 (  0.61%)    1164.44 ( -0.61%)
 Min      syst-fsmark       3998.06 (  0.00%)    3988.42 (  0.24%)    4016.12 ( -0.45%)
 Min      real-xfsrepair     497.64 (  0.00%)     456.87 (  8.19%)     442.64 ( 11.05%)
 Min      syst-xfsrepair     500.61 (  0.00%)     263.41 ( 47.38%)     194.97 ( 61.05%)
 Amean    real-fsmark       1166.63 (  0.00%)    1155.97 (  0.91%)    1166.28 (  0.03%)
 Amean    syst-fsmark       4020.94 (  0.00%)    4004.19 (  0.42%)    4025.87 ( -0.12%)
 Amean    real-xfsrepair     507.85 (  0.00%)     459.58 (  9.50%)     447.66 ( 11.85%)
 Amean    syst-xfsrepair     519.88 (  0.00%)     281.63 ( 45.83%)     202.93 ( 60.97%)
 Stddev   real-fsmark          6.55 (  0.00%)       3.97 ( 39.30%)       1.44 ( 77.98%)
 Stddev   syst-fsmark         16.22 (  0.00%)      15.09 (  6.96%)       9.76 ( 39.86%)
 Stddev   real-xfsrepair      11.17 (  0.00%)       3.41 ( 69.43%)       5.57 ( 50.17%)
 Stddev   syst-xfsrepair      13.98 (  0.00%)      19.94 (-42.60%)       5.69 ( 59.31%)
 CoeffVar real-fsmark          0.56 (  0.00%)       0.34 ( 38.74%)       0.12 ( 77.97%)
 CoeffVar syst-fsmark          0.40 (  0.00%)       0.38 (  6.57%)       0.24 ( 39.93%)
 CoeffVar real-xfsrepair       2.20 (  0.00%)       0.74 ( 66.22%)       1.24 ( 43.47%)
 CoeffVar syst-xfsrepair       2.69 (  0.00%)       7.08 (-163.23%)      2.80 ( -4.23%)
 Max      real-fsmark       1171.98 (  0.00%)    1159.25 (  1.09%)    1167.96 (  0.34%)
 Max      syst-fsmark       4033.84 (  0.00%)    4024.53 (  0.23%)    4039.20 ( -0.13%)
 Max      real-xfsrepair     523.40 (  0.00%)     464.40 ( 11.27%)     455.42 ( 12.99%)
 Max      syst-xfsrepair     533.37 (  0.00%)     309.38 ( 42.00%)     207.94 ( 61.01%)

Btw., I think it would be nice if these numbers listed v3.19 
performance in the first column, to make it clear at a glance
how much regression we still have?

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-08 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar mi...@kernel.org wrote:
 
  And the patch Dave bisected to is a relatively simple patch. Why 
  not simply revert it to see whether that cures much of the 
  problem?
 
 So the problem with that is that pmd_set_numa() and friends simply 
 no longer exist. So we can't just revert that one patch, it's the 
 whole series, and the whole point of the series.

Yeah.

 What confuses me is that the only real change that I can see in that 
 patch is the change to change_huge_pmd(). Everything else is 
 pretty much a 100% equivalent transformation, afaik. Of course, I 
 may be wrong about that, and missing something silly.

Well, there's a difference in what we write to the pte:

 #define _PAGE_BIT_NUMA  (_PAGE_BIT_GLOBAL+1)
 #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

and our expectation was that the two should be equivalent methods from 
the POV of the NUMA balancing code, right?

 And the changes to change_huge_pmd() were basically re-done
 differently by subsequent patches anyway.
 
 The *only* change I see remaining is that change_huge_pmd() now does
 
entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
set_pmd_at(mm, addr, pmd, entry);
 
 for all changes. It used to do that pmdp_set_numa() for the
 prot_numa case, which did just
 
pmd_t pmd = *pmdp;
pmd = pmd_mknuma(pmd);
set_pmd_at(mm, addr, pmdp, pmd);
 
 instead.
 
 I don't like the old pmdp_set_numa() because it can drop dirty bits,
 so I think the old code was actively buggy.

Could we, as a silly testing hack not to be applied, write a 
hack-patch that re-introduces the racy way of setting the NUMA bit, to 
confirm that it is indeed this difference that changes pte visibility 
across CPUs enough to create so many more faults?

Because if the answer is 'yes', then we can safely say: 'we regressed 
performance because correctness [not dropping dirty bits] comes before 
performance'.

If the answer is 'no', then we still have a mystery (and a regression) 
to track down.

As a second hack (not to be applied), could we change:

 #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

to:

 #define _PAGE_BIT_PROTNONE  (_PAGE_BIT_GLOBAL+1)

to double check that the position of the bit does not matter?

I don't think we've exhausted all avenues of analysis here.

Thanks,

Ingo

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-07 Thread Linus Torvalds
On Sat, Mar 7, 2015 at 8:36 AM, Ingo Molnar mi...@kernel.org wrote:

 And the patch Dave bisected to is a relatively simple patch.
 Why not simply revert it to see whether that cures much of the
 problem?

So the problem with that is that pmd_set_numa() and friends simply
no longer exist. So we can't just revert that one patch, it's the
whole series, and the whole point of the series.

What confuses me is that the only real change that I can see in that
patch is the change to change_huge_pmd(). Everything else is pretty
much a 100% equivalent transformation, afaik. Of course, I may be
wrong about that, and missing something silly.

And the changes to change_huge_pmd() were basically re-done
differently by subsequent patches anyway.

The *only* change I see remaining is that change_huge_pmd() now does

   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
   entry = pmd_modify(entry, newprot);
   set_pmd_at(mm, addr, pmd, entry);

for all changes. It used to do that pmdp_set_numa() for the
prot_numa case, which did just

   pmd_t pmd = *pmdp;
   pmd = pmd_mknuma(pmd);
   set_pmd_at(mm, addr, pmdp, pmd);

instead.

I don't like the old pmdp_set_numa() because it can drop dirty bits,
so I think the old code was actively buggy.

But I do *not* see why the new code would cause more migrations to happen.

There's probably something really stupid I'm missing.

   Linus

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-07 Thread Mel Gorman
On Sat, Mar 07, 2015 at 05:36:58PM +0100, Ingo Molnar wrote:
 
 * Mel Gorman mgor...@suse.de wrote:
 
  Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
  
  Across the board the 4.0-rc1 numbers are much slower, and the 
  degradation is far worse when using the large memory footprint 
  configs. Perf points straight at the cause - this is from 4.0-rc1 on 
  the -o bhash=101073 config:
  
  [...]
 
                 4.0.0-rc1     4.0.0-rc1      3.19.0
                   vanilla   slowscan-v2     vanilla
  User            53384.29      56093.11    46119.12
  System            692.14        311.64      306.41
  Elapsed          1236.87       1328.61     1039.88
  
  Note that the system CPU usage is now similar to 3.19-vanilla.
 
 Similar, but still worse, and also the elapsed time is still much 
 worse. User time is much higher, although it's the same amount of work 
 done on every kernel, right?
 

Elapsed time is primarily worse on one benchmark -- numa01 which is an
adverse workload. The user time differences are also dominated by that
benchmark

                               4.0.0-rc1        4.0.0-rc1           3.19.0
                                 vanilla    slowscan-v2r7          vanilla
Time User-NUMA01              32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
Time User-NUMA01_THEADLOCAL   17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
Time User-NUMA02               2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
Time User-NUMA02_SMT            983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)


  I also tested with a workload very similar to Dave's. The machine 
  configuration and storage is completely different so it's not an 
  equivalent test unfortunately. It's reporting the elapsed time and 
  CPU time while fsmark is running to create the inodes and when 
  running xfsrepair afterwards
  
  xfsrepair
                                   4.0.0-rc1          4.0.0-rc1             3.19.0
                                     vanilla        slowscan-v2            vanilla
  Min      real-fsmark       1157.41 (  0.00%)    1150.38 (  0.61%)    1164.44 ( -0.61%)
  Min      syst-fsmark       3998.06 (  0.00%)    3988.42 (  0.24%)    4016.12 ( -0.45%)
  Min      real-xfsrepair     497.64 (  0.00%)     456.87 (  8.19%)     442.64 ( 11.05%)
  Min      syst-xfsrepair     500.61 (  0.00%)     263.41 ( 47.38%)     194.97 ( 61.05%)
  Amean    real-fsmark       1166.63 (  0.00%)    1155.97 (  0.91%)    1166.28 (  0.03%)
  Amean    syst-fsmark       4020.94 (  0.00%)    4004.19 (  0.42%)    4025.87 ( -0.12%)
  Amean    real-xfsrepair     507.85 (  0.00%)     459.58 (  9.50%)     447.66 ( 11.85%)
  Amean    syst-xfsrepair     519.88 (  0.00%)     281.63 ( 45.83%)     202.93 ( 60.97%)
  Stddev   real-fsmark          6.55 (  0.00%)       3.97 ( 39.30%)       1.44 ( 77.98%)
  Stddev   syst-fsmark         16.22 (  0.00%)      15.09 (  6.96%)       9.76 ( 39.86%)
  Stddev   real-xfsrepair      11.17 (  0.00%)       3.41 ( 69.43%)       5.57 ( 50.17%)
  Stddev   syst-xfsrepair      13.98 (  0.00%)      19.94 (-42.60%)       5.69 ( 59.31%)
  CoeffVar real-fsmark          0.56 (  0.00%)       0.34 ( 38.74%)       0.12 ( 77.97%)
  CoeffVar syst-fsmark          0.40 (  0.00%)       0.38 (  6.57%)       0.24 ( 39.93%)
  CoeffVar real-xfsrepair       2.20 (  0.00%)       0.74 ( 66.22%)       1.24 ( 43.47%)
  CoeffVar syst-xfsrepair       2.69 (  0.00%)       7.08 (-163.23%)      2.80 ( -4.23%)
  Max      real-fsmark       1171.98 (  0.00%)    1159.25 (  1.09%)    1167.96 (  0.34%)
  Max      syst-fsmark       4033.84 (  0.00%)    4024.53 (  0.23%)    4039.20 ( -0.13%)
  Max      real-xfsrepair     523.40 (  0.00%)     464.40 ( 11.27%)     455.42 ( 12.99%)
  Max      syst-xfsrepair     533.37 (  0.00%)     309.38 ( 42.00%)     207.94 ( 61.01%)
  
  The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
  is almost cut in half. It's still not as low as 3.19-vanilla but it's
  much closer
  
                                 4.0.0-rc1    4.0.0-rc1       3.19.0
                                   vanilla  slowscan-v2      vanilla
  NUMA alloc hit               146138883    121929782    104019526
  NUMA alloc miss               13146328     11456356      7806370
  NUMA interleave hit                  0            0            0
  NUMA alloc local             146060848    121865921    103953085
  NUMA base PTE updates        242201535    117237258    216624143
  NUMA huge PMD updates           113270        52121       127782
  NUMA page range updates      300195775    143923210    282048527
  NUMA hint faults             180388025     87299060    147235021
  NUMA hint local faults        72784532     32939258     61866265
  NUMA hint local percent             40           37           42

Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

2015-03-07 Thread Ingo Molnar

* Mel Gorman mgor...@suse.de wrote:

 Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
 
 Across the board the 4.0-rc1 numbers are much slower, and the 
 degradation is far worse when using the large memory footprint 
 configs. Perf points straight at the cause - this is from 4.0-rc1 on 
 the -o bhash=101073 config:
 
 [...]

                4.0.0-rc1     4.0.0-rc1      3.19.0
                  vanilla   slowscan-v2     vanilla
 User            53384.29      56093.11    46119.12
 System            692.14        311.64      306.41
 Elapsed          1236.87       1328.61     1039.88
 
 Note that the system CPU usage is now similar to 3.19-vanilla.

Similar, but still worse, and also the elapsed time is still much 
worse. User time is much higher, although it's the same amount of work 
done on every kernel, right?

 I also tested with a workload very similar to Dave's. The machine 
 configuration and storage is completely different so it's not an 
 equivalent test unfortunately. It's reporting the elapsed time and 
 CPU time while fsmark is running to create the inodes and when 
 running xfsrepair afterwards
 
 xfsrepair
                                  4.0.0-rc1          4.0.0-rc1             3.19.0
                                    vanilla        slowscan-v2            vanilla
 Min      real-fsmark       1157.41 (  0.00%)    1150.38 (  0.61%)    1164.44 ( -0.61%)
 Min      syst-fsmark       3998.06 (  0.00%)    3988.42 (  0.24%)    4016.12 ( -0.45%)
 Min      real-xfsrepair     497.64 (  0.00%)     456.87 (  8.19%)     442.64 ( 11.05%)
 Min      syst-xfsrepair     500.61 (  0.00%)     263.41 ( 47.38%)     194.97 ( 61.05%)
 Amean    real-fsmark       1166.63 (  0.00%)    1155.97 (  0.91%)    1166.28 (  0.03%)
 Amean    syst-fsmark       4020.94 (  0.00%)    4004.19 (  0.42%)    4025.87 ( -0.12%)
 Amean    real-xfsrepair     507.85 (  0.00%)     459.58 (  9.50%)     447.66 ( 11.85%)
 Amean    syst-xfsrepair     519.88 (  0.00%)     281.63 ( 45.83%)     202.93 ( 60.97%)
 Stddev   real-fsmark          6.55 (  0.00%)       3.97 ( 39.30%)       1.44 ( 77.98%)
 Stddev   syst-fsmark         16.22 (  0.00%)      15.09 (  6.96%)       9.76 ( 39.86%)
 Stddev   real-xfsrepair      11.17 (  0.00%)       3.41 ( 69.43%)       5.57 ( 50.17%)
 Stddev   syst-xfsrepair      13.98 (  0.00%)      19.94 (-42.60%)       5.69 ( 59.31%)
 CoeffVar real-fsmark          0.56 (  0.00%)       0.34 ( 38.74%)       0.12 ( 77.97%)
 CoeffVar syst-fsmark          0.40 (  0.00%)       0.38 (  6.57%)       0.24 ( 39.93%)
 CoeffVar real-xfsrepair       2.20 (  0.00%)       0.74 ( 66.22%)       1.24 ( 43.47%)
 CoeffVar syst-xfsrepair       2.69 (  0.00%)       7.08 (-163.23%)      2.80 ( -4.23%)
 Max      real-fsmark       1171.98 (  0.00%)    1159.25 (  1.09%)    1167.96 (  0.34%)
 Max      syst-fsmark       4033.84 (  0.00%)    4024.53 (  0.23%)    4039.20 ( -0.13%)
 Max      real-xfsrepair     523.40 (  0.00%)     464.40 ( 11.27%)     455.42 ( 12.99%)
 Max      syst-xfsrepair     533.37 (  0.00%)     309.38 ( 42.00%)     207.94 ( 61.01%)
 
 The key point is that system CPU usage for xfsrepair (syst-xfsrepair)
 is almost cut in half. It's still not as low as 3.19-vanilla but it's
 much closer
 
                                4.0.0-rc1    4.0.0-rc1       3.19.0
                                  vanilla  slowscan-v2      vanilla
 NUMA alloc hit               146138883    121929782    104019526
 NUMA alloc miss               13146328     11456356      7806370
 NUMA interleave hit                  0            0            0
 NUMA alloc local             146060848    121865921    103953085
 NUMA base PTE updates        242201535    117237258    216624143
 NUMA huge PMD updates           113270        52121       127782
 NUMA page range updates      300195775    143923210    282048527
 NUMA hint faults             180388025     87299060    147235021
 NUMA hint local faults        72784532     32939258     61866265
 NUMA hint local percent             40           37           42
 NUMA pages migrated           71175262     41395302     23237799
 
 Note the big differences in faults trapped and pages migrated. 
 3.19-vanilla still migrated fewer pages but if necessary the 
 threshold at which we start throttling migrations can be lowered.

This too is still worse than what v3.19 had.

So what worries me is that Dave bisected the regression to:

  4d9424669946 (mm: convert p[te|md]_mknonnuma and remaining page table manipulations)

And clearly your patch #4 just tunes balancing/migration intensity - 
is that a workaround for the real problem/bug?

And the patch Dave bisected to is a relatively simple patch.
Why not simply revert it to see whether that cures much of the 
problem?

Am I missing something fundamental?

Thanks,

Ingo