Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-17 Thread Ingo Molnar

* Mel Gorman  wrote:

> > > [...] Holding PTL across task_numa_fault is bad, but not 
> > > the bad we're looking for.
> > 
> > No, holding the PTL across task_numa_fault() is fine, 
> > because this bit got reworked in my tree rather 
> > significantly, see:
> > 
> >  6030a23a1c66 sched: Move the NUMA placement logic to a 
> >  worklet
> > 
> > and followup patches.
> 
> I believe I see your point. After that patch is applied 
> task_numa_fault() is a relatively small function and is no 
> longer calling task_numa_placement. Sure, PTL is held longer 
> than necessary but not enough to cause real scalability 
> issues.

Yes - my motivation for that was three-fold:

1) to push rebalancing into process context and thus make it
   essentially lockless and also potentially preemptable.

2) to enable the flip-tasks logic, which relies on taking a
   balancing decision and acting on it immediately. If you are
   in process context then this is doable. If you are in a
   balancing irq context then not so much.

3) simplifying the 2M-emu loop was extra icing on the cake:
   instead of taking and dropping the PTL 512 times (possibly
   interleaving two threads on the same pmd, both of them
   taking/dropping the same set of locks?), it only takes the
   PTL once.

I'll revive this aspect, it has many positives.
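
For illustration, here is roughly what that single-PTL batch loop
looks like - a minimal sketch assuming the pte_offset_map_lock() and
pte_numa()/pte_mknonnuma() helpers of that era, not the actual
numa/core code:

	/*
	 * Sketch: clear _PAGE_NUMA across one pmd's worth of PTEs under
	 * a single PTL acquisition instead of 512 lock/unlock cycles.
	 * Migration decisions and error handling are omitted.
	 */
	static void numa_clear_pmd_batch(struct mm_struct *mm, pmd_t *pmd,
					 unsigned long addr)
	{
		unsigned long end = pmd_addr_end(addr, addr + PMD_SIZE);
		spinlock_t *ptl;
		pte_t *pte;

		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);	/* take PTL once */
		for (; addr != end; addr += PAGE_SIZE, pte++) {
			pte_t entry = *pte;

			if (pte_present(entry) && pte_numa(entry))
				set_pte_at(mm, addr, pte, pte_mknonnuma(entry));
		}
		pte_unmap_unlock(pte - 1, ptl);			/* drop PTL once */
	}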

> > > If the bug is indeed here, it's not obvious. I don't know 
> > > why I'm triggering it or why it only triggers for specjbb 
> > > as I cannot imagine what the JVM would be doing that is 
> > > that weird or that would not have triggered before. Maybe 
> > > we both suffer this type of problem but that numacore's 
> > > rate of migration is able to trigger it.
> > 
> > Agreed.
> 
> I spent some more time on this today and the bug is *really* 
> hard to trigger or at least I have been unable to trigger it 
> today. This raises the question of why it triggered three times in 
> relatively quick succession separated by a few hours when 
> testing numacore on Dec 9th. Other tests ran between the 
> failures. The first failure results were discarded. I deleted 
> them to see if the same test reproduced it a second time (it 
> did).
>
> Of the three times this bug triggered in the last week, two 
> were unclear where they crashed but one showed that the bug 
> was triggered by the JVM's garbage collector. That at least is 
> a corner case and might explain why it's hard to trigger.
> 
> I feel extremely bad about how I reported this because even 
> though we differ in how we handle faults, I really cannot see 
> any difference that would explain this and I've looked long 
> enough. Triggering this by the kernel would *have* to be 
> something like missing a cache or TLB flush after page tables 
> have been modified or during migration but in most ways that 
> matter we share that logic. Where we differ, it shouldn't 
> matter.

Don't worry, I really think you reported a genuine bug, even if 
it's hard to hit.

> FWIW, numacore pulled yesterday completed the same tests 
> without any error this time but none of the commits since Dec 
> 9th would account for fixing it.

Correct. I think chances are that it's still latent. Either 
fixed in your version of the code, which will be hard to 
reconstruct - or it's an active upstream bug.

I'd not blame it on the JVM for a good while - JVMs are one of 
the most abused pieces of code on the planet, literally running 
millions of applications on thousands of kernel variants.

Could you try the patch below on latest upstream with 
CONFIG_NUMA_BALANCING=y, it increases migration bandwidth 
10-fold - does it make it easier to trigger the bug on the now 
upstream NUMA-balancing feature?

It will kill throughput on a number of your tests, but it should 
make all the NUMA-specific activities during the JVM test a lot 
more frequent.

Thanks,

Ingo

diff --git a/mm/migrate.c b/mm/migrate.c
index 32efd80..8699e8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1511,7 +1511,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
 static unsigned int pteupdate_interval_millisecs __read_mostly = 1000;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 1280 << (20 - PAGE_SHIFT);
 
 /* Returns true if NUMA migration is currently rate limited */
 bool migrate_ratelimited(int node)
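
( For scale: ratelimit_pages is expressed in pages per
  migrate_interval_millisecs window, so with 4K pages, i.e.
  PAGE_SHIFT == 12, the change works out to:

	 128 << (20 - 12) =  32768 pages =  128MB per 100ms (~1.28GB/s)
	1280 << (20 - 12) = 327680 pages = 1280MB per 100ms (~12.8GB/s)

  which is the 10-fold bandwidth increase mentioned above. )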


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-13 Thread Srikar Dronamraju
* Mel Gorman  [2012-12-07 10:23:03]:

> This is a full release of all the patches so apologies for the flood.  V9 was
> just a MIPS build fix and did not justify a full release. V10 includes Ingo's
> scalability patches because even though they increase system CPU usage,
> they also helped in a number of test cases. It would be worthwhile trying
> to reduce the system CPU usage by looking closer at how rwsem works and
> dealing with the contended case a bit better. Otherwise the rate of change
> in the last few weeks has been tiny as the preliminary objectives had been
> met and I did not want to invalidate any testing other people had conducted.
> 
> git tree: 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git 
> mm-balancenuma-v10r3
> git tag:  
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git 
> mm-balancenuma-v10

Here are the specjbb results on a 2-node, 24 GB machine.
vm_1 was allocated 12 GB, while vm_2 and vm_3 were allocated 6 GB each.
All VMs were running the specjbb2005 workload.

All numbers presented are improvements/regressions from v3.7-rc8.

-------------------------------------------------------------------------------------------------
|                       |      |             nofit             |              fit              |
-------------------------------------------------------------------------------------------------
|                       |      |     noksm     |      ksm      |     noksm     |      ksm      |
-------------------------------------------------------------------------------------------------
|                       |      | nothp |  thp  | nothp |  thp  | nothp |  thp  | nothp |  thp  |
-------------------------------------------------------------------------------------------------
| autonuma-mels-rebase  | vm_1 |  2.48 | 14.25 |  1.80 | 15.59 |  8.16 | 14.62 |  8.56 | 17.49 |
| autonuma-mels-rebase  | vm_2 | 23.59 | 18.67 | 14.20 | 23.25 | 10.73 | 13.18 | 17.94 | 21.72 |
| autonuma-mels-rebase  | vm_3 | 16.19 | 19.40 | 14.42 | 22.54 | 11.08 | 12.04 |  9.79 | 20.34 |
-------------------------------------------------------------------------------------------------
| mel-balancenuma v10r3 | vm_1 |  0.10 |  1.49 |  1.78 |  4.00 | -1.01 | -1.16 | -1.02 | -0.60 |
| mel-balancenuma v10r3 | vm_2 |  3.45 | -0.67 | -1.54 |  2.65 | -2.83 | -7.10 |  0.10 | -2.41 |
| mel-balancenuma v10r3 | vm_3 |  0.56 |  5.49 | -0.63 |  0.09 | -7.41 | -4.52 | -0.77 | -1.80 |
-------------------------------------------------------------------------------------------------
| tip-master 11-dec     | vm_1 | -5.68 | 12.34 | 35.96 | 13.33 | 10.79 | 15.22 |  9.65 | 12.80 |
| tip-master 11-dec     | vm_2 | 14.70 | 15.54 | 77.45 | 15.10 | 12.82 | 11.20 | 12.66 |  na   |
| tip-master 11-dec     | vm_3 |  6.66 | 19.26 |  na   | 14.93 |  7.62 | 14.72 | 14.73 | 12.34 |
-------------------------------------------------------------------------------------------------


There are a couple of na's. In those cases, the test log for some weird
reason didn't have any data. This somehow seems to happen with the
tip/master kernel only. Maybe it's just coincidence.

-- 
Thanks and Regards
Srikar

PS: benchmark was run under non-standard conditions, run only for the
purpose of relative comparison of different kernels.



Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-11 Thread Mel Gorman
On Tue, Dec 11, 2012 at 09:52:38AM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman  wrote:
> 
> > On Mon, Dec 10, 2012 at 03:24:05PM +, Mel Gorman wrote:
> > > For example, I think that point 5 above is the potential source of the
> > > corruption: you're not flushing the TLBs for the PTEs you are
> > > updating in batch. Granted, you're relaxing rather than restricting access
> > > so it should be ok and at worst cause a spurious fault but I also find
> > > it suspicious that you do not recheck pte_same under the PTL when doing
> > > the final PTE update.
> > 
> > Looking again, the lack of a pte_same check should be ok. The 
> > addr, addr_start, ptep and ptep_start are a little messy but 
> > also look fine. You're not accidentally crossing a PMD 
> > boundary. You should be protected against huge pages being 
> > collapsed underneath you as you hold mmap_sem for read. If the 
> > first page in the pmd (or VMA) is not present then target_nid 
> > == -1 which gets passed into __do_numa_page. This check
> > 
> > if (target_nid == -1 || target_nid == page_nid)
> > goto out;
> > 
> > then means you never actually migrate for that whole PMD and 
> > will just clear the PTEs. [...]
> 
> Yes.
> 
> > [...] Possibly wrong, but not what we're looking for. [...]
> 
> It's a detail - I thought not touching partial 2MB pages is just 
> as valid as picking some other page to represent it, and went 
> for the simpler option.
> 

I very strongly suspect that in the majority of cases it behaves just
as well. I considered whether it makes a difference if the first page
or faulting page was used as the hint but concluded it doesn't.  If the
workload is converged on the PMD, it makes no difference. If it's not,
then tasks are equally affected at least.

> But yes, I agree that using the first present page would be 
> better, as it would better handle partial vmas not 
> starting/ending at a 2MB boundary - which happens frequently in 
> practice.
> 
> > [...] Holding PTL across task_numa_fault is bad, but not the 
> > bad we're looking for.
> 
> No, holding the PTL across task_numa_fault() is fine, because 
> this bit got reworked in my tree rather significantly, see:
> 
>  6030a23a1c66 sched: Move the NUMA placement logic to a worklet
> 
> and followup patches.
> 

I believe I see your point. After that patch is applied task_numa_fault()
is a relatively small function and is no longer calling task_numa_placement.
Sure, PTL is held longer than necessary but not enough to cause real
scalability issues.

> > If the bug is indeed here, it's not obvious. I don't know why 
> > I'm triggering it or why it only triggers for specjbb as I 
> > cannot imagine what the JVM would be doing that is that weird 
> > or that would not have triggered before. Maybe we both suffer 
> > this type of problem but that numacore's rate of migration is 
> > able to trigger it.
> 
> Agreed.
> 

I spent some more time on this today and the bug is *really* hard to trigger
or at least I have been unable to trigger it today. This raises the question of
why it triggered three times in relatively quick succession separated by
a few hours when testing numacore on Dec 9th. Other tests ran between the
failures. The first failure results were discarded. I deleted them to see
if the same test reproduced it a second time (it did).

Of the three times this bug triggered in the last week, two were unclear
where they crashed but one showed that the bug was triggered by the JVM's
garbage collector. That at least is a corner case and might explain why
it's hard to trigger.

I feel extremely bad about how I reported this because even though we
differ in how we handle faults, I really cannot see any difference that
would explain this and I've looked long enough. Triggering this by the
kernel would *have* to be something like missing a cache or TLB flush
after page tables have been modified or during migration but in most ways
that matter we share that logic. Where we differ, it shouldn't matter.

I'm even contemplating that this is a JVM timing bug that can be triggered if
page migration happens at the wrong time. numacore would only be indirectly
at fault by migrating more often. If this was the case, balancenuma would
hit the problem given enough time.

I'll keep kicking it in the background.

FWIW, numacore pulled yesterday completed the same tests without any error
this time but none of the commits since Dec 9th would account for fixing it.

> > > Basically if I felt that handling ptes in batch like this 
> > > was of critical important I would have implemented it very 
> > > differently on top of balancenuma. I would have only taken 
> > > the PTL lock if updating the PTE to keep contention down and 
> > > redid racy checks under PTL, I'd have only used trylock for 
> > > every non-faulted PTE and I would only have migrated if it 
> > > was a remote->local copy. I certainly would not hold PTL 
> > > while calling task_numa_fault(). I would have kept the 
> > > handling on a per-pmd basis when it was expected that most 
> > > PTEs underneath should be on the same node.

Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-11 Thread Mel Gorman
On Tue, Dec 11, 2012 at 10:18:07AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > > This is a prototype only but what I was using as a reference 
> > > to see whether I could spot a problem in yours. It has not even 
> > > been boot tested but avoids remote->remote copies, contending on 
> > > PTL or holding it longer than necessary (should anyway)
> > 
> > So ... because time is running out and it would be nice to 
> > progress with this for v3.8, I'd suggest the following 
> > approach:
> > 
> >  - Please send your current tree to Linus as-is. You already 
> >have my Acked-by/Reviewed-by for its scheduler bits, and my
> >testing found your tree to have no regression to mainline,
> >plus it's a nice win in a number of NUMA-intense workloads.
> >So it's a good, monotonic step forward in terms of NUMA
> >balancing, very close to what the bits I'm working on need as
> >infrastructure.
> > 
> >  - I'll rebase all my devel bits on top of it. Instead of
> >removing the migration bandwidth I'll simply increase it for
> >testing - this should trigger similarly aggressive behavior.
> >I'll try to touch as little of the mm/ code as possible, to
> >keep things debuggable.
> 
> One minor last-minute request/nit before you send it to Linus, 
> would you mind doing a:
> 
>CONFIG_BALANCE_NUMA => CONFIG_NUMA_BALANCING
> 
> rename please? (I can do it for you if you don't have the time.)
> 
> CONFIG_NUMA_BALANCING is really what fits into our existing NUMA 
> namespace, CONFIG_NUMA, CONFIG_NUMA_EMU - and, more importantly, 
> the ordering of words follows the common generic -> less generic 
> ordering we do in the kernel for config names and methods.
> 
> So it would fit nicely into existing Kconfig naming schemes:
> 
>CONFIG_TRACING
>CONFIG_FILE_LOCKING
>CONFIG_EVENT_TRACING
> 
> etc.
> 

Yes, that makes sense. I should have spotted the rationale. I also took
the liberty of renaming the command-line parameter and the variables to
be consistent with this.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-11 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > This is a prototype only but what I was using as a reference 
> > to see whether I could spot a problem in yours. It has not even 
> > been boot tested but avoids remote->remote copies, contending on 
> > PTL or holding it longer than necessary (should anyway)
> 
> So ... because time is running out and it would be nice to 
> progress with this for v3.8, I'd suggest the following 
> approach:
> 
>  - Please send your current tree to Linus as-is. You already 
>have my Acked-by/Reviewed-by for its scheduler bits, and my
>testing found your tree to have no regression to mainline,
>plus it's a nice win in a number of NUMA-intense workloads.
>So it's a good, monotonic step forward in terms of NUMA
>balancing, very close to what the bits I'm working on need as
>infrastructure.
> 
>  - I'll rebase all my devel bits on top of it. Instead of
>removing the migration bandwidth I'll simply increase it for
>testing - this should trigger similarly aggressive behavior.
>I'll try to touch as little of the mm/ code as possible, to
>keep things debuggable.

One minor last-minute request/nit before you send it to Linus, 
would you mind doing a:

   CONFIG_BALANCE_NUMA => CONFIG_NUMA_BALANCING

rename please? (I can do it for you if you don't have the time.)

CONFIG_NUMA_BALANCING is really what fits into our existing NUMA 
namespace, CONFIG_NUMA, CONFIG_NUMA_EMU - and, more importantly, 
the ordering of words follows the common generic -> less generic 
ordering we do in the kernel for config names and methods.

So it would fit nicely into existing Kconfig naming schemes:

   CONFIG_TRACING
   CONFIG_FILE_LOCKING
   CONFIG_EVENT_TRACING

etc.
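
A minimal sketch of what the renamed option could look like in
Kconfig - illustrative only, the exact dependencies and help text
here are assumptions rather than taken from the tree:

	config NUMA_BALANCING
		bool "Automatic NUMA memory balancing"
		depends on SMP && NUMA && MIGRATION
		help
		  Automatically place tasks and their memory near each
		  other on NUMA machines, driven by NUMA hinting faults.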

Thanks,

Ingo


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-11 Thread Ingo Molnar

* Mel Gorman  wrote:

> On Mon, Dec 10, 2012 at 03:24:05PM +, Mel Gorman wrote:
> > For example, I think that point 5 above is the potential source of the
> > corruption: you're not flushing the TLBs for the PTEs you are
> > updating in batch. Granted, you're relaxing rather than restricting access
> > so it should be ok and at worst cause a spurious fault but I also find
> > it suspicious that you do not recheck pte_same under the PTL when doing
> > the final PTE update.
> 
> Looking again, the lack of a pte_same check should be ok. The 
> addr, addr_start, ptep and ptep_start are a little messy but 
> also look fine. You're not accidentally crossing a PMD 
> boundary. You should be protected against huge pages being 
> collapsed underneath you as you hold mmap_sem for read. If the 
> first page in the pmd (or VMA) is not present then target_nid 
> == -1 which gets passed into __do_numa_page. This check
> 
> if (target_nid == -1 || target_nid == page_nid)
> goto out;
> 
> then means you never actually migrate for that whole PMD and 
> will just clear the PTEs. [...]

Yes.

> [...] Possibly wrong, but not what we're looking for. [...]

It's a detail - I thought not touching partial 2MB pages is just 
as valid as picking some other page to represent it, and went 
for the simpler option.

But yes, I agree that using the first present page would be 
better, as it would better handle partial vmas not 
starting/ending at a 2MB boundary - which happens frequently in 
practice.

> [...] Holding PTL across task_numa_fault is bad, but not the 
> bad we're looking for.

No, holding the PTL across task_numa_fault() is fine, because 
this bit got reworked in my tree rather significantly, see:

 6030a23a1c66 sched: Move the NUMA placement logic to a worklet

and followup patches.

> /me scratches his head
> 
> Machine is still unavailable so in an attempt to rattle this 
> out I prototyped the equivalent patch for balancenuma and then 
> went back to numacore to see whether I could spot a major difference. 
> Comparing them, there is no guarantee you clear pte_numa for 
> the address that was originally faulted if there was a racing 
> fault that cleared it underneath you but in itself that should 
> not be an issue. Your use of ptep++ instead of 
> pte_offset_map() might break on 32-bit with NUMA support if 
> PTE pages are stored in highmem. Still the wrong kind of wrong.

Yes.

> If the bug is indeed here, it's not obvious. I don't know why 
> I'm triggering it or why it only triggers for specjbb as I 
> cannot imagine what the JVM would be doing that is that weird 
> or that would not have triggered before. Maybe we both suffer 
> this type of problem but that numacore's rate of migration is 
> able to trigger it.

Agreed.

> > Basically if I felt that handling ptes in batch like this 
> > was of critical important I would have implemented it very 
> > differently on top of balancenuma. I would have only taken 
> > the PTL lock if updating the PTE to keep contention down and 
> > redid racy checks under PTL, I'd have only used trylock for 
> > every non-faulted PTE and I would only have migrated if it 
> > was a remote->local copy. I certainly would not hold PTL 
> > while calling task_numa_fault(). I would have kept the 
> > handling on a per-pmd basis when it was expected that most 
> > PTEs underneath should be on the same node.
> 
> This is a prototype only but what I was using as a reference to 
> see whether I could spot a problem in yours. It has not even been 
> boot tested but avoids remote->remote copies, contending on PTL or 
> holding it longer than necessary (should anyway)

So ... because time is running out and it would be nice to 
progress with this for v3.8, I'd suggest the following approach:

 - Please send your current tree to Linus as-is. You already 
   have my Acked-by/Reviewed-by for its scheduler bits, and my
   testing found your tree to have no regression to mainline,
   plus it's a nice win in a number of NUMA-intense workloads.
   So it's a good, monotonic step forward in terms of NUMA
   balancing, very close to what the bits I'm working on need as
   infrastructure.

 - I'll rebase all my devel bits on top of it. Instead of
   removing the migration bandwidth I'll simply increase it for
   testing - this should trigger similarly aggressive behavior.
   I'll try to touch as little of the mm/ code as possible, to
   keep things debuggable.

If the JVM segfault is a bug introduced by some non-obvious 
difference only present in numa/core and fixed in your tree then 
the bug will be fixed magically and we can forget about it.

If it's something latent in your tree as well, then at least we 
will be able to stare at the exact same tree, instead of 
endlessly wondering about small, unnecessary differences.

( My gut feeling is that it's 50%/50%, I really cannot exclude
  any of the two possibilities. )

Agreed?

Thanks,

Ingo

Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Mel Gorman
On Mon, Dec 10, 2012 at 03:24:05PM +, Mel Gorman wrote:
> For example, I think that point 5 above is the potential source of the
> corruption: you're not flushing the TLBs for the PTEs you are
> updating in batch. Granted, you're relaxing rather than restricting access
> so it should be ok and at worst cause a spurious fault but I also find
> it suspicious that you do not recheck pte_same under the PTL when doing
> the final PTE update.

Looking again, the lack of a pte_same check should be ok. The addr,
addr_start, ptep and ptep_start are a little messy but also look fine.
You're not accidentally crossing a PMD boundary. You should be protected
against huge pages being collapsed underneath you as you hold mmap_sem for
read. If the first page in the pmd (or VMA) is not present then
target_nid == -1 which gets passed into __do_numa_page. This check

if (target_nid == -1 || target_nid == page_nid)
goto out;

then means you never actually migrate for that whole PMD and will just
clear the PTEs. Possibly wrong, but not what we're looking for. Holding
PTL across task_numa_fault is bad, but not the bad we're looking for.

/me scratches his head

Machine is still unavailable so in an attempt to rattle this out I prototyped
the equivalent patch for balancenuma and then went back to numacore to see
whether I could spot a major difference. Comparing them, there is no guarantee you
clear pte_numa for the address that was originally faulted if there was a
racing fault that cleared it underneath you but in itself that should not
be an issue. Your use of ptep++ instead of pte_offset_map() might break
on 32-bit with NUMA support if PTE pages are stored in highmem. Still the
wrong kind of wrong.
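
To make the 32-bit concern concrete: with CONFIG_HIGHPTE the PTE page
is only temporarily mapped, so PTE pointers must come from (and be
released via) the map/unmap helpers. A sketch of the safe pattern,
not code from either tree:

	/*
	 * pte_offset_map() kmap_atomic()s the PTE page on 32-bit with
	 * highmem PTE pages; pte_unmap() releases that mapping. A bare
	 * ptep++ is only valid while the mapping is live and stays
	 * within the same PTE page.
	 */
	pte_t *ptep = pte_offset_map(pmd, addr);	/* establish mapping */
	pte_t entry = *ptep;
	pte_unmap(ptep);				/* ptep invalid after this */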

If the bug is indeed here, it's not obvious. I don't know why I'm
triggering it or why it only triggers for specjbb as I cannot imagine
what the JVM would be doing that is that weird or that would not have
triggered before. Maybe we both suffer this type of problem but that
numacore's rate of migration is able to trigger it.

> Basically if I felt that handling ptes in batch like this was of
> critical important I would have implemented it very differently on top of
> balancenuma. I would have only taken the PTL lock if updating the PTE to
> keep contention down and redid racy checks under PTL, I'd have only used
> trylock for every non-faulted PTE and I would only have migrated if it
> was a remote->local copy. I certainly would not hold PTL while calling
> task_numa_fault(). I would have kept the handling on a per-pmd basis when
> it was expected that most PTEs underneath should be on the same node.
> 

This is a prototype only but what I was using as a reference to see whether
I could spot a problem in yours. It has not even been boot tested but avoids
remote->remote copies, contending on PTL or holding it longer than necessary
(should anyway)

---8<---
mm: numa: Batch pte handling

diff --git a/mm/memory.c b/mm/memory.c
index 33e20b3..f871d5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,30 +3461,14 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
 }
 
-int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-  unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static
+int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+  unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd,
+  spinlock_t *ptl, bool only_local, bool *migrated)
 {
struct page *page = NULL;
-   spinlock_t *ptl;
int current_nid = -1;
int target_nid;
-   bool migrated = false;
-
-   /*
-   * The "pte" at this point cannot be used safely without
-   * validation through pte_unmap_same(). It's of NUMA type but
-   * the pfn may be screwed if the read is non atomic.
-   *
-   * ptep_modify_prot_start is not called as this is clearing
-   * the _PAGE_NUMA bit and it is not really expected that there
-   * would be concurrent hardware modifications to the PTE.
-   */
-   ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   if (unlikely(!pte_same(*ptep, pte))) {
-   pte_unmap_unlock(ptep, ptl);
-   goto out;
-   }
 
pte = pte_mknonnuma(pte);
set_pte_at(mm, addr, ptep, pte);
@@ -3493,7 +3477,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pte);
if (!page) {
pte_unmap_unlock(ptep, ptl);
-   return 0;
+   goto out;
}
 
current_nid = page_to_nid(page);
@@ -3509,15 +3493,88 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
}
 
+   /*
+* Only do remote-local copies when handling PTEs in batch. This does
+* mean we effectively lost the NUMA hinting fault if the workload
+* was n

Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Ingo Molnar

* Srikar Dronamraju  wrote:

> KernelVersion: 3.7.0-rc8-tip_master+(December 7th Snapshot)

> Please do let me know if you have questions/suggestions.

Do you still have the exact sha1 by any chance?

By the date of the snapshot I'd say that this fix:

  f0c77b62ba9d sched: Fix NUMA_EXCLUDE_AFFINE check

could improve performance on your box.

Thanks,

Ingo


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Srikar Dronamraju
Hi Mel, Ingo, 

Here are the results of running autonumabenchmark on a 64-core, 8-node
machine. It has six 32 GB nodes and two 64 GB nodes.


KernelVersion: 3.7.0-rc8
Testcase:                              Min      Max      Avg
numa01:                            1475.37  1615.39  1555.24
numa01_HARD_BIND:                   900.42  1244.00   993.30
numa01_INVERSE_BIND:               2835.44  5067.22  3634.86
numa01_THREAD_ALLOC:                918.51  1384.21  1121.17
numa01_THREAD_ALLOC_HARD_BIND:      599.58  1178.26   792.73
numa01_THREAD_ALLOC_INVERSE_BIND:  1841.33  2237.34  1988.95
numa02:                             126.95   188.31   147.04
numa02_HARD_BIND:                    26.05    29.17    26.94
numa02_INVERSE_BIND:                341.10   369.37   349.10
numa02_SMT:                         144.32   922.65   386.43
numa02_SMT_HARD_BIND:                26.61   170.71   101.98
numa02_SMT_INVERSE_BIND:            288.12   456.45   325.26

KernelVersion: 3.7.0-rc8-tip_master+(December 7th Snapshot)
Testcase:                              Min      Max      Avg  %Change
numa01:                            2927.89  3217.56  3103.21  -49.88%
numa01_HARD_BIND:                  2653.09  5964.23  3431.35  -71.05%
numa01_INVERSE_BIND:               3567.03  3933.18  3811.91   -4.64%
numa01_THREAD_ALLOC:               1801.80  2339.16  1980.96  -43.40%
numa01_THREAD_ALLOC_HARD_BIND:     1705.84  2110.06  1913.64  -58.57%
numa01_THREAD_ALLOC_INVERSE_BIND:  2266.12  2540.61  2376.67  -16.31%
numa02:                             179.26   358.03   264.19  -44.34%
numa02_HARD_BIND:                    26.07    29.38    27.70   -2.74%
numa02_INVERSE_BIND:                337.99   347.95   343.51    1.63%
numa02_SMT:                          93.65   402.58   213.15   81.29%
numa02_SMT_HARD_BIND:                91.19   140.47   116.26  -12.28%
numa02_SMT_INVERSE_BIND:            289.03   299.57   297.01    9.51%

KernelVersion: 3.7.0-rc6-mel_auto_balance(mm-balancenuma-v10r3)
Testcase:                              Min      Max      Avg  %Change
numa01:                            1536.93  1819.85  1694.54   -8.22%
numa01_HARD_BIND:                   909.67  1145.32  1055.57   -5.90%
numa01_INVERSE_BIND:               2882.07  3287.24  2976.89   22.10%
numa01_THREAD_ALLOC:                995.79  4845.27  1905.85  -41.17%
numa01_THREAD_ALLOC_HARD_BIND:      582.36   818.11   655.18   20.99%
numa01_THREAD_ALLOC_INVERSE_BIND:  1790.91  1927.90  1868.49    6.45%
numa02:                             131.53   287.93   209.15  -29.70%
numa02_HARD_BIND:                    25.68    31.90    27.66   -2.60%
numa02_INVERSE_BIND:                341.09   401.37   353.84   -1.34%
numa02_SMT:                         156.61  2036.63   731.97  -47.21%
numa02_SMT_HARD_BIND:                25.10   196.60    79.72   27.92%
numa02_SMT_INVERSE_BIND:            294.22  1801.59   824.41  -60.55%

KernelVersion: 3.7.0-rc6-autonuma+(mm-autonuma-v28fastr4-mels-rebase)
Testcase:                              Min      Max      Avg  %Change
numa01:                            1596.13  1715.34  1649.44   -5.71%
numa01_HARD_BIND:                   920.75  1127.86  1012.50   -1.90%
numa01_INVERSE_BIND:               2858.79  3146.74  2977.16   22.09%
numa01_THREAD_ALLOC:                250.55   374.27   290.12  286.45%
numa01_THREAD_ALLOC_HARD_BIND:      572.29   712.74   630.62   25.71%
numa01_THREAD_ALLOC_INVERSE_BIND:  1835.94  2401.04  2011.20   -1.11%
numa02:                              33.93   104.80    50.99  188.37%
numa02_HARD_BIND:                    25.94    27.51    26.42    1.97%
numa02_INVERSE_BIND:                334.57   349.51   341.23    2.31%
numa02_SMT:                          43.72   114.82    62.41  519.18%
numa02_SMT_HARD_BIND:                34.98    45.61    42.07  142.41%
numa02_SMT_INVERSE_BIND:            284.57   310.62   298.51    8.96%

Avg refers to the mean of 5 iterations of autonuma-benchmark.
%Change refers to the percentage change from v3.7-rc8.
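
As a worked example of how to read %Change (these are runtimes, so
lower is better), numa01 on the December 7th tip_master snapshot is
consistent with:

	%Change = (baseline Avg / kernel Avg - 1) * 100
		= (1555.24 / 3103.21 - 1) * 100 ~= -49.88%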

Please do let me know if you have questions/suggestions.

-- 
Thanks and Regards
Srikar



Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Mel Gorman
On Mon, Dec 10, 2012 at 12:39:45PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman  wrote:
> 
> > On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:
> > > 
> > > * Mel Gorman  wrote:
> > > 
> > > > This is a full release of all the patches so apologies for the 
> > > > flood. [...]
> > > 
> > > I have yet to process all your mails, but assuming I address all 
> > > your review feedback and the latest unified tree in tip:master 
> > > shows no regression in your testing, would you be willing to 
> > > start using it for ongoing work?
> > > 
> > 
> > Ingo,
> > 
> > If you had read the second paragraph of the mail you just responded to or
> > the results at the end then you would have seen that I had problems with
> > the performance. [...]
> 
> I've posted a (NUMA-placement sensitive workload centric) 
> performance comparisons between "balancenuma", AutoNUMA and 
> numa/core unified-v3 to:
> 
>https://lkml.org/lkml/2012/12/7/331
> 
> I tried to address all performance regressions you and others 
> have reported.
> 

I've responded to this now. I acknowledge that balancenuma does not do
great on them. I've also explained that it's very likely because I did
not hook into the scheduler and I'm reluctant to do so. Once I do that,
we're directly colliding when my intention was to handle all the necessary
MM changes and the bare minimum of the scheduler hook, and maintain that
side while numacore and all the additional scheduler changes were built on top.

> 
> I also tried to reproduce and fix as many bugs you reported as 
> possible - but my point is that it would be _much_ better if we 
> actually joined forces.
> 

Which is what balancenuma was meant to do and what I wanted weeks ago
-- I wanted to keep a handle on the mm side of things and establish
performance baseline for just the mm side that numacore could be compared
against.  I'd then help maintain the result, review patches particularly
affecting mm etc.  I was hoping that numacore would be rebased to carry
the necessary scheduler changes but that didn't happen. The unified tree
is not equivalent. Just off-hand:

1. there is no performance comparison possible with just the mm changes
2. the vmstat fault accounting is broken in the unified tree
3. the code to allow balancenuma to be disabled from command line
   was removed which the THP experience has told us is very useful
4. The THP patch was wedged in as hard as possible making it effectively
   impossible to treat in isolation
5. ptes are treated as effective hugepage faults which potentially
   results in remote->remote copies if tasks share data on a
   PMD-boundary even if they do not share data on the page boundary.
   For this reason I dislike it quite a bit
6. the migrate rate-limiting code was removed

To be fair, the last one is a difference in opinion. I think migrate
rate-limiting is important because I think it's more important for the
workload to run than for the kernel to get too much in the way thinking
it can do better.

Some of the other changes just made no sense to me and I still fail to
see why you didn't rebase numacore a few weeks ago and instead smacked the
trees together. If it had been a plain rebase then I would have switched
to looking at just numacore on top without having to worry if something
unexpected was broken on the MM side. If something had broken on the MM
side, I'd be on it without wondering if it was due to how the trees were
merged.

For example, I think that point 5 above is the potential source of the
corruption: you're not flushing the TLBs for the PTEs you are
updating in batch. Granted, you're relaxing rather than restricting access
so it should be ok and at worst cause a spurious fault but I also find
it suspicious that you do not recheck pte_same under the PTL when doing
the final PTE update. I also find it strange that you hold the PTL while
calling task_numa_fault(). No way should the PTL have to protect structures
in kernel/sched and I wonder whether that was actually part of the reason
why you saw heavy PTL contention.

Basically if I felt that handling ptes in batch like this was of
critical important I would have implemented it very differently on top of
balancenuma. I would have only taken the PTL lock if updating the PTE to
keep contention down and redid racy checks under PTL, I'd have only used
trylock for every non-faulted PTE and I would only have migrated if it
was a remote->local copy. I certainly would not hold PTL while calling
task_numa_fault(). I would have kept the handling on a per-pmd basis when
it was expected that most PTEs underneath should be on the same node.
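
As a sketch of the locking discipline described above - redo the racy
check under the PTL, then drop the lock before calling
task_numa_fault(). The names follow the do_numa_page() code quoted
elsewhere in this thread, but this is the shape of the idea, not a
tested patch (the task_numa_fault() argument list also differs
between the trees being discussed):

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*ptep, orig_pte))) {	/* redo racy check under PTL */
		pte_unmap_unlock(ptep, ptl);
		goto out;
	}
	set_pte_at(mm, addr, ptep, pte_mknonnuma(orig_pte));
	update_mmu_cache(vma, addr, ptep);
	pte_unmap_unlock(ptep, ptl);			/* drop PTL first ...        */
	task_numa_fault(target_nid, 1);			/* ... then record the fault */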

> > [...] You would also know that tip/master testing for the last 
> > week was failing due to a boot problem (issue was in mainline 
> > not tip and has been already fixed) and would have known that 
> > since the -v18 release that numacore was effectively disabled 
> > on my test machine.
> 
> I'm glad it's fixed.
> 

Agreed.

> > Clearly you are not reading the bug 

Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Ingo Molnar

hi Srikar,

* Srikar Dronamraju  wrote:

> > 
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two cases, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> I see this failure when running with THP and KSM enabled on 
> Friday's Tip master. Not sure if Mel was talking about the same issue.
> 
> [ cut here ]
> kernel BUG at ../kernel/sched/fair.c:2371!

Could you check whether today's -tip (7ea8701a1a51 or later), 
plus the patch below, addresses the crash - while still giving 
good NUMA performance?

Thanks,

Ingo

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d11a8a..6a89787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2351,6 +2351,9 @@ void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, b
int priv;
int idx;
 
+   if (!p->numa_faults)
+   return;
+
if (last_cpupid != cpu_pid_to_cpupid(-1, -1)) {
/* Did we access it last time around? */
if (last_pid == this_pid) {


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > reasons. As it turns out, a printk() bodge showed that 
> > nr_cpus_allowed == 80 set in sched_init_smp() while 
> > num_online_cpus() == 48. This effectively disabled 
> > numacore. If you had responded to the bug report, this would 
> > likely have been found last Wednesday.
> 
> Does changing it from num_online_cpus() to num_possible_cpus() 
> help? (Can send a patch if you want.)

I.e. something like the patch below.

Thanks,

Ingo

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 503ec29..9d11a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2646,7 +2646,7 @@ static bool task_numa_candidate(struct task_struct *p)
 
/* Don't disturb hard-bound tasks: */
if (sched_feat(NUMA_EXCLUDE_AFFINE)) {
-   if (p->nr_cpus_allowed != num_online_cpus())
+   if (p->nr_cpus_allowed != num_possible_cpus())
return false;
}
 


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Ingo Molnar

* Mel Gorman  wrote:

> On Fri, Dec 07, 2012 at 12:01:13PM +0100, Ingo Molnar wrote:
> > 
> > * Mel Gorman  wrote:
> > 
> > > This is a full release of all the patches so apologies for the 
> > > flood. [...]
> > 
> > I have yet to process all your mails, but assuming I address all 
> > your review feedback and the latest unified tree in tip:master 
> > shows no regression in your testing, would you be willing to 
> > start using it for ongoing work?
> > 
> 
> Ingo,
> 
> If you had read the second paragraph of the mail you just responded to or
> the results at the end then you would have seen that I had problems with
> the performance. [...]

I've posted a (NUMA-placement sensitive workload centric) 
performance comparisons between "balancenuma", AutoNUMA and 
numa/core unified-v3 to:

   https://lkml.org/lkml/2012/12/7/331

I tried to address all performance regressions you and others 
have reported.

Here's the direct [bandwidth] comparison of 'balancenuma v10' to 
my -v3 tree:

                          balancenuma | NUMA-tip
 [test unit]:                    -v10 |      -v3

 2x1-bw-process : 6.136  |  9.647:  57.2%
 3x1-bw-process : 7.250  | 14.528: 100.4%
 4x1-bw-process : 6.867  | 18.903: 175.3%
 8x1-bw-process : 7.974  | 26.829: 236.5%
 8x1-bw-process-NOTHP   : 5.937  | 22.237: 274.5%
 16x1-bw-process: 5.592  | 29.294: 423.9%
 4x1-bw-thread  :13.598  | 19.290:  41.9%
 8x1-bw-thread  :16.356  | 26.391:  61.4%
 16x1-bw-thread :24.608  | 29.557:  20.1%
 32x1-bw-thread :25.477  | 30.232:  18.7%
 2x3-bw-thread  : 8.785  | 15.327:  74.5%
 4x4-bw-thread  : 6.366  | 27.957: 339.2%
 4x6-bw-thread  : 6.287  | 27.877: 343.4%
 4x8-bw-thread  : 5.860  | 28.439: 385.3%
 4x8-bw-thread-NOTHP: 6.167  | 25.067: 306.5%
 3x3-bw-thread  : 8.235  | 21.560: 161.8%
 5x5-bw-thread  : 5.762  | 26.081: 352.6%
 2x16-bw-thread : 5.920  | 23.269: 293.1%
 1x32-bw-thread : 5.828  | 18.985: 225.8%
 numa02-bw  :29.054  | 31.431:   8.2%
 numa02-bw-NOTHP:27.064  | 29.104:   7.5%
 numa01-bw-thread   :20.338  | 28.607:  40.7%
 numa01-bw-thread-NOTHP :18.528  | 21.119:  14.0%
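
( The right-hand column is consistent with the relative bandwidth
  gain of -v3 over -v10, e.g. for 2x1-bw-process:

	9.647 / 6.136 - 1 ~= 0.572 = 57.2% )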


I also tried to reproduce and fix as many bugs you reported as 
possible - but my point is that it would be _much_ better if we 
actually joined forces.

> [...] You would also know that tip/master testing for the last 
> week was failing due to a boot problem (issue was in mainline 
> not tip and has been already fixed) and would have known that 
> since the -v18 release that numacore was effectively disabled 
> on my test machine.

I'm glad it's fixed.

> Clearly you are not reading the bug reports you are receiving 
> and you're not seeing the small bit of review feedback or 
> answering the review questions you have received either. Why 
> would I be more forthcoming when I feel that it'll simply be 
> ignored? [...]

I am reading the bug reports and addressing bugs as I can.

> [...]  You simply assume that each batch of patches you place 
> on top must be fixing all known regressions and ignoring any 
> evidence to the contrary.
>
> If you had read my mail from last Tuesday you would even know 
> which patch was causing the problem that effectively disabled 
> numacore although not why. The comment about p->numa_faults 
> was completely off the mark (long journey, was tired, assumed 
> numa_faults was a counter and not a pointer which was 
> careless).  If you had called me on it then I would have 
> spotted the actual problem sooner. The problem was indeed with 
> the nr_cpus_allowed == num_online_cpus()s check which I had 
> pointed out was a suspicious check although for different 
> reasons. As it turns out, a printk() bodge showed that 
> nr_cpus_allowed == 80 set in sched_init_smp() while 
num_online_cpus() == 48. This effectively disabled numacore. 
> If you had responded to the bug report, this would likely have 
> been found last Wednesday.

Does changing it from num_online_cpus() to num_possible_cpus() 
help? (Can send a patch if you want.)

> > It would make it much easier for me to pick up your 
> > enhancements, fixes, etc.
> > 
> > > Changelog since V9
> > >   o Migration scalability (mingo)
> > 
> > To *really* see migration scalability bottlenecks you need to 
> > remove the migration-bandwidth throttling kludge from your tree 
> > (or configure it up very high if you want to do it simple).
> > 
> 
> Why is it a kludge? I already explained what the rationale 
> behind the rate limiting was. It's not about scalability, it's 
> 

Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Mel Gorman
On Mon, Dec 10, 2012 at 10:37:10AM +0530, Srikar Dronamraju wrote:
> > 
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two cases, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> I see this failure when running with THP and KSM enabled on 
> Friday's Tip master. Not sure if Mel was talking about the same issue.
> 
> [ cut here ]
> kernel BUG at ../kernel/sched/fair.c:2371!

I'm not, this is new to me. I grepped the console logs I have and the closest
I see is a WARN_ON triggered in numacore v17 which is no longer relevant.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-10 Thread Mel Gorman
On Sun, Dec 09, 2012 at 11:17:09PM +0200, Kirill A. Shutemov wrote:
> On Sun, Dec 09, 2012 at 08:36:31PM +, Mel Gorman wrote:
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two cases, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> Are you talking about huge zero page, right?
> 

No, this is happening in tip/master which does not include the huge zero
page work yet. AFAIK, that's still queued in Andrew's tree for the next
merge window. It is possible that there will be collisions between numa
balancing and the huge zero page work but it hasn't happened yet.

> I've fixed a race in huge zero page implementation recently[1]. Symptoms
> were similar -- SIGSEGV in JVM. The patch is in mmotm-2012-12-05-16-56 and
> later.
> 

It might be a similar class of bug.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-09 Thread Srikar Dronamraju
* Srikar Dronamraju  [2012-12-10 10:37:10]:

> > 
> > Either way, last night I applied a patch on top of latest tip/master to
> > remove the nr_cpus_allowed check so that numacore would be enabled again
> > and tested that. In some places it has indeed much improved. In others
> > it is still regressing badly and in two cases, it's corrupting memory --
> > specjbb when THP is enabled crashes when running for single or multiple
> > JVMs. It is likely that a zero page is being inserted due to a race with
> > migration and causes the JVM to throw a null pointer exception. Here is
> > the comparison on the rough off-chance you actually read it this time.
> 
> I see this failure when running with THP and KSM enabled on 
> Friday's Tip master. Not sure if Mel was talking about the same issue.
> 
 
It even occurs with !THP but KSM enabled.

> [ cut here ]
> kernel BUG at ../kernel/sched/fair.c:2371!
> invalid opcode:  [#1] SMP
> Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand 
> acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables 
> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack 
> ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt 
> iTCO_vendor_support kvm_intel kvm microcode cdc_ether usbnet mii serio_raw 
> i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 
> sg ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase 
> scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> CPU 4
> Pid: 116, comm: ksmd Not tainted 3.7.0-rc8-tip_master+ #5 IBM BladeCenter 
> HS22V -[7871AC1]-/81Y5995
> RIP: 0010:[]  [] 
> task_numa_fault+0x1a9/0x1e0
> RSP: 0018:880372237ba8  EFLAGS: 00010246
> RAX: 0074 RBX: 0001 RCX: 0001
> RDX: 12ae RSI: 0004 RDI: 7faf4fc01000
> RBP: 880372237be8 R08:  R09: 8803657463f0
> R10: 0001 R11: 0001 R12: 0012
> R13: 880372210d00 R14: 00010088 R15: 
> FS:  () GS:88037fc8() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 01d26fec CR3: 0169f000 CR4: 27e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process ksmd (pid: 116, threadinfo 880372236000, task 880372210d00)
> Stack:
>  ea0016026c58 7faf4fc0 880372237c48 0001
>  7faf4fc01000 ea000d6df928 0001 ea00166e9268
>  880372237c48 8113cd0e 88030001 0002
> Call Trace:
>  [] __do_numa_page+0xde/0x160
>  [] handle_pte_fault+0x32e/0xcd0
>  [] ? drop_large_spte+0x30/0x30 [kvm]
>  [] ? kvm_set_spte_hva+0x25/0x30 [kvm]
>  [] handle_mm_fault+0x279/0x760
>  [] break_ksm+0x74/0xa0
>  [] break_cow+0xa2/0xb0
>  [] ksm_scan_thread+0xb5c/0xd50
>  [] ? wake_up_bit+0x40/0x40
>  [] ? run_store+0x340/0x340
>  [] kthread+0xce/0xe0
>  [] ? kthread_freezable_should_stop+0x70/0x70
>  [] ret_from_fork+0x7c/0xb0
>  [] ? kthread_freezable_should_stop+0x70/0x70
> Code: 89 f0 41 bf 01 00 00 00 8b 1c 10 e9 d7 fe ff ff 8d 14 09 48 63 d2 eb bd 
> 66 2e 0f 1f 84 00 00 00 00 00 49 8b 85 98 07 00 00 eb 91 <0f> 0b eb fe 80 3d 
> 9c 3b 6b 00 01 0f 84 be fe ff ff be 42 09 00
> RIP  [] task_numa_fault+0x1a9/0x1e0
>  RSP 
> ---[ end trace 9584c9b03fc0dbc0 ]---
> 



Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-09 Thread Srikar Dronamraju
> 
> Either way, last night I applied a patch on top of latest tip/master to
> remove the nr_cpus_allowed check so that numacore would be enabled again
> and tested that. In some places it has indeed much improved. In others
> it is still regressing badly and in two cases, it's corrupting memory --
> specjbb when THP is enabled crashes when running for single or multiple
> JVMs. It is likely that a zero page is being inserted due to a race with
> migration and causes the JVM to throw a null pointer exception. Here is
> the comparison on the rough off-chance you actually read it this time.

I see this failure when running with THP and KSM enabled on 
Friday's Tip master. Not sure if Mel was talking about the same issue.

[ cut here ]
kernel BUG at ../kernel/sched/fair.c:2371!
invalid opcode:  [#1] SMP
Modules linked in: ebtable_nat ebtables autofs4 sunrpc cpufreq_ondemand 
acpi_cpufreq freq_table mperf bridge stp llc iptable_filter ip_tables 
ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack 
ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun iTCO_wdt 
iTCO_vendor_support kvm_intel kvm microcode cdc_ether usbnet mii serio_raw 
i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma i7core_edac edac_core bnx2 sg 
ixgbe dca mdio ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase 
scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
CPU 4
Pid: 116, comm: ksmd Not tainted 3.7.0-rc8-tip_master+ #5 IBM BladeCenter HS22V 
-[7871AC1]-/81Y5995
RIP: 0010:[]  [] task_numa_fault+0x1a9/0x1e0
RSP: 0018:880372237ba8  EFLAGS: 00010246
RAX: 0074 RBX: 0001 RCX: 0001
RDX: 12ae RSI: 0004 RDI: 7faf4fc01000
RBP: 880372237be8 R08:  R09: 8803657463f0
R10: 0001 R11: 0001 R12: 0012
R13: 880372210d00 R14: 00010088 R15: 
FS:  () GS:88037fc8() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 01d26fec CR3: 0169f000 CR4: 27e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process ksmd (pid: 116, threadinfo 880372236000, task 880372210d00)
Stack:
 ea0016026c58 7faf4fc0 880372237c48 0001
 7faf4fc01000 ea000d6df928 0001 ea00166e9268
 880372237c48 8113cd0e 88030001 0002
Call Trace:
 [] __do_numa_page+0xde/0x160
 [] handle_pte_fault+0x32e/0xcd0
 [] ? drop_large_spte+0x30/0x30 [kvm]
 [] ? kvm_set_spte_hva+0x25/0x30 [kvm]
 [] handle_mm_fault+0x279/0x760
 [] break_ksm+0x74/0xa0
 [] break_cow+0xa2/0xb0
 [] ksm_scan_thread+0xb5c/0xd50
 [] ? wake_up_bit+0x40/0x40
 [] ? run_store+0x340/0x340
 [] kthread+0xce/0xe0
 [] ? kthread_freezable_should_stop+0x70/0x70
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_freezable_should_stop+0x70/0x70
Code: 89 f0 41 bf 01 00 00 00 8b 1c 10 e9 d7 fe ff ff 8d 14 09 48 63 d2 eb bd 
66 2e 0f 1f 84 00 00 00 00 00 49 8b 85 98 07 00 00 eb 91 <0f> 0b eb fe 80 3d 9c 
3b 6b 00 01 0f 84 be fe ff ff be 42 09 00
RIP  [] task_numa_fault+0x1a9/0x1e0
 RSP 
---[ end trace 9584c9b03fc0dbc0 ]---



Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-09 Thread Kirill A. Shutemov
On Sun, Dec 09, 2012 at 08:36:31PM +, Mel Gorman wrote:
> Either way, last night I applied a patch on top of latest tip/master to
> remove the nr_cpus_allowed check so that numacore would be enabled again
> and tested that. In some places it has indeed much improved. In others
> it is still regressing badly and in two cases, it's corrupting memory --
> specjbb when THP is enabled crashes when running for single or multiple
> JVMs. It is likely that a zero page is being inserted due to a race with
> migration and causes the JVM to throw a null pointer exception. Here is
> the comparison on the rough off-chance you actually read it this time.

Are you talking about huge zero page, right?

I've fixed a race in huge zero page implementation recently[1]. Symptoms
were similar -- SIGSEGV in JVM. The patch is in mmotm-2012-12-05-16-56 and
later.

[1] http://lkml.org/lkml/2012/11/30/279
-- 
 Kirill A. Shutemov


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-07 Thread Ingo Molnar

* Mel Gorman  wrote:

> This is a full release of all the patches so apologies for the 
> flood. [...]

I have yet to process all your mails, but assuming I address all 
your review feedback and the latest unified tree in tip:master 
shows no regression in your testing, would you be willing to 
start using it for ongoing work?

It would make it much easier for me to pick up your 
enhancements, fixes, etc.

> Changelog since V9
>   o Migration scalability (mingo)

To *really* see migration scalability bottlenecks you need to 
remove the migration-bandwidth throttling kludge from your tree 
(or configure it up very high if you want to do it simple).

Some (certainly not all) of the performance regressions you 
reported were certainly due to numa/core code hitting the 
migration codepaths as aggressively as the workload demanded - 
and hitting scalability bottlenecks.

The right approach is to hit scalability bottlenecks and fix 
them.

Thanks,

Ingo