Re: mm: BUG in unmap_page_range

2014-09-17 Thread Sasha Levin
On 09/11/2014 06:38 PM, Sasha Levin wrote:
> On 09/11/2014 12:28 PM, Mel Gorman wrote:
>> > Agreed. If 3.17-rc4 looks stable with the VM_BUG_ON then it would be
>> > really nice if you could bisect 3.17-rc4 to linux-next carrying the
>> > VM_BUG_ON(!(val & _PAGE_PRESENT)) check at each bisection point. I'm not
>> > 100% sure if I'm seeing the same corruption as you or some other issue and
>> > do not want to conflate numerous different problems into one. I know this
>> > is a pain in the ass but if 3.17-rc4 looks stable then a bisection might
>> > be faster overall than my constant head scratching :(
> The good news is that 3.17-rc4 seems to be stable. I'll start the bisection,
> which I suspect would take several days. I'll update when I run into something.

I might need a bit of help here. The bisection is going sideways because I
can't reliably reproduce the issue.

We don't know what's causing this issue, but we know what the symptoms are. Is
there a VM_BUG_ON we could add somewhere so that it would be more likely to
trigger?


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: BUG in unmap_page_range

2014-09-11 Thread Sasha Levin
On 09/11/2014 12:28 PM, Mel Gorman wrote:
> Agreed. If 3.17-rc4 looks stable with the VM_BUG_ON then it would be
> really nice if you could bisect 3.17-rc4 to linux-next carrying the
> VM_BUG_ON(!(val & _PAGE_PRESENT)) check at each bisection point. I'm not
> 100% sure if I'm seeing the same corruption as you or some other issue and
> do not want to conflate numerous different problems into one. I know this
> is a pain in the ass but if 3.17-rc4 looks stable then a bisection might
> be faster overall than my constant head scratching :(

The good news is that 3.17-rc4 seems to be stable. I'll start the bisection,
which I suspect would take several days. I'll update when I run into something.


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-09-11 Thread Mel Gorman
On Thu, Sep 11, 2014 at 04:39:39AM -0700, Hugh Dickins wrote:
> On Wed, 10 Sep 2014, Sasha Levin wrote:
> > On 09/10/2014 03:36 PM, Hugh Dickins wrote:
> > > Right, and Sasha  reports that that can fire, but he sees the bug
> > > with this patch in and without that firing.
> > 
> > I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
> > VMA information out, and got the following:
> 
> Well, thanks, but Mel and I have both failed to perceive any actual
> problem arising from that peculiarity.  And Mel's warning, and the 900s
> in yesterday's dumps, have shown that it is not correlated with the
> pte_mknuma() bug we are chasing.  So there isn't anything that I want to
> look up in these vmas.  Or did you notice something interesting in them?
> 
> > And on a maybe related note, I've started seeing the following today. It may
> > be because we fixed mbind() in trinity but it could also be related to
> 
> The fixed trinity may be counter-productive for now, since we think
> there is an understandable pte_mknuma() bug coming from that direction,
> but have not posted a patch for it yet.
> 
> > this issue (free_pgtables() is in the call chain). If you don't think it has
> > anything to do with it let me know and I'll start a new thread:
> > 
> > [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at 
> >   (null)
> > [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 
> > lib/rbtree.c:229 lib/rbtree.c:367)
> > [ 1196.001744] Call Trace:
> > [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24)
> > [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232)
> > [ 1196.001744] unlink_file_vma (mm/mmap.c:246)
> > [ 1196.001744] free_pgtables (mm/memory.c:547)
> > [ 1196.001744] exit_mmap (mm/mmap.c:2826)
> > [ 1196.001744] mmput (kernel/fork.c:654)
> > [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 
> > kernel/exit.c:461 kernel/exit.c:746)
> 
> I didn't study in any detail, but this one seems much more like the
> zeroing and vma corruption that you've been seeing in other dumps.
> 

I didn't look through the dumps closely today because I spent the time
putting together a KVM setup similar to Sasha's (many cpus, fake NUMA,
etc) so I could run trinity in it in another attempt to reproduce this.
I did not encounter the same VM_BUG_ON unfortunately. However, trinity
itself crashed after 2.5 hours complaining

[watchdog] pid 32188 has disappeared. Reaping.
[watchdog] pid 32024 has disappeared. Reaping.
[watchdog] pid 32300 has disappeared. Reaping.
[watchdog] Sanity check failed! Found pid 0 at pidslot 35!

This did not happen when running on bare metal. This error makes me wonder
if it is evidence that there is zeroing corruption occurring when running
inside KVM. Another possibility is that it's somehow related to fake NUMA
although it's hard to see how. It's still possible the bug is with the
page table handling and KVM affects timing enough to cause problems so
I'm not ruling that out.

> Though a single pte_mknuma() crash could presumably be caused by vma
> corruption (but I think not mere zeroing), the recurrent way in which
> you hit that pte_mknuma() bug in particular makes it unlikely to be
> caused by random corruption.
> 
> You are generating new crashes faster than we can keep up with them.
> Would this be a suitable point for you to switch over to testing
> 3.17-rc, to see if that is as unstable for you as -next is?
> 
> That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree,
> but I think you can "safely" add it to 3.17-rc.  Quotes around
> "safely" meaning that we know that there's a bug to hit, at least
> in -next, but I don't think it's going to be hit for stupid obvious
> reasons.
> 

Agreed. If 3.17-rc4 looks stable with the VM_BUG_ON then it would be
really nice if you could bisect 3.17-rc4 to linux-next carrying the
VM_BUG_ON(!(val & _PAGE_PRESENT)) check at each bisection point. I'm not
100% sure if I'm seeing the same corruption as you or some other issue and
do not want to conflate numerous different problems into one. I know this
is a pain in the ass but if 3.17-rc4 looks stable then a bisection might
be faster overall than my constant head scratching :(

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: BUG in unmap_page_range

2014-09-11 Thread Dave Jones
On Thu, Sep 11, 2014 at 10:22:42AM -0400, Sasha Levin wrote:

 > > The fixed trinity may be counter-productive for now, since we think
 > > there is an understandable pte_mknuma() bug coming from that direction,
 > > but have not posted a patch for it yet.
 > 
 > I'm still seeing the bug with fixed trinity; it was a matter of adding more
 > flags to mbind.
 
What did I miss? Anything not in the MPOL_MF_VALID mask should be -EINVAL.

Dave



Re: mm: BUG in unmap_page_range

2014-09-11 Thread Sasha Levin
On 09/11/2014 07:39 AM, Hugh Dickins wrote:
> On Wed, 10 Sep 2014, Sasha Levin wrote:
>> On 09/10/2014 03:36 PM, Hugh Dickins wrote:
>>> Right, and Sasha  reports that that can fire, but he sees the bug
>>> with this patch in and without that firing.
>>
>> I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
>> VMA information out, and got the following:
> 
> Well, thanks, but Mel and I have both failed to perceive any actual
> problem arising from that peculiarity.  And Mel's warning, and the 900s
> in yesterday's dumps, have shown that it is not correlated with the
> pte_mknuma() bug we are chasing.  So there isn't anything that I want to
> look up in these vmas.  Or did you notice something interesting in them?

I thought this was a separate issue that would need taking care of as well.

>> And on a maybe related note, I've started seeing the following today. It may
>> be because we fixed mbind() in trinity but it could also be related to
> 
> The fixed trinity may be counter-productive for now, since we think
> there is an understandable pte_mknuma() bug coming from that direction,
> but have not posted a patch for it yet.

I'm still seeing the bug with fixed trinity; it was a matter of adding more
flags to mbind.

>> this issue (free_pgtables() is in the call chain). If you don't think it has
>> anything to do with it let me know and I'll start a new thread:
>>
>> [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at  
>>  (null)
>> [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 
>> lib/rbtree.c:229 lib/rbtree.c:367)
>> [ 1196.001744] Call Trace:
>> [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24)
>> [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232)
>> [ 1196.001744] unlink_file_vma (mm/mmap.c:246)
>> [ 1196.001744] free_pgtables (mm/memory.c:547)
>> [ 1196.001744] exit_mmap (mm/mmap.c:2826)
>> [ 1196.001744] mmput (kernel/fork.c:654)
>> [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 
>> kernel/exit.c:461 kernel/exit.c:746)
> 
> I didn't study in any detail, but this one seems much more like the
> zeroing and vma corruption that you've been seeing in other dumps.
> 
> Though a single pte_mknuma() crash could presumably be caused by vma
> corruption (but I think not mere zeroing), the recurrent way in which
> you hit that pte_mknuma() bug in particular makes it unlikely to be
> caused by random corruption.
> 
> You are generating new crashes faster than we can keep up with them.
> Would this be a suitable point for you to switch over to testing
> 3.17-rc, to see if that is as unstable for you as -next is?
> 
> That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree,
> but I think you can "safely" add it to 3.17-rc.  Quotes around
> "safely" meaning that we know that there's a bug to hit, at least
> in -next, but I don't think it's going to be hit for stupid obvious
> reasons.

I'll try it; usually I just hit a bunch of issues that were already fixed
in -next, which is why I try sticking to one tree.

> And you're using a gcc 5 these days?  That's another variable to
> try removing from the mix, to see if it makes a difference.

I'm seeing the BUG getting hit with 4.7.2, so I don't think it's compiler
dependent. I'll try reproducing everything I reported yesterday with 4.7.2
just in case, but I don't think that this is the issue.


Thanks,
Sasha



Re: mm: BUG in unmap_page_range

2014-09-11 Thread Hugh Dickins
On Wed, 10 Sep 2014, Sasha Levin wrote:
> On 09/10/2014 03:36 PM, Hugh Dickins wrote:
> > Right, and Sasha  reports that that can fire, but he sees the bug
> > with this patch in and without that firing.
> 
> I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
> VMA information out, and got the following:

Well, thanks, but Mel and I have both failed to perceive any actual
problem arising from that peculiarity.  And Mel's warning, and the 900s
in yesterday's dumps, have shown that it is not correlated with the
pte_mknuma() bug we are chasing.  So there isn't anything that I want to
look up in these vmas.  Or did you notice something interesting in them?

> And on a maybe related note, I've started seeing the following today. It may
> be because we fixed mbind() in trinity but it could also be related to

The fixed trinity may be counter-productive for now, since we think
there is an understandable pte_mknuma() bug coming from that direction,
but have not posted a patch for it yet.

> this issue (free_pgtables() is in the call chain). If you don't think it has
> anything to do with it let me know and I'll start a new thread:
> 
> [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 
> lib/rbtree.c:229 lib/rbtree.c:367)
> [ 1196.001744] Call Trace:
> [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24)
> [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232)
> [ 1196.001744] unlink_file_vma (mm/mmap.c:246)
> [ 1196.001744] free_pgtables (mm/memory.c:547)
> [ 1196.001744] exit_mmap (mm/mmap.c:2826)
> [ 1196.001744] mmput (kernel/fork.c:654)
> [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 
> kernel/exit.c:461 kernel/exit.c:746)

I didn't study in any detail, but this one seems much more like the
zeroing and vma corruption that you've been seeing in other dumps.

Though a single pte_mknuma() crash could presumably be caused by vma
corruption (but I think not mere zeroing), the recurrent way in which
you hit that pte_mknuma() bug in particular makes it unlikely to be
caused by random corruption.

You are generating new crashes faster than we can keep up with them.
Would this be a suitable point for you to switch over to testing
3.17-rc, to see if that is as unstable for you as -next is?

That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree,
but I think you can "safely" add it to 3.17-rc.  Quotes around
"safely" meaning that we know that there's a bug to hit, at least
in -next, but I don't think it's going to be hit for stupid obvious
reasons.

And you're using a gcc 5 these days?  That's another variable to
try removing from the mix, to see if it makes a difference.

Hugh


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 03:36 PM, Hugh Dickins wrote:
>> migrate: debug patch to try identify race between migration completion and 
>> mprotect
>> > 
>> > A migration entry is marked as write if pte_write was true at the
>> > time the entry was created. The VMA protections are not double checked
>> > when migration entries are being removed but mprotect itself will mark
>> > write-migration-entries as read to avoid problems. It means we potentially
>> > take a spurious fault to mark these ptes write again but otherwise it's
>> > harmless.  Still, one dump indicates that this situation can actually
>> > happen so this debugging patch spits out a warning if the situation occurs
>> > and hopefully the resulting warning will contain a clue as to how exactly
>> > it happens
>> > 
>> > Not-signed-off
>> > ---
>> >  mm/migrate.c | 12 ++--
>> >  1 file changed, 10 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/mm/migrate.c b/mm/migrate.c
>> > index 09d489c..631725c 100644
>> > --- a/mm/migrate.c
>> > +++ b/mm/migrate.c
>> > @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, 
>> > struct vm_area_struct *vma,
>> >pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>> >if (pte_swp_soft_dirty(*ptep))
>> >pte = pte_mksoft_dirty(pte);
>> > -  if (is_write_migration_entry(entry))
>> > -  pte = pte_mkwrite(pte);
>> > +  if (is_write_migration_entry(entry)) {
>> > +  /*
>> > +   * This WARN_ON_ONCE is temporary for the purposes of seeing if
>> > +   * it's a case encountered by trinity in Sasha's testing
>> > +   */
>> > +  if (!(vma->vm_flags & (VM_WRITE)))
>> > +  WARN_ON_ONCE(1);
>> > +  else
>> > +  pte = pte_mkwrite(pte);
>> > +  }
>> >  #ifdef CONFIG_HUGETLB_PAGE
>> >if (PageHuge(new)) {
>> >pte = pte_mkhuge(pte);
>> > 
> Right, and Sasha  reports that that can fire, but he sees the bug
> with this patch in and without that firing.

I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful VMA
information out, and got the following:

[ 4018.870776] vma 8801a0f1e800 start 7f3fd0ca7000 end 7f3fd16a7000
[ 4018.870776] next 8804e1b89800 prev 88008cd9a000 mm 88054b17d000
[ 4018.870776] prot 120 anon_vma 880bc858a200 vm_ops   (null)
[ 4018.870776] pgoff 41bc8 file   (null) private_data   (null)
[ 4018.879731] flags: 0x8100070(mayread|maywrite|mayexec|account)
[ 4018.881324] [ cut here ]
[ 4018.882612] kernel BUG at mm/migrate.c:155!
[ 4018.883649] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 4018.889647] Dumping ftrace buffer:
[ 4018.890323](ftrace buffer empty)
[ 4018.890323] Modules linked in:
[ 4018.890323] CPU: 4 PID: 9966 Comm: trinity-main Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00042-ga4bad9b-dirty #1140
[ 4018.890323] task: 880695b83000 ti: 880560c44000 task.ti: 
880560c44000
[ 4018.890323] RIP: 0010:[]  [] 
remove_migration_pte+0x3e1/0x3f0
[ 4018.890323] RSP: :880560c477c8  EFLAGS: 00010292
[ 4018.890323] RAX: 0001 RBX: 7f3fd129b000 RCX: 
[ 4018.890323] RDX: 0001 RSI: 9e4ba395 RDI: 0001
[ 4018.890323] RBP: 880560c47800 R08: 0001 R09: 0001
[ 4018.890323] R10: 00045401 R11: 0001 R12: 8801a0f1e800
[ 4018.890323] R13: 88054b17d000 R14: ea000478eb40 R15: 880122bcf070
[ 4018.890323] FS:  7f3fd55bb700() GS:8803d6a0() 
knlGS:
[ 4018.890323] CS:  0010 DS:  ES:  CR0: 8005003b
[ 4018.890323] CR2: 00fcbca8 CR3: 000561bab000 CR4: 06a0
[ 4018.890323] DR0: 006f DR1:  DR2: 
[ 4018.890323] DR3:  DR6: 0ff0 DR7: 0600
[ 4018.890323] Stack:
[ 4018.890323]  ea00046ed980 88011079c4d8 ea000478eb40 
880560c47858
[ 4018.890323]  88019fde0330 000421bc 8801a0f1e800 
880560c47848
[ 4018.890323]  9b2d1b0f 880bc858a200 880560c47850 
ea000478eb40
[ 4018.890323] Call Trace:
[ 4018.890323]  [] rmap_walk+0x22f/0x380
[ 4018.890323]  [] remove_migration_ptes+0x41/0x50
[ 4018.890323]  [] ? 
__migration_entry_wait.isra.24+0x160/0x160
[ 4018.890323]  [] ? remove_migration_pte+0x3f0/0x3f0
[ 4018.890323]  [] move_to_new_page+0x16b/0x230
[ 4018.890323]  [] ? try_to_unmap+0x6c/0xf0
[ 4018.890323]  [] ? try_to_unmap_nonlinear+0x5c0/0x5c0
[ 4018.890323]  [] ? invalid_migration_vma+0x30/0x30
[ 4018.890323]  [] ? page_remove_rmap+0x320/0x320
[ 4018.890323]  [] migrate_pages+0x85c/0x930
[ 4018.890323]  [] ? isolate_freepages_block+0x410/0x410
[ 4018.890323]  [] ? arch_local_save_flags+0x30/0x30
[ 4018.890323]  [] compact_zone+0x4d3/0x8a0
[ 4018.890323]  [] compact_zone_order+0x5f/0xa0
[ 4018.890323]  [] try_to_compact_pages+0x127/0x2f0
[ 

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Hugh Dickins
On Wed, 10 Sep 2014, Sasha Levin wrote:
> On 09/10/2014 03:09 PM, Hugh Dickins wrote:
> > Thanks for supplying, but the change in inlining means that
> > change_protection_range() and change_protection() are no longer
> > relevant for these traces, we now need to see change_pte_range()
> > instead, to confirm that what I expect are ptes are indeed ptes.
> > 
> > If you can include line numbers (objdump -ld) in the disassembly, so
> > much the better, but should be decipherable without.  (Or objdump -Sd
> > for source, but I often find that harder to unscramble, can't say why.)
> 
> Here it is. Note that the source includes both of Mel's debug patches.
> For reference, here's one trace of the issue with those patches:
> 
> [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724!
> [ 3114.541857] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 3114.543112] Dumping ftrace buffer:
> [ 3114.544056](ftrace buffer empty)
> [ 3114.545000] Modules linked in:
> [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW  
> 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
> [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 
> 88076f584000
> [ 3114.549284] RIP: 0010:[]  [] 
> change_pte_range+0x4ea/0x4f0
> [ 3114.550028] RSP: :88076f587d68  EFLAGS: 00010246
> [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 
> 0100
> [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 
> 000314625900
> [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 
> 00b5
> [ 3114.550028] R10: 00032c01 R11: 0008 R12: 
> 8802a81070c0
> [ 3114.550028] R13: 8025 R14: 41343000 R15: 
> cfff
> [ 3114.550028] FS:  7fabb91c8700() GS:88025ec0() 
> knlGS:
> [ 3114.550028] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 
> 06a0
> [ 3114.550028] DR0: 006f DR1:  DR2: 
> 
> [ 3114.550028] DR3:  DR6: 0ff0 DR7: 
> 00050602
> [ 3114.550028] Stack:
> [ 3114.550028]  0001 000314625900 0018 
> 8802685f2260
> [ 3114.550028]  1684 8802cf973600 88061684 
> 41343000
> [ 3114.550028]  880108805048 41005000 4120 
> 41343000
> [ 3114.550028] Call Trace:
> [ 3114.550028]  [] change_protection+0x2b4/0x4e0
> [ 3114.550028]  [] change_prot_numa+0x1b/0x40
> [ 3114.550028]  [] task_numa_work+0x1f6/0x330
> [ 3114.550028]  [] task_work_run+0xc4/0xf0
> [ 3114.550028]  [] do_notify_resume+0x97/0xb0
> [ 3114.550028]  [] int_signal+0x12/0x17
> [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff 
> ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 
> 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
> [ 3114.550028] RIP  [] change_pte_range+0x4ea/0x4f0
> [ 3114.550028]  RSP 
> 
> And the disassembly:
...
> /home/sasha/linux-next/mm/mprotect.c:105
>  31d: 48 8b 4d a8 mov-0x58(%rbp),%rcx
>  321: 81 e1 01 03 00 00   and$0x301,%ecx
>  327: 48 81 f9 00 02 00 00cmp$0x200,%rcx
>  32e: 0f 84 0b ff ff ff   je 23f 
> pte_val():
> /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450
>  334: 48 83 3d 00 00 00 00cmpq   $0x0,0x0(%rip)# 33c 
> 
>  33b: 00
>   337: R_X86_64_PC32  pv_mmu_ops+0xe3
> ptep_set_numa():
> /home/sasha/linux-next/include/asm-generic/pgtable.h:740
>  33c: 49 8b 3c 24 mov(%r12),%rdi
> pte_val():
> /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450
>  340: 0f 84 12 01 00 00   je 458 
>  346: ff 14 25 00 00 00 00callq  *0x0
>   349: R_X86_64_32S   pv_mmu_ops+0xe8
> pte_mknuma():
> /home/sasha/linux-next/include/asm-generic/pgtable.h:724
>  34d: a8 01   test   $0x1,%al
>  34f: 0f 84 95 01 00 00   je 4ea 
...
> ptep_set_numa():
> /home/sasha/linux-next/include/asm-generic/pgtable.h:724
>  4ea: 0f 0b   ud2

Thanks, yes, there is enough in there to be sure that the ...900 is
indeed the oldpte.  I wasn't expecting that pv_mmu_ops function call,
but there's no evidence that it does anything worse than just return
in %rax what it's given in %rdi; and the second long on the stack is
the -0x58(%rbp) from which oldpte is retrieved for !pte_numa(oldpte)
at the beginning of the extract above.

Hugh


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 03:09 PM, Hugh Dickins wrote:
> Thanks for supplying, but the change in inlining means that
> change_protection_range() and change_protection() are no longer
> relevant for these traces, we now need to see change_pte_range()
> instead, to confirm that what I expect are ptes are indeed ptes.
> 
> If you can include line numbers (objdump -ld) in the disassembly, so
> much the better, but should be decipherable without.  (Or objdump -Sd
> for source, but I often find that harder to unscramble, can't say why.)

Here it is. Note that the source includes both of Mel's debug patches.
For reference, here's one trace of the issue with those patches:

[ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724!
[ 3114.541857] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3114.543112] Dumping ftrace buffer:
[ 3114.544056](ftrace buffer empty)
[ 3114.545000] Modules linked in:
[ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
[ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 
88076f584000
[ 3114.549284] RIP: 0010:[]  [] 
change_pte_range+0x4ea/0x4f0
[ 3114.550028] RSP: :88076f587d68  EFLAGS: 00010246
[ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100
[ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900
[ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5
[ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0
[ 3114.550028] R13: 8025 R14: 41343000 R15: cfff
[ 3114.550028] FS:  7fabb91c8700() GS:88025ec0() 
knlGS:
[ 3114.550028] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0
[ 3114.550028] DR0: 006f DR1:  DR2: 
[ 3114.550028] DR3:  DR6: 0ff0 DR7: 00050602
[ 3114.550028] Stack:
[ 3114.550028]  0001 000314625900 0018 
8802685f2260
[ 3114.550028]  1684 8802cf973600 88061684 
41343000
[ 3114.550028]  880108805048 41005000 4120 
41343000
[ 3114.550028] Call Trace:
[ 3114.550028]  [] change_protection+0x2b4/0x4e0
[ 3114.550028]  [] change_prot_numa+0x1b/0x40
[ 3114.550028]  [] task_numa_work+0x1f6/0x330
[ 3114.550028]  [] task_work_run+0xc4/0xf0
[ 3114.550028]  [] do_notify_resume+0x97/0xb0
[ 3114.550028]  [] int_signal+0x12/0x17
[ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff 
ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 
0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
[ 3114.550028] RIP  [] change_pte_range+0x4ea/0x4f0
[ 3114.550028]  RSP 

And the disassembly:

 :
change_pte_range():
/home/sasha/linux-next/mm/mprotect.c:70
   0:   e8 00 00 00 00  callq  5 
1: R_X86_64_PC32__fentry__-0x4
   5:   55  push   %rbp
   6:   48 89 e5mov%rsp,%rbp
   9:   41 57   push   %r15
   b:   41 56   push   %r14
   d:   49 89 cemov%rcx,%r14
  10:   41 55   push   %r13
  12:   4d 89 c5mov%r8,%r13
  15:   41 54   push   %r12
  17:   49 89 f4mov%rsi,%r12
  1a:   53  push   %rbx
  1b:   48 89 d3mov%rdx,%rbx
  1e:   48 83 ec 38 sub$0x38,%rsp
/home/sasha/linux-next/mm/mprotect.c:71
  22:   48 8b 47 40 mov0x40(%rdi),%rax
/home/sasha/linux-next/mm/mprotect.c:70
  26:   48 89 7d c8 mov%rdi,-0x38(%rbp)
lock_pte_protection():
/home/sasha/linux-next/mm/mprotect.c:53
  2a:   8b 4d 10mov0x10(%rbp),%ecx
change_pte_range():
/home/sasha/linux-next/mm/mprotect.c:70
  2d:   44 89 4d c4 mov%r9d,-0x3c(%rbp)
/home/sasha/linux-next/mm/mprotect.c:71
  31:   48 89 45 d0 mov%rax,-0x30(%rbp)
lock_pte_protection():
/home/sasha/linux-next/mm/mprotect.c:53
  35:   85 c9   test   %ecx,%ecx
  37:   0f 84 6b 03 00 00   je 3a8 
pmd_to_page():
/home/sasha/linux-next/include/linux/mm.h:1538
  3d:   48 89 f7mov%rsi,%rdi
  40:   48 81 e7 00 f0 ff ffand$0xf000,%rdi
  47:   e8 00 00 00 00  callq  4c 
48: R_X86_64_PC32   __phys_addr-0x4
  4c:   48 ba 00 00 00 00 00movabs $0xea00,%rdx
  53:   ea ff ff
  56:   48 c1 e8 0c shr$0xc,%rax
spin_lock():
/home/sasha/linux-next/include/linux/spinlock.h:309
  5a:   48 89 55 b8 mov%rdx,-0x48(%rbp)
  5e:   48 c1 e0 06 shl$0x6,%rax
  62:   4c 8b 7c 10 30  mov

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Hugh Dickins
On Wed, 10 Sep 2014, Mel Gorman wrote:
> On Tue, Sep 09, 2014 at 07:45:26PM -0700, Hugh Dickins wrote:
> > 
> > I've been rather assuming that the 9d340902 seen in many of the
> > registers in that Aug26 dump is the pte val in question: that's
> > SOFT_DIRTY|PROTNONE|RW.

The 900s in the latest dumps imply that that 902 was not important.
(If any of them are in fact the pte val.)

> > 
> > I think RW on PROTNONE is unusual but not impossible (migration entry
> > replacement racing with mprotect setting PROT_NONE, after it's updated
> > vm_page_prot, before it's reached the page table). 
> 
> At the risk of sounding thick, I need to spell this out because I'm
> having trouble seeing exactly what race you are thinking of. 
> 
> Migration entry replacement is protected against parallel NUMA hinting
> updates by the page table lock (either PMD or PTE level). It's taken by
> remove_migration_pte on one side and lock_pte_protection on the other.
> 
> For the mprotect case racing against migration, migration entries are not

> present so change_pte_range() should ignore it. On migration completion
> the VMA flags determine the permissions of the new PTE. Parallel faults
> wait on the migration entry and see the correct value afterwards.
> 
> When creating migration entries, try_to_unmap calls page_check_address
> which takes the PTL before doing anything. On the mprotect side,
> lock_pte_protection will block before seeing PROTNONE.
> 
> I think the race you are thinking of is a migration entry created for write,
> parallel mprotect(PROTNONE) and migration completion. The migration entry
> was created for write but remove_migration_pte does not double check the VMA
> protections and mmap_sem is not taken for write across a full migration to
> protect against changes to vm_page_prot.

Yes, the "if (is_write_migration_entry(entry)) pte = pte_mkwrite(pte);"
arguably should take the latest value of vma->vm_page_prot into account.

> However, change_pte_range checks
> for migration entries marked for write under the PTL and marks them read if
> one is encountered. The consequence is that we potentially take a spurious
> fault to mark the PTE write again after migration completes but I can't
> see how that causes a problem as such.

Yes, once mprotect's page table walk reaches that pte, it updates it
correctly along with all the others nearby (which were not migrated),
removing the temporary oddity.

> 
> I'm missing some part of your reasoning that leads to the RW|PROTNONE :(

You don't appear to be missing it at all, you are seeing the possibility
of an RW|PROTNONE yourself, and how it gets "corrected" afterwards
("corrected" in quotes because without the present bit, it's not an error).

> 
> > But exciting though
> > that line of thought is, I cannot actually bring it to a pte_mknuma bug,
> > or any bug at all.
> > 

And I wasn't saying that it led to this bug, just that it was an oddity
worth thinking about, and worth mentioning to you, in case you could work
out a way it might lead to the bug, when I had failed to do so.

But we now (almost) know that 902 is irrelevant to this bug anyway.

> 
> On x86, PROTNONE|RW translates as GLOBAL|RW which would be unexpected. It

GLOBAL once PRESENT is set, but PROTNONE so long as it is not.

> wouldn't cause this bug but it's sufficiently suspicious to be worth
> correcting. In case this is the race you're thinking of, the patch is below.
> Unfortunately, I cannot see how it would affect this problem but worth
> giving a whirl anyway.
> 
> > Mel, no way can it be the cause of this bug - unless Sasha's later
> > traces actually show a different stack - but I don't see the call
> > to change_prot_numa() from queue_pages_range() sharing the same
> > avoidance of PROT_NONE that task_numa_work() has (though it does
> > have an outdated comment about PROT_NONE which should be removed).
> > So I think that site probably does need PROT_NONE checking added.
> > 
> 
> That site should have checked PROT_NONE but it can't be the same bug
> that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
> according to git grep of the trinity source.

Yes, queue_pages_range() is not implicated in any of Sasha's traces.
Something to fix, but not relevant to this bug.

> 
> Worth adding this to the debugging mix? It should warn if it encounters
> the problem but avoid adding the problematic RW bit.
> 
> ---8<---
> migrate: debug patch to try identify race between migration completion and 
> mprotect
> 
> A migration entry is marked as write if pte_write was true at the
> time the entry was created. The VMA protections are not double checked
> when migration entries are being removed but mprotect itself will mark
> write-migration-entries as read to avoid problems. It means we potentially
> take a spurious fault to mark these ptes write again but otherwise it's
> harmless.  Still, one dump indicates that this situation can actually
> happen so this debugging patch spits out a warning if the situation occurs
> and hopefully the resulting warning will contain a clue as to how exactly
> it happens

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Hugh Dickins
On Wed, 10 Sep 2014, Sasha Levin wrote:
> On 09/09/2014 10:45 PM, Hugh Dickins wrote:
> > Sasha, you say you're getting plenty of these now, but I've only seen
> > the dump for one of them, on Aug26: please post a few more dumps, so
> > that we can look for commonality.
> 
> I wasn't saving older logs for this issue so I only have 2 traces from
> tonight. If that's not enough please let me know and I'll try to add
> a few more.

Thanks, these two are useful, mainly because the register contents most
likely to be ptes are in both of these ...900, with no sign of a ...902.

So the RW bit I got excited about yesterday is clearly not necessary for
the bug (though it's still possible that it was good for implicating page
migration, and page migration still play a part in the story).

> > And please attach a disassembly of change_protection_range() (noting
> > which of the dumps it corresponds to, in case it has changed around):
> > "Code" just shows a cluster of ud2s for the unlikely bugs at end of the
> > function, we cannot tell at all what should be in the registers by then.
> 
> change_protection_range() got inlined into change_protection(), it applies to
> both traces above:

Thanks for supplying, but the change in inlining means that
change_protection_range() and change_protection() are no longer
relevant for these traces, we now need to see change_pte_range()
instead, to confirm that what I expect are ptes are indeed ptes.

If you can include line numbers (objdump -ld) in the disassembly, so
much the better, but should be decipherable without.  (Or objdump -Sd
for source, but I often find that harder to unscramble, can't say why.)

Thanks,
Hugh


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 08:47 AM, Mel Gorman wrote:
> migrate: debug patch to try identify race between migration completion and 
> mprotect
> 
> A migration entry is marked as write if pte_write was true at the
> time the entry was created. The VMA protections are not double checked
> when migration entries are being removed but mprotect itself will mark
> write-migration-entries as read to avoid problems. It means we potentially
> take a spurious fault to mark these ptes write again but otherwise it's
> harmless.  Still, one dump indicates that this situation can actually
> happen so this debugging patch spits out a warning if the situation occurs
> and hopefully the resulting warning will contain a clue as to how exactly
> it happens
> 
> Not-signed-off
> ---
>  mm/migrate.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 09d489c..631725c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct 
> vm_area_struct *vma,
>   pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>   if (pte_swp_soft_dirty(*ptep))
>   pte = pte_mksoft_dirty(pte);
> - if (is_write_migration_entry(entry))
> - pte = pte_mkwrite(pte);
> + if (is_write_migration_entry(entry)) {
> + /*
> +  * This WARN_ON_ONCE is temporary for the purposes of seeing if
> +  * it's a case encountered by trinity in Sasha's testing
> +  */
> + if (!(vma->vm_flags & (VM_WRITE)))
> + WARN_ON_ONCE(1);
> + else
> + pte = pte_mkwrite(pte);
> + }
>  #ifdef CONFIG_HUGETLB_PAGE
>   if (PageHuge(new)) {
>   pte = pte_mkhuge(pte);

I seem to have hit this warning:

[ 4782.617806] WARNING: CPU: 10 PID: 21180 at mm/migrate.c:155 
remove_migration_pte+0x3f7/0x420()
[ 4782.619315] Modules linked in:
[ 4782.622189]
[ 4782.622501] CPU: 10 PID: 21180 Comm: trinity-main Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
[ 4782.624344]  0009 8800193eb770 a04c742a 

[ 4782.627801]  8800193eb7a8 9d16e55d 7f2458d89000 
880120959600
[ 4782.629283]  88012b02c000 ea002abeab00 88063118da90 
8800193eb7b8
[ 4782.631353] Call Trace:
[ 4782.633789]  [] dump_stack+0x4e/0x7a
[ 4782.634314]  [] warn_slowpath_common+0x7d/0xa0
[ 4782.634877]  [] warn_slowpath_null+0x1a/0x20
[ 4782.635430]  [] remove_migration_pte+0x3f7/0x420
[ 4782.636042]  [] rmap_walk+0xef/0x380
[ 4782.636544]  [] remove_migration_ptes+0x41/0x50
[ 4782.637130]  [] ? 
__migration_entry_wait.isra.24+0x160/0x160
[ 4782.639928]  [] ? remove_migration_pte+0x420/0x420
[ 4782.640616]  [] move_to_new_page+0x16b/0x230
[ 4782.641251]  [] ? try_to_unmap+0x6c/0xf0
[ 4782.643950]  [] ? try_to_unmap_nonlinear+0x5c0/0x5c0
[ 4782.644690]  [] ? invalid_migration_vma+0x30/0x30
[ 4782.645273]  [] ? page_remove_rmap+0x320/0x320
[ 4782.646072]  [] migrate_pages+0x85c/0x930
[ 4782.646701]  [] ? isolate_freepages_block+0x410/0x410
[ 4782.647407]  [] ? arch_local_save_flags+0x30/0x30
[ 4782.648114]  [] compact_zone+0x4d3/0x8a0
[ 4782.650157]  [] compact_zone_order+0x5f/0xa0
[ 4782.651014]  [] try_to_compact_pages+0x127/0x2f0
[ 4782.651656]  [] __alloc_pages_direct_compact+0x68/0x200
[ 4782.652313]  [] __alloc_pages_nodemask+0x99a/0xd90
[ 4782.652916]  [] alloc_pages_vma+0x13c/0x270
[ 4782.653618]  [] ? do_huge_pmd_wp_page+0x494/0xc90
[ 4782.654487]  [] do_huge_pmd_wp_page+0x494/0xc90
[ 4782.656045]  [] ? __mem_cgroup_count_vm_event+0xd0/0x240
[ 4782.657089]  [] handle_mm_fault+0x8bd/0xc50
[ 4782.660931]  [] ? __lock_is_held+0x56/0x80
[ 4782.662695]  [] __do_page_fault+0x1b7/0x660
[ 4782.663259]  [] ? put_lock_stats.isra.13+0xe/0x30
[ 4782.663851]  [] ? vtime_account_user+0x91/0xa0
[ 4782.664419]  [] ? context_tracking_user_exit+0xb5/0x1b0
[ 4782.665119]  [] ? __this_cpu_preempt_check+0x13/0x20
[ 4782.665969]  [] ? trace_hardirqs_off_caller+0xe2/0x1b0
[ 4782.34]  [] trace_do_page_fault+0x51/0x2b0
[ 4782.667257]  [] do_async_page_fault+0x63/0xd0
[ 4782.667871]  [] async_page_fault+0x28/0x30

Although it wasn't followed by anything else, and I've seen the original issue
getting triggered without this WARN showing up, so it seems like a different,
unrelated issue?


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 09:40 AM, Mel Gorman wrote:
> On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote:
>> 
>>
>> I've spotted a new trace in overnight fuzzing, it could be related to this 
>> issue:
>>
>> [ 3494.324839] general protection fault:  [#1] PREEMPT SMP 
>> DEBUG_PAGEALLOC
>> [ 3494.332153] Dumping ftrace buffer:
>> [ 3494.332153](ftrace buffer empty)
>> [ 3494.332153] Modules linked in:
>> [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted 
>> 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135
>> [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: 
>> 8804d491c000
>> [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 
>> kernel/sched/fair.c:1956)
>> [ 3494.332153] RSP: :8804d491feb8  EFLAGS: 00010206
>> [ 3494.332153] RAX:  RBX: 8804bf4e8000 RCX: 
>> e8e8
>> [ 3494.343974] RDX: 000a RSI:  RDI: 
>> 8804bd6d4da8
>> [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: 
>> 
>> [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: 
>> 7f53e443c000
>> [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: 
>> 88047e52b000
>> [ 3494.343974] FS:  7f53e463f700() GS:880277e0() 
>> knlGS:
>> [ 3494.343974] CS:  0010 DS:  ES:  CR0: 8005003b
>> [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: 
>> 06a0
>> [ 3494.369895] DR0: 006f DR1:  DR2: 
>> 
>> [ 3494.369895] DR3:  DR6: 0ff0 DR7: 
>> 0600
>> [ 3494.380081] Stack:
>> [ 3494.380081]  8804bf4e80a8 0014 7f53e4437000 
>> 
>> [ 3494.380081]  9b976e70 88047e52bbd8 88047e52b000 
>> 
>> [ 3494.380081]  8804d491ff28 95193d84 0002 
>> 8804d491ff58
>> [ 3494.380081] Call Trace:
>> [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1))
>> [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 
>> arch/x86/kernel/signal.c:758)
>> [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918)
>> [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 
>> c6 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 
>> <49> f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00
> 
> Shot in the dark, can you test this please? Pagetable teardown can schedule
> and I'm wondering if we are trying to add hinting faults to an address
> space that is in the process of going away. The TASK_DEAD check is bogus
> so replacing it.

Mel, I ran today's -next with both of your patches, but the issue still remains:

[ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724!
[ 3114.541857] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3114.543112] Dumping ftrace buffer:
[ 3114.544056](ftrace buffer empty)
[ 3114.545000] Modules linked in:
[ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
[ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 
88076f584000
[ 3114.549284] RIP: 0010:[]  [] 
change_pte_range+0x4ea/0x4f0
[ 3114.550028] RSP: :88076f587d68  EFLAGS: 00010246
[ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100
[ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900
[ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5
[ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0
[ 3114.550028] R13: 8025 R14: 41343000 R15: cfff
[ 3114.550028] FS:  7fabb91c8700() GS:88025ec0() 
knlGS:
[ 3114.550028] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0
[ 3114.550028] DR0: 006f DR1:  DR2: 
[ 3114.550028] DR3:  DR6: 0ff0 DR7: 00050602
[ 3114.550028] Stack:
[ 3114.550028]  0001 000314625900 0018 
8802685f2260
[ 3114.550028]  1684 8802cf973600 88061684 
41343000
[ 3114.550028]  880108805048 41005000 4120 
41343000
[ 3114.550028] Call Trace:
[ 3114.550028]  [] change_protection+0x2b4/0x4e0
[ 3114.550028]  [] change_prot_numa+0x1b/0x40
[ 3114.550028]  [] task_numa_work+0x1f6/0x330
[ 3114.550028]  [] task_work_run+0xc4/0xf0
[ 3114.550028]  [] do_notify_resume+0x97/0xb0
[ 3114.550028]  [] int_signal+0x12/0x17
[ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff 
ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 
0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
[ 3114.550028] RIP  [] 

Re: Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)

2014-09-10 Thread Dave Jones
On Wed, Sep 10, 2014 at 10:24:40AM -0400, Sasha Levin wrote:
 > On 09/10/2014 08:47 AM, Mel Gorman wrote:
 > > That site should have checked PROT_NONE but it can't be the same bug
 > > that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
 > > according to git grep of the trinity source.
 > 
 > Actually, if I'm reading it correctly I think that Trinity handles mbind()
 > calls wrong. It passes the wrong values for mode flags and actual flags.

Ugh, I think you're right.  I misinterpreted the man page that mentions
that flags like MPOL_F_STATIC_NODES/RELATIVE_NODES are OR'd with the
mode, and instead dumped those flags into .. the flags field.

So the 'flags' argument it generates is crap, because I didn't add
any of the actual correct values.

I'll fix it up, though if it's currently finding bugs, you might want
to keep the current syscalls/mbind.c for now.

Dave



Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)

2014-09-10 Thread Sasha Levin
On 09/10/2014 08:47 AM, Mel Gorman wrote:
> That site should have checked PROT_NONE but it can't be the same bug
> that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
> according to git grep of the trinity source.

Actually, if I'm reading it correctly I think that Trinity handles mbind()
calls wrong. It passes the wrong values for mode flags and actual flags.


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Mel Gorman
On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote:
> 
> 
> I've spotted a new trace in overnight fuzzing, it could be related to this 
> issue:
> 
> [ 3494.324839] general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 3494.332153] Dumping ftrace buffer:
> [ 3494.332153](ftrace buffer empty)
> [ 3494.332153] Modules linked in:
> [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted 
> 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135
> [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: 
> 8804d491c000
> [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 
> kernel/sched/fair.c:1956)
> [ 3494.332153] RSP: :8804d491feb8  EFLAGS: 00010206
> [ 3494.332153] RAX:  RBX: 8804bf4e8000 RCX: 
> e8e8
> [ 3494.343974] RDX: 000a RSI:  RDI: 
> 8804bd6d4da8
> [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: 
> 
> [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: 
> 7f53e443c000
> [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: 
> 88047e52b000
> [ 3494.343974] FS:  7f53e463f700() GS:880277e0() 
> knlGS:
> [ 3494.343974] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: 
> 06a0
> [ 3494.369895] DR0: 006f DR1:  DR2: 
> 
> [ 3494.369895] DR3:  DR6: 0ff0 DR7: 
> 0600
> [ 3494.380081] Stack:
> [ 3494.380081]  8804bf4e80a8 0014 7f53e4437000 
> 
> [ 3494.380081]  9b976e70 88047e52bbd8 88047e52b000 
> 
> [ 3494.380081]  8804d491ff28 95193d84 0002 
> 8804d491ff58
> [ 3494.380081] Call Trace:
> [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1))
> [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 
> arch/x86/kernel/signal.c:758)
> [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918)
> [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 c6 
> 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 <49> 
> f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00

Shot in the dark, can you test this please? Pagetable teardown can schedule
and I'm wondering if we are trying to add hinting faults to an address
space that is in the process of going away. The TASK_DEAD check is bogus
so replacing it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ea6006..007fc1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1810,7 +1810,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int 
pages, int flags)
return;
 
/* Do not worry about placement if exiting */
-   if (p->state == TASK_DEAD)
+   if (p->flags & PF_EXITING)
return;
 
/* Allocate buffer to track faults on a per-node basis */


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/09/2014 10:45 PM, Hugh Dickins wrote:
> Sasha, you say you're getting plenty of these now, but I've only seen
> the dump for one of them, on Aug26: please post a few more dumps, so
> that we can look for commonality.

I wasn't saving older logs for this issue so I only have 2 traces from
tonight. If that's not enough please let me know and I'll try to add
a few more.

[ 1125.600123] kernel BUG at include/asm-generic/pgtable.h:724!
[ 1125.600123] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1125.600123] Dumping ftrace buffer:
[ 1125.600123](ftrace buffer empty)
[ 1125.600123] Modules linked in:
[ 1125.600123] CPU: 16 PID: 11903 Comm: trinity-c517 Not tainted 
3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135
[ 1125.600123] task: 88066173 ti: 880582c2 task.ti: 
880582c2
[ 1125.600123] RIP: 0010:[]  [] 
change_pte_range+0x4ea/0x4f0
[ 1125.600123] RSP: 0018:880582c23d68  EFLAGS: 00010246
[ 1125.600123] RAX: 000936d9a900 RBX: 7ffdb17c8000 RCX: 0100
[ 1125.600123] RDX: 000936d9a900 RSI: 7ffdb17c8000 RDI: 000936d9a900
[ 1125.600123] RBP: 880582c23dc8 R08: 8802a8f2d400 R09: 00b56000
[ 1125.600123] R10: 00020201 R11: 0008 R12: 88004dd6ee40
[ 1125.600123] R13: 8025 R14: 7ffdb180 R15: cfff
[ 1125.600123] FS:  7ffdb6382700() GS:88027820() 
knlGS:
[ 1125.600123] CS:  0010 DS:  ES:  CR0: 80050033
[ 1125.600123] CR2: 7ffdb617e60c CR3: 00050ff12000 CR4: 06a0
[ 1125.600123] DR0: 006f DR1:  DR2: 
[ 1125.600123] DR3:  DR6: 0ff0 DR7: 0600
[ 1125.600123] Stack:
[ 1125.600123]  0001 000936d9a900 0046 
8804bd549f40
[ 1125.600123]  1f989000 8802a8f2d400 88051f989000 
7f9f40604cfdb1ac8000
[ 1125.600123]  88032fcc3c58 7ffdb16df000 7ffdb16df000 
7ffdb180
[ 1125.600123] Call Trace:
[ 1125.600123]  [] change_protection+0x2b4/0x4e0
[ 1125.600123]  [] change_prot_numa+0x1b/0x40
[ 1125.600123]  [] task_numa_work+0x1f6/0x330
[ 1125.600123]  [] task_work_run+0xc4/0xf0
[ 1125.600123]  [] do_notify_resume+0x97/0xb0
[ 1125.600123]  [] int_signal+0x12/0x17
[ 1125.600123] Code: 66 90 48 8b 7d b8 e8 f6 75 22 03 48 8b 45 b0 e9 6f ff ff 
ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 
0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
[ 1125.600123] RIP  [] change_pte_range+0x4ea/0x4f0
[ 1125.600123]  RSP 

[ 3131.084176] kernel BUG at include/asm-generic/pgtable.h:724!
[ 3131.087358] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3131.090143] Dumping ftrace buffer:
[ 3131.090143](ftrace buffer empty)
[ 3131.090143] Modules linked in:
[ 3131.090143] CPU: 8 PID: 20595 Comm: trinity-c34 Not tainted 
3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135
[ 3131.090143] task: 8801ded6 ti: 8803204ec000 task.ti: 
8803204ec000
[ 3131.090143] RIP: 0010:[]  [] 
change_pte_range+0x4ea/0x4f0
[ 3131.090143] RSP: :8803204efd68  EFLAGS: 00010246
[ 3131.090143] RAX: 000971bba900 RBX: 7ffda1d4d000 RCX: 0100
[ 3131.090143] RDX: 000971bba900 RSI: 7ffda1d4d000 RDI: 000971bba900
[ 3131.120281] RBP: 8803204efdc8 R08: 88026bed8800 R09: 00b48000
[ 3131.120281] R10: 00076501 R11: 0008 R12: 8801ca071a68
[ 3131.120281] R13: 8025 R14: 7ffda1dbf000 R15: cfff
[ 3131.120281] FS:  7ffda5cd4700() GS:880277e0() 
knlGS:
[ 3131.120281] CS:  0010 DS:  ES:  CR0: 80050033
[ 3131.120281] CR2: 025d6000 CR3: 0004bcde2000 CR4: 06a0
[ 3131.120281] Stack:
[ 3131.120281]  0001 000971bba900 005c 
8800661a7b60
[ 3131.120281]  f4953000 88026bed8800 8801f4953000 
7ffda1dbf000
[ 3131.120281]  8802b3319870 7ffda1c1b000 7ffda1c1b000 
7ffda1dbf000
[ 3131.120281] Call Trace:
[ 3131.120281]  [] change_protection+0x2b4/0x4e0
[ 3131.120281]  [] change_prot_numa+0x1b/0x40
[ 3131.120281]  [] task_numa_work+0x1f6/0x330
[ 3131.120281]  [] task_work_run+0xc4/0xf0
[ 3131.120281]  [] do_notify_resume+0x97/0xb0
[ 3131.120281]  [] retint_signal+0x4d/0x9f
[ 3131.120281] Code: 66 90 48 8b 7d b8 e8 f6 75 22 03 48 8b 45 b0 e9 6f ff ff 
ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 
0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
[ 3131.120281] RIP  [] change_pte_range+0x4ea/0x4f0
[ 3131.120281]  RSP 

> And please attach a disassembly of change_protection_range() (noting
> which of the dumps it corresponds to, in case it has changed around):
> "Code" just shows a cluster of ud2s for the unlikely bugs at end of the
> function, we cannot tell at all what should be in the registers by then.

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Mel Gorman
On Tue, Sep 09, 2014 at 07:45:26PM -0700, Hugh Dickins wrote:
> On Tue, 9 Sep 2014, Sasha Levin wrote:
> > On 09/09/2014 05:33 PM, Mel Gorman wrote:
> > > On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote:
> > >> On 09/08/2014 01:18 PM, Mel Gorman wrote:
> > >>> A worse possibility is that somehow the lock is getting corrupted but
> > >>> that's also a tough sell considering that the locks should be allocated
> > >>> from a dedicated cache. I guess I could try breaking that to allocate
> > >>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
> > >>> optimistic.
> > >>
> > >> I did see ptl corruption couple days ago:
> > >>
> > >>  https://lkml.org/lkml/2014/9/4/599
> > >>
> > >> Could this be related?
> > >>
> > > 
> > > Possibly although the likely explanation then would be that there is
> > > just general corruption coming from somewhere. Even using your config
> > > and applying a patch to make linux-next boot (already in Tejun's tree)
> > > I was unable to reproduce the problem after running for several hours. I
> > > had to run trinity on tmpfs as ext4 and xfs blew up almost immediately
> > > so I have a few questions.
> > 
> > I agree it could be a case of random corruption somewhere else, it's just
> > that the amount of times this exact issue reproduced
> 
> Yes, I doubt it's random corruption; but I've been no more successful
> than Mel in working it out (I share responsibility for that VM_BUG_ON).
> 
> Sasha, you say you're getting plenty of these now, but I've only seen
> the dump for one of them, on Aug26: please post a few more dumps, so
> that we can look for commonality.
> 

It's also worth knowing that this is a test running in KVM and fake NUMA. The
hint was that the filesystem used was virtio-9p. I haven't formulated a
theory on how KVM could cause any damage here but it's interesting.

> And please attach a disassembly of change_protection_range() (noting
> which of the dumps it corresponds to, in case it has changed around):
> "Code" just shows a cluster of ud2s for the unlikely bugs at end of the
> function, we cannot tell at all what should be in the registers by then.
> 
> I've been rather assuming that the 9d340902 seen in many of the
> registers in that Aug26 dump is the pte val in question: that's
> SOFT_DIRTY|PROTNONE|RW.
> 
> I think RW on PROTNONE is unusual but not impossible (migration entry
> replacement racing with mprotect setting PROT_NONE, after it's updated
> vm_page_prot, before it's reached the page table). 

At the risk of sounding thick, I need to spell this out because I'm
having trouble seeing exactly what race you are thinking of. 

Migration entry replacement is protected against parallel NUMA hinting
updates by the page table lock (either PMD or PTE level). It's taken by
remove_migration_pte on one side and lock_pte_protection on the other.

For the mprotect case racing against migration, migration entries are not
present so change_pte_range() should ignore it. On migration completion
the VMA flags determine the permissions of the new PTE. Parallel faults
wait on the migration entry and see the correct value afterwards.

When creating migration entries, try_to_unmap calls page_check_address
which takes the PTL before doing anything. On the mprotect side,
lock_pte_protection will block before seeing PROTNONE.

I think the race you are thinking of is a migration entry created for write,
parallel mprotect(PROTNONE) and migration completion. The migration entry
was created for write but remove_migration_pte does not double check the VMA
protections and mmap_sem is not taken for write across a full migration to
protect against changes to vm_page_prot. However, change_pte_range checks
for migration entries marked for write under the PTL and marks them read if
one is encountered. The consequence is that we potentially take a spurious
fault to mark the PTE write again after migration completes but I can't
see how that causes a problem as such.

I'm missing some part of your reasoning that leads to the RW|PROTNONE :(

> But exciting though
> that line of thought is, I cannot actually bring it to a pte_mknuma bug,
> or any bug at all.
> 

On x86, PROTNONE|RW translates as GLOBAL|RW which would be unexpected. It
wouldn't cause this bug but it's sufficiently suspicious to be worth
correcting. In case this is the race you're thinking of, the patch is below.
Unfortunately, I cannot see how it would affect this problem but worth
giving a whirl anyway.

> Mel, no way can it be the cause of this bug - unless Sasha's later
> traces actually show a different stack - but I don't see the call
> to change_prot_numa() from queue_pages_range() sharing the same
> avoidance of PROT_NONE that task_numa_work() has (though it does
> have an outdated comment about PROT_NONE which should be removed).
> So I think that site probably does need PROT_NONE checking added.
> 

That site should have checked PROT_NONE but it can't be the same bug
that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
according to git grep of the trinity source.



Re: mm: BUG in unmap_page_range

2014-09-10 Thread Mel Gorman
On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote:
 SNIP, haven't digested the rest
 
 I've spotted a new trace in overnight fuzzing, it could be related to this 
 issue:
 
 [ 3494.324839] general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 [ 3494.332153] Dumping ftrace buffer:
 [ 3494.332153](ftrace buffer empty)
 [ 3494.332153] Modules linked in:
 [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted 
 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135
 [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: 
 8804d491c000
 [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 
 kernel/sched/fair.c:1956)
 [ 3494.332153] RSP: :8804d491feb8  EFLAGS: 00010206
 [ 3494.332153] RAX:  RBX: 8804bf4e8000 RCX: 
 e8e8
 [ 3494.343974] RDX: 000a RSI:  RDI: 
 8804bd6d4da8
 [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: 
 
 [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: 
 7f53e443c000
 [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: 
 88047e52b000
 [ 3494.343974] FS:  7f53e463f700() GS:880277e0() 
 knlGS:
 [ 3494.343974] CS:  0010 DS:  ES:  CR0: 8005003b
 [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: 
 06a0
 [ 3494.369895] DR0: 006f DR1:  DR2: 
 
 [ 3494.369895] DR3:  DR6: 0ff0 DR7: 
 0600
 [ 3494.380081] Stack:
 [ 3494.380081]  8804bf4e80a8 0014 7f53e4437000 
 
 [ 3494.380081]  9b976e70 88047e52bbd8 88047e52b000 
 
 [ 3494.380081]  8804d491ff28 95193d84 0002 
 8804d491ff58
 [ 3494.380081] Call Trace:
 [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1))
 [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 
 arch/x86/kernel/signal.c:758)
 [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918)
 [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 c6 
 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 49 
 f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00

Shot in dark, can you test this please? Pagetable teardown can schedule
and I'm wondering if we are trying to add hinting faults to an address
space that is in the process of going away. The TASK_DEAD check is bogus
so replacing it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ea6006..007fc1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1810,7 +1810,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int 
pages, int flags)
return;
 
/* Do not worry about placement if exiting */
 -   if (p->state == TASK_DEAD)
 +   if (p->flags & PF_EXITING)
return;
 
/* Allocate buffer to track faults on a per-node basis */


Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)

2014-09-10 Thread Sasha Levin
On 09/10/2014 08:47 AM, Mel Gorman wrote:
> That site should have checked PROT_NONE but it can't be the same bug
> that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
> according to git grep of the trinity source.

Actually, if I'm reading it correctly I think that Trinity handles mbind()
calls wrong. It passes the wrong values for mode flags and actual flags.


Thanks,
Sasha


Re: Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)

2014-09-10 Thread Dave Jones
On Wed, Sep 10, 2014 at 10:24:40AM -0400, Sasha Levin wrote:
 > On 09/10/2014 08:47 AM, Mel Gorman wrote:
 > > That site should have checked PROT_NONE but it can't be the same bug
 > > that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
 > > according to git grep of the trinity source.
 > 
 > Actually, if I'm reading it correctly I think that Trinity handles mbind()
 > calls wrong. It passes the wrong values for mode flags and actual flags.

Ugh, I think you're right.  I misinterpreted the man page that mentions
that flags like MPOL_F_STATIC_NODES/RELATIVE_NODES are OR'd with the
mode, and instead dumped those flags into .. the flags field.

So the 'flags' argument it generates is crap, because I didn't add
any of the actual correct values.

I'll fix it up, though if it's currently finding bugs, you might want
to keep the current syscalls/mbind.c for now.

Dave



Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 09:40 AM, Mel Gorman wrote:
 On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote:
 SNIP, haven't digested the rest

 I've spotted a new trace in overnight fuzzing, it could be related to this 
 issue:

 [ 3494.324839] general protection fault:  [#1] PREEMPT SMP 
 DEBUG_PAGEALLOC
 [ 3494.332153] Dumping ftrace buffer:
 [ 3494.332153](ftrace buffer empty)
 [ 3494.332153] Modules linked in:
 [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted 
 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135
 [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: 
 8804d491c000
 [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 
 kernel/sched/fair.c:1956)
 [ 3494.332153] RSP: :8804d491feb8  EFLAGS: 00010206
 [ 3494.332153] RAX:  RBX: 8804bf4e8000 RCX: 
 e8e8
 [ 3494.343974] RDX: 000a RSI:  RDI: 
 8804bd6d4da8
 [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: 
 
 [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: 
 7f53e443c000
 [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: 
 88047e52b000
 [ 3494.343974] FS:  7f53e463f700() GS:880277e0() 
 knlGS:
 [ 3494.343974] CS:  0010 DS:  ES:  CR0: 8005003b
 [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: 
 06a0
 [ 3494.369895] DR0: 006f DR1:  DR2: 
 
 [ 3494.369895] DR3:  DR6: 0ff0 DR7: 
 0600
 [ 3494.380081] Stack:
 [ 3494.380081]  8804bf4e80a8 0014 7f53e4437000 
 
 [ 3494.380081]  9b976e70 88047e52bbd8 88047e52b000 
 
 [ 3494.380081]  8804d491ff28 95193d84 0002 
 8804d491ff58
 [ 3494.380081] Call Trace:
 [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1))
 [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 
 arch/x86/kernel/signal.c:758)
 [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918)
 [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 
 c6 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 
 49 f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00
 
 Shot in dark, can you test this please? Pagetable teardown can schedule
 and I'm wondering if we are trying to add hinting faults to an address
 space that is in the process of going away. The TASK_DEAD check is bogus
 so replacing it.

Mel, I ran today's -next with both of your patches, but the issue still remains:

[ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724!
[ 3114.541857] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3114.543112] Dumping ftrace buffer:
[ 3114.544056](ftrace buffer empty)
[ 3114.545000] Modules linked in:
[ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
[ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 
88076f584000
[ 3114.549284] RIP: 0010:[952e527a]  [952e527a] 
change_pte_range+0x4ea/0x4f0
[ 3114.550028] RSP: :88076f587d68  EFLAGS: 00010246
[ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100
[ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900
[ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5
[ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0
[ 3114.550028] R13: 8025 R14: 41343000 R15: cfff
[ 3114.550028] FS:  7fabb91c8700() GS:88025ec0() 
knlGS:
[ 3114.550028] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0
[ 3114.550028] DR0: 006f DR1:  DR2: 
[ 3114.550028] DR3:  DR6: 0ff0 DR7: 00050602
[ 3114.550028] Stack:
[ 3114.550028]  0001 000314625900 0018 
8802685f2260
[ 3114.550028]  1684 8802cf973600 88061684 
41343000
[ 3114.550028]  880108805048 41005000 4120 
41343000
[ 3114.550028] Call Trace:
[ 3114.550028]  [952e5534] change_protection+0x2b4/0x4e0
[ 3114.550028]  [952ff24b] change_prot_numa+0x1b/0x40
[ 3114.550028]  [951adf16] task_numa_work+0x1f6/0x330
[ 3114.550028]  [95193de4] task_work_run+0xc4/0xf0
[ 3114.550028]  [95071477] do_notify_resume+0x97/0xb0
[ 3114.550028]  [9850f06a] int_signal+0x12/0x17
[ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff 
ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 
0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 08:47 AM, Mel Gorman wrote:
 migrate: debug patch to try identify race between migration completion and 
 mprotect
 
 A migration entry is marked as write if pte_write was true at the
 time the entry was created. The VMA protections are not double checked
 when migration entries are being removed but mprotect itself will mark
 write-migration-entries as read to avoid problems. It means we potentially
 take a spurious fault to mark these ptes write again but otherwise it's
 harmless.  Still, one dump indicates that this situation can actually
 happen so this debugging patch spits out a warning if the situation occurs
 and hopefully the resulting warning will contain a clue as to how exactly
 it happens
 
 Not-signed-off
 ---
  mm/migrate.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)
 
 diff --git a/mm/migrate.c b/mm/migrate.c
 index 09d489c..631725c 100644
 --- a/mm/migrate.c
 +++ b/mm/migrate.c
 @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct 
 vm_area_struct *vma,
    pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
   if (pte_swp_soft_dirty(*ptep))
   pte = pte_mksoft_dirty(pte);
 - if (is_write_migration_entry(entry))
 - pte = pte_mkwrite(pte);
 + if (is_write_migration_entry(entry)) {
 + /*
 +  * This WARN_ON_ONCE is temporary for the purposes of seeing if
 +  * it's a case encountered by trinity in Sasha's testing
 +  */
  + if (!(vma->vm_flags & (VM_WRITE)))
 + WARN_ON_ONCE(1);
 + else
 + pte = pte_mkwrite(pte);
 + }
  #ifdef CONFIG_HUGETLB_PAGE
   if (PageHuge(new)) {
   pte = pte_mkhuge(pte);

I seem to have hit this warning:

[ 4782.617806] WARNING: CPU: 10 PID: 21180 at mm/migrate.c:155 
remove_migration_pte+0x3f7/0x420()
[ 4782.619315] Modules linked in:
[ 4782.622189]
[ 4782.622501] CPU: 10 PID: 21180 Comm: trinity-main Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
[ 4782.624344]  0009 8800193eb770 a04c742a 

[ 4782.627801]  8800193eb7a8 9d16e55d 7f2458d89000 
880120959600
[ 4782.629283]  88012b02c000 ea002abeab00 88063118da90 
8800193eb7b8
[ 4782.631353] Call Trace:
[ 4782.633789]  [a04c742a] dump_stack+0x4e/0x7a
[ 4782.634314]  [9d16e55d] warn_slowpath_common+0x7d/0xa0
[ 4782.634877]  [9d16e63a] warn_slowpath_null+0x1a/0x20
[ 4782.635430]  [9d315487] remove_migration_pte+0x3f7/0x420
[ 4782.636042]  [9d2e99cf] rmap_walk+0xef/0x380
[ 4782.636544]  [9d3147f1] remove_migration_ptes+0x41/0x50
[ 4782.637130]  [9d315090] ? 
__migration_entry_wait.isra.24+0x160/0x160
[ 4782.639928]  [9d3154b0] ? remove_migration_pte+0x420/0x420
[ 4782.640616]  [9d31671b] move_to_new_page+0x16b/0x230
[ 4782.641251]  [9d2e9e8c] ? try_to_unmap+0x6c/0xf0
[ 4782.643950]  [9d2e88a0] ? try_to_unmap_nonlinear+0x5c0/0x5c0
[ 4782.644690]  [9d2e70a0] ? invalid_migration_vma+0x30/0x30
[ 4782.645273]  [9d2e82e0] ? page_remove_rmap+0x320/0x320
[ 4782.646072]  [9d31717c] migrate_pages+0x85c/0x930
[ 4782.646701]  [9d2d0e20] ? isolate_freepages_block+0x410/0x410
[ 4782.647407]  [9d2cfa60] ? arch_local_save_flags+0x30/0x30
[ 4782.648114]  [9d2d1803] compact_zone+0x4d3/0x8a0
[ 4782.650157]  [9d2d1c2f] compact_zone_order+0x5f/0xa0
[ 4782.651014]  [9d2d1f87] try_to_compact_pages+0x127/0x2f0
[ 4782.651656]  [9d2b0c98] __alloc_pages_direct_compact+0x68/0x200
[ 4782.652313]  [9d2b17ca] __alloc_pages_nodemask+0x99a/0xd90
[ 4782.652916]  [9d300a1c] alloc_pages_vma+0x13c/0x270
[ 4782.653618]  [9d31d914] ? do_huge_pmd_wp_page+0x494/0xc90
[ 4782.654487]  [9d31d914] do_huge_pmd_wp_page+0x494/0xc90
[ 4782.656045]  [9d320d20] ? __mem_cgroup_count_vm_event+0xd0/0x240
[ 4782.657089]  [9d2dcb7d] handle_mm_fault+0x8bd/0xc50
[ 4782.660931]  [9d1d26e6] ? __lock_is_held+0x56/0x80
[ 4782.662695]  [9d0c7bc7] __do_page_fault+0x1b7/0x660
[ 4782.663259]  [9d1cdc5e] ? put_lock_stats.isra.13+0xe/0x30
[ 4782.663851]  [9d1abf41] ? vtime_account_user+0x91/0xa0
[ 4782.664419]  [9d2a2c35] ? context_tracking_user_exit+0xb5/0x1b0
[ 4782.665119]  [9db6e103] ? __this_cpu_preempt_check+0x13/0x20
[ 4782.665969]  [9d1ce2e2] ? trace_hardirqs_off_caller+0xe2/0x1b0
[ 4782.34]  [9d0c8141] trace_do_page_fault+0x51/0x2b0
[ 4782.667257]  [9d0bee83] do_async_page_fault+0x63/0xd0
[ 4782.667871]  [a0511018] async_page_fault+0x28/0x30

Although it wasn't followed by anything else, and I've seen the original issue
getting triggered without this WARN showing up, so it seems like a different,
unrelated issue?


Thanks,
Sasha

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Hugh Dickins
On Wed, 10 Sep 2014, Sasha Levin wrote:
 On 09/09/2014 10:45 PM, Hugh Dickins wrote:
  Sasha, you say you're getting plenty of these now, but I've only seen
  the dump for one of them, on Aug26: please post a few more dumps, so
  that we can look for commonality.
 
 I wasn't saving older logs for this issue so I only have 2 traces from
 tonight. If that's not enough please let me know and I'll try to add
 a few more.

Thanks, these two are useful, mainly because the register contents most
likely to be ptes are in both of these ...900, with no sign of a ...902.

So the RW bit I got excited about yesterday is clearly not necessary for
the bug (though it's still possible that it was good for implicating page
migration, and page migration still plays a part in the story).

  And please attach a disassembly of change_protection_range() (noting
  which of the dumps it corresponds to, in case it has changed around):
 "Code" just shows a cluster of ud2s for the unlikely bugs at end of the
  function, we cannot tell at all what should be in the registers by then.
 
 change_protection_range() got inlined into change_protection(), it applies to
 both traces above:

Thanks for supplying, but the change in inlining means that
change_protection_range() and change_protection() are no longer
relevant for these traces, we now need to see change_pte_range()
instead, to confirm that what I expect are ptes are indeed ptes.

If you can include line numbers (objdump -ld) in the disassembly, so
much the better, but should be decipherable without.  (Or objdump -Sd
for source, but I often find that harder to unscramble, can't say why.)

Thanks,
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: BUG in unmap_page_range

2014-09-10 Thread Hugh Dickins
On Wed, 10 Sep 2014, Mel Gorman wrote:
> On Tue, Sep 09, 2014 at 07:45:26PM -0700, Hugh Dickins wrote:
> > 
> > I've been rather assuming that the 9d340902 seen in many of the
> > registers in that Aug26 dump is the pte val in question: that's
> > SOFT_DIRTY|PROTNONE|RW.

The 900s in the latest dumps imply that that 902 was not important.
(If any of them are in fact the pte val.)

  
> > I think RW on PROTNONE is unusual but not impossible (migration entry
> > replacement racing with mprotect setting PROT_NONE, after it's updated
> > vm_page_prot, before it's reached the page table). 
> 
> At the risk of sounding thick, I need to spell this out because I'm
> having trouble seeing exactly what race you are thinking of. 
> 
> Migration entry replacement is protected against parallel NUMA hinting
> updates by the page table lock (either PMD or PTE level). It's taken by
> remove_migration_pte on one side and lock_pte_protection on the other.
> 
> For the mprotect case racing against migration, migration entries are not
> present so change_pte_range() should ignore it. On migration completion
> the VMA flags determine the permissions of the new PTE. Parallel faults
> wait on the migration entry and see the correct value afterwards.
> 
> When creating migration entries, try_to_unmap calls page_check_address
> which takes the PTL before doing anything. On the mprotect side,
> lock_pte_protection will block before seeing PROTNONE.
> 
> I think the race you are thinking of is a migration entry created for write,
> parallel mprotect(PROTNONE) and migration completion. The migration entry
> was created for write but remove_migration_pte does not double check the VMA
> protections and mmap_sem is not taken for write across a full migration to
> protect against changes to vm_page_prot.

Yes, the "if (is_write_migration_entry(entry)) pte = pte_mkwrite(pte);"
arguably should take the latest value of vma->vm_page_prot into account.

> However, change_pte_range checks
> for migration entries marked for write under the PTL and marks them read if
> one is encountered. The consequence is that we potentially take a spurious
> fault to mark the PTE write again after migration completes but I can't
> see how that causes a problem as such.

Yes, once mprotect's page table walk reaches that pte, it updates it
correctly along with all the others nearby (which were not migrated),
removing the temporary oddity.

 
> I'm missing some part of your reasoning that leads to the RW|PROTNONE :(

You don't appear to be missing it at all, you are seeing the possibility
of an RW|PROTNONE yourself, and how it gets corrected afterwards
("corrected" in quotes because without the present bit, it's not an error).

 
> > But exciting though
> > that line of thought is, I cannot actually bring it to a pte_mknuma bug,
> > or any bug at all.
  

And I wasn't saying that it led to this bug, just that it was an oddity
worth thinking about, and worth mentioning to you, in case you could work
out a way it might lead to the bug, when I had failed to do so.

But we now (almost) know that 902 is irrelevant to this bug anyway.

 
> On x86, PROTNONE|RW translates as GLOBAL|RW which would be unexpected. It

GLOBAL once PRESENT is set, but PROTNONE so long as it is not.

> wouldn't cause this bug but it's sufficiently suspicious to be worth
> correcting. In case this is the race you're thinking of, the patch is below.
> Unfortunately, I cannot see how it would affect this problem but worth
> giving a whirl anyway.
> 
> > Mel, no way can it be the cause of this bug - unless Sasha's later
> > traces actually show a different stack - but I don't see the call
> > to change_prot_numa() from queue_pages_range() sharing the same
> > avoidance of PROT_NONE that task_numa_work() has (though it does
> > have an outdated comment about PROT_NONE which should be removed).
> > So I think that site probably does need PROT_NONE checking added.
> 
> That site should have checked PROT_NONE but it can't be the same bug
> that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY
> according to git grep of the trinity source.

Yes, queue_pages_range() is not implicated in any of Sasha's traces.
Something to fix, but not relevant to this bug.

 
> Worth adding this to the debugging mix? It should warn if it encounters
> the problem but avoid adding the problematic RW bit.
> 
> ---8<---
> migrate: debug patch to try identify race between migration completion and
> mprotect
> 
> A migration entry is marked as write if pte_write was true at the
> time the entry was created. The VMA protections are not double checked
> when migration entries are being removed but mprotect itself will mark
> write-migration-entries as read to avoid problems. It means we potentially
> take a spurious fault to mark these ptes write again but otherwise it's
> harmless.  Still, one dump indicates that this situation can actually
> happen so this debugging patch spits out a warning if the situation occurs
> and hopefully the resulting warning will contain a clue as to how exactly
> it 

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 03:09 PM, Hugh Dickins wrote:
> Thanks for supplying, but the change in inlining means that
> change_protection_range() and change_protection() are no longer
> relevant for these traces, we now need to see change_pte_range()
> instead, to confirm that what I expect are ptes are indeed ptes.
> 
> If you can include line numbers (objdump -ld) in the disassembly, so
> much the better, but should be decipherable without.  (Or objdump -Sd
> for source, but I often find that harder to unscramble, can't say why.)

Here it is. Note that the source includes both of Mel's debug patches.
For reference, here's one trace of the issue with those patches:

[ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724!
[ 3114.541857] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3114.543112] Dumping ftrace buffer:
[ 3114.544056](ftrace buffer empty)
[ 3114.545000] Modules linked in:
[ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
[ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 
88076f584000
[ 3114.549284] RIP: 0010:[952e527a]  [952e527a] 
change_pte_range+0x4ea/0x4f0
[ 3114.550028] RSP: :88076f587d68  EFLAGS: 00010246
[ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100
[ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900
[ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5
[ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0
[ 3114.550028] R13: 8025 R14: 41343000 R15: cfff
[ 3114.550028] FS:  7fabb91c8700() GS:88025ec0() 
knlGS:
[ 3114.550028] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0
[ 3114.550028] DR0: 006f DR1:  DR2: 
[ 3114.550028] DR3:  DR6: 0ff0 DR7: 00050602
[ 3114.550028] Stack:
[ 3114.550028]  0001 000314625900 0018 
8802685f2260
[ 3114.550028]  1684 8802cf973600 88061684 
41343000
[ 3114.550028]  880108805048 41005000 4120 
41343000
[ 3114.550028] Call Trace:
[ 3114.550028]  [952e5534] change_protection+0x2b4/0x4e0
[ 3114.550028]  [952ff24b] change_prot_numa+0x1b/0x40
[ 3114.550028]  [951adf16] task_numa_work+0x1f6/0x330
[ 3114.550028]  [95193de4] task_work_run+0xc4/0xf0
[ 3114.550028]  [95071477] do_notify_resume+0x97/0xb0
[ 3114.550028]  [9850f06a] int_signal+0x12/0x17
[ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff 
ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 
0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
[ 3114.550028] RIP  [952e527a] change_pte_range+0x4ea/0x4f0
[ 3114.550028]  RSP 88076f587d68

And the disassembly:

 change_pte_range:
change_pte_range():
/home/sasha/linux-next/mm/mprotect.c:70
   0:   e8 00 00 00 00  callq  5 change_pte_range+0x5
1: R_X86_64_PC32__fentry__-0x4
   5:   55  push   %rbp
   6:   48 89 e5mov%rsp,%rbp
   9:   41 57   push   %r15
   b:   41 56   push   %r14
   d:   49 89 cemov%rcx,%r14
  10:   41 55   push   %r13
  12:   4d 89 c5mov%r8,%r13
  15:   41 54   push   %r12
  17:   49 89 f4mov%rsi,%r12
  1a:   53  push   %rbx
  1b:   48 89 d3mov%rdx,%rbx
  1e:   48 83 ec 38 sub$0x38,%rsp
/home/sasha/linux-next/mm/mprotect.c:71
  22:   48 8b 47 40 mov0x40(%rdi),%rax
/home/sasha/linux-next/mm/mprotect.c:70
  26:   48 89 7d c8 mov%rdi,-0x38(%rbp)
lock_pte_protection():
/home/sasha/linux-next/mm/mprotect.c:53
  2a:   8b 4d 10mov0x10(%rbp),%ecx
change_pte_range():
/home/sasha/linux-next/mm/mprotect.c:70
  2d:   44 89 4d c4 mov%r9d,-0x3c(%rbp)
/home/sasha/linux-next/mm/mprotect.c:71
  31:   48 89 45 d0 mov%rax,-0x30(%rbp)
lock_pte_protection():
/home/sasha/linux-next/mm/mprotect.c:53
  35:   85 c9   test   %ecx,%ecx
  37:   0f 84 6b 03 00 00   je 3a8 change_pte_range+0x3a8
pmd_to_page():
/home/sasha/linux-next/include/linux/mm.h:1538
  3d:   48 89 f7mov%rsi,%rdi
  40:   48 81 e7 00 f0 ff ffand$0xf000,%rdi
  47:   e8 00 00 00 00  callq  4c change_pte_range+0x4c
48: R_X86_64_PC32   __phys_addr-0x4
  4c:   48 ba 00 00 00 00 00movabs $0xea00,%rdx
  53:   ea ff ff
  56:   48 c1 e8 0c shr

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Hugh Dickins
On Wed, 10 Sep 2014, Sasha Levin wrote:
 On 09/10/2014 03:09 PM, Hugh Dickins wrote:
  Thanks for supplying, but the change in inlining means that
  change_protection_range() and change_protection() are no longer
  relevant for these traces, we now need to see change_pte_range()
  instead, to confirm that what I expect are ptes are indeed ptes.
  
  If you can include line numbers (objdump -ld) in the disassembly, so
  much the better, but should be decipherable without.  (Or objdump -Sd
  for source, but I often find that harder to unscramble, can't say why.)
 
 Here it is. Note that the source includes both of Mel's debug patches.
 For reference, here's one trace of the issue with those patches:
 
 [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724!
 [ 3114.541857] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 [ 3114.543112] Dumping ftrace buffer:
 [ 3114.544056](ftrace buffer empty)
 [ 3114.545000] Modules linked in:
 [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW  
 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137
 [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 
 88076f584000
 [ 3114.549284] RIP: 0010:[952e527a]  [952e527a] 
 change_pte_range+0x4ea/0x4f0
 [ 3114.550028] RSP: :88076f587d68  EFLAGS: 00010246
 [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 
 0100
 [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 
 000314625900
 [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 
 00b5
 [ 3114.550028] R10: 00032c01 R11: 0008 R12: 
 8802a81070c0
 [ 3114.550028] R13: 8025 R14: 41343000 R15: 
 cfff
 [ 3114.550028] FS:  7fabb91c8700() GS:88025ec0() 
 knlGS:
 [ 3114.550028] CS:  0010 DS:  ES:  CR0: 8005003b
 [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 
 06a0
 [ 3114.550028] DR0: 006f DR1:  DR2: 
 
 [ 3114.550028] DR3:  DR6: 0ff0 DR7: 
 00050602
 [ 3114.550028] Stack:
 [ 3114.550028]  0001 000314625900 0018 
 8802685f2260
 [ 3114.550028]  1684 8802cf973600 88061684 
 41343000
 [ 3114.550028]  880108805048 41005000 4120 
 41343000
 [ 3114.550028] Call Trace:
 [ 3114.550028]  [952e5534] change_protection+0x2b4/0x4e0
 [ 3114.550028]  [952ff24b] change_prot_numa+0x1b/0x40
 [ 3114.550028]  [951adf16] task_numa_work+0x1f6/0x330
 [ 3114.550028]  [95193de4] task_work_run+0xc4/0xf0
 [ 3114.550028]  [95071477] do_notify_resume+0x97/0xb0
 [ 3114.550028]  [9850f06a] int_signal+0x12/0x17
 [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff 
 ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 
 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41
 [ 3114.550028] RIP  [952e527a] change_pte_range+0x4ea/0x4f0
 [ 3114.550028]  RSP 88076f587d68
 
 And the disassembly:
...
 /home/sasha/linux-next/mm/mprotect.c:105
  31d: 48 8b 4d a8 mov-0x58(%rbp),%rcx
  321: 81 e1 01 03 00 00   and$0x301,%ecx
  327: 48 81 f9 00 02 00 00cmp$0x200,%rcx
  32e: 0f 84 0b ff ff ff   je 23f change_pte_range+0x23f
 pte_val():
 /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450
  334: 48 83 3d 00 00 00 00cmpq   $0x0,0x0(%rip)# 33c 
 change_pte_range+0x33c
  33b: 00
   337: R_X86_64_PC32  pv_mmu_ops+0xe3
 ptep_set_numa():
 /home/sasha/linux-next/include/asm-generic/pgtable.h:740
  33c: 49 8b 3c 24 mov(%r12),%rdi
 pte_val():
 /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450
  340: 0f 84 12 01 00 00   je 458 change_pte_range+0x458
  346: ff 14 25 00 00 00 00callq  *0x0
   349: R_X86_64_32S   pv_mmu_ops+0xe8
 pte_mknuma():
 /home/sasha/linux-next/include/asm-generic/pgtable.h:724
  34d: a8 01   test   $0x1,%al
  34f: 0f 84 95 01 00 00   je 4ea change_pte_range+0x4ea
...
 ptep_set_numa():
 /home/sasha/linux-next/include/asm-generic/pgtable.h:724
  4ea: 0f 0b   ud2

Thanks, yes, there is enough in there to be sure that the ...900 is
indeed the oldpte.  I wasn't expecting that pv_mmu_ops function call,
but there's no evidence that it does anything worse than just return
in %rax what it's given in %rdi; and the second long on the stack is
the -0x58(%rbp) from which oldpte is retrieved for !pte_numa(oldpte)
at the beginning of the extract above.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  

Re: mm: BUG in unmap_page_range

2014-09-10 Thread Sasha Levin
On 09/10/2014 03:36 PM, Hugh Dickins wrote:
> > migrate: debug patch to try identify race between migration completion and
> > mprotect
> > 
> > A migration entry is marked as write if pte_write was true at the
> > time the entry was created. The VMA protections are not double checked
> > when migration entries are being removed but mprotect itself will mark
> > write-migration-entries as read to avoid problems. It means we potentially
> > take a spurious fault to mark these ptes write again but otherwise it's
> > harmless.  Still, one dump indicates that this situation can actually
> > happen so this debugging patch spits out a warning if the situation occurs
> > and hopefully the resulting warning will contain a clue as to how exactly
> > it happens
> > 
> > Not-signed-off
> > ---
> >  mm/migrate.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 09d489c..631725c 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
> >  		pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> >  		if (pte_swp_soft_dirty(*ptep))
> >  			pte = pte_mksoft_dirty(pte);
> > -		if (is_write_migration_entry(entry))
> > -			pte = pte_mkwrite(pte);
> > +		if (is_write_migration_entry(entry)) {
> > +			/*
> > +			 * This WARN_ON_ONCE is temporary for the purposes of seeing if
> > +			 * it's a case encountered by trinity in Sasha's testing
> > +			 */
> > +			if (!(vma->vm_flags & (VM_WRITE)))
> > +				WARN_ON_ONCE(1);
> > +			else
> > +				pte = pte_mkwrite(pte);
> > +		}
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  		if (PageHuge(new)) {
> >  			pte = pte_mkhuge(pte);
> > 
> Right, and Sasha reports that that can fire, but he sees the bug
> with this patch in and without that firing.

I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
VMA information out, and got the following:

[ 4018.870776] vma 8801a0f1e800 start 7f3fd0ca7000 end 7f3fd16a7000
[ 4018.870776] next 8804e1b89800 prev 88008cd9a000 mm 88054b17d000
[ 4018.870776] prot 120 anon_vma 880bc858a200 vm_ops   (null)
[ 4018.870776] pgoff 41bc8 file   (null) private_data   (null)
[ 4018.879731] flags: 0x8100070(mayread|maywrite|mayexec|account)
[ 4018.881324] [ cut here ]
[ 4018.882612] kernel BUG at mm/migrate.c:155!
[ 4018.883649] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 4018.889647] Dumping ftrace buffer:
[ 4018.890323](ftrace buffer empty)
[ 4018.890323] Modules linked in:
[ 4018.890323] CPU: 4 PID: 9966 Comm: trinity-main Tainted: GW  
3.17.0-rc4-next-20140910-sasha-00042-ga4bad9b-dirty #1140
[ 4018.890323] task: 880695b83000 ti: 880560c44000 task.ti: 
880560c44000
[ 4018.890323] RIP: 0010:[9b2fd4c1]  [9b2fd4c1] 
remove_migration_pte+0x3e1/0x3f0
[ 4018.890323] RSP: :880560c477c8  EFLAGS: 00010292
[ 4018.890323] RAX: 0001 RBX: 7f3fd129b000 RCX: 
[ 4018.890323] RDX: 0001 RSI: 9e4ba395 RDI: 0001
[ 4018.890323] RBP: 880560c47800 R08: 0001 R09: 0001
[ 4018.890323] R10: 00045401 R11: 0001 R12: 8801a0f1e800
[ 4018.890323] R13: 88054b17d000 R14: ea000478eb40 R15: 880122bcf070
[ 4018.890323] FS:  7f3fd55bb700() GS:8803d6a0() 
knlGS:
[ 4018.890323] CS:  0010 DS:  ES:  CR0: 8005003b
[ 4018.890323] CR2: 00fcbca8 CR3: 000561bab000 CR4: 06a0
[ 4018.890323] DR0: 006f DR1:  DR2: 
[ 4018.890323] DR3:  DR6: 0ff0 DR7: 0600
[ 4018.890323] Stack:
[ 4018.890323]  ea00046ed980 88011079c4d8 ea000478eb40 
880560c47858
[ 4018.890323]  88019fde0330 000421bc 8801a0f1e800 
880560c47848
[ 4018.890323]  9b2d1b0f 880bc858a200 880560c47850 
ea000478eb40
[ 4018.890323] Call Trace:
[ 4018.890323]  [9b2d1b0f] rmap_walk+0x22f/0x380
[ 4018.890323]  [9b2fc841] remove_migration_ptes+0x41/0x50
[ 4018.890323]  [9b2fd0e0] ? 
__migration_entry_wait.isra.24+0x160/0x160
[ 4018.890323]  [9b2fd4d0] ? remove_migration_pte+0x3f0/0x3f0
[ 4018.890323]  [9b2fe73b] move_to_new_page+0x16b/0x230
[ 4018.890323]  [9b2d1e8c] ? try_to_unmap+0x6c/0xf0
[ 4018.890323]  [9b2d08a0] ? try_to_unmap_nonlinear+0x5c0/0x5c0
[ 4018.890323]  [9b2cf0a0] ? invalid_migration_vma+0x30/0x30
[ 4018.890323]  [9b2d02e0] ? page_remove_rmap+0x320/0x320
[ 4018.890323]  [9b2ff19c] migrate_pages+0x85c/0x930
[ 4018.890323]  [9b2b8e20] ? isolate_freepages_block+0x410/0x410
[ 4018.890323]  [9b2b7a60] ? arch_local_save_flags+0x30/0x30
[ 4018.890323]  [9b2b9803] 

Re: mm: BUG in unmap_page_range

2014-09-09 Thread Hugh Dickins
On Tue, 9 Sep 2014, Sasha Levin wrote:
> On 09/09/2014 05:33 PM, Mel Gorman wrote:
> > On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote:
> >> On 09/08/2014 01:18 PM, Mel Gorman wrote:
> >>> A worse possibility is that somehow the lock is getting corrupted but
> >>> that's also a tough sell considering that the locks should be allocated
> >>> from a dedicated cache. I guess I could try breaking that to allocate
> >>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
> >>> optimistic.
> >>
> >> I did see ptl corruption couple days ago:
> >>
> >>https://lkml.org/lkml/2014/9/4/599
> >>
> >> Could this be related?
> >>
> > 
> > Possibly although the likely explanation then would be that there is
> > just general corruption coming from somewhere. Even using your config
> > and applying a patch to make linux-next boot (already in Tejun's tree)
> > I was unable to reproduce the problem after running for several hours. I
> > had to run trinity on tmpfs as ext4 and xfs blew up almost immediately
> > so I have a few questions.
> 
> I agree it could be a case of random corruption somewhere else, it's just
> that the amount of times this exact issue reproduced

Yes, I doubt it's random corruption; but I've been no more successful
than Mel in working it out (I share responsibility for that VM_BUG_ON).

Sasha, you say you're getting plenty of these now, but I've only seen
the dump for one of them, on Aug26: please post a few more dumps, so
that we can look for commonality.

And please attach a disassembly of change_protection_range() (noting
which of the dumps it corresponds to, in case it has changed around):
"Code" just shows a cluster of ud2s for the unlikely bugs at end of the
function, we cannot tell at all what should be in the registers by then.

I've been rather assuming that the 9d340902 seen in many of the
registers in that Aug26 dump is the pte val in question: that's
SOFT_DIRTY|PROTNONE|RW.

I think RW on PROTNONE is unusual but not impossible (migration entry
replacement racing with mprotect setting PROT_NONE, after it's updated
vm_page_prot, before it's reached the page table).  But exciting though
that line of thought is, I cannot actually bring it to a pte_mknuma bug,
or any bug at all.

Mel, no way can it be the cause of this bug - unless Sasha's later
traces actually show a different stack - but I don't see the call
to change_prot_numa() from queue_pages_range() sharing the same
avoidance of PROT_NONE that task_numa_work() has (though it does
have an outdated comment about PROT_NONE which should be removed).
So I think that site probably does need PROT_NONE checking added.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: BUG in unmap_page_range

2014-09-09 Thread Sasha Levin
On 09/09/2014 05:33 PM, Mel Gorman wrote:
> On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote:
>> On 09/08/2014 01:18 PM, Mel Gorman wrote:
>>> A worse possibility is that somehow the lock is getting corrupted but
>>> that's also a tough sell considering that the locks should be allocated
>>> from a dedicated cache. I guess I could try breaking that to allocate
>>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
>>> optimistic.
>>
>> I did see ptl corruption couple days ago:
>>
>>  https://lkml.org/lkml/2014/9/4/599
>>
>> Could this be related?
>>
> 
> Possibly although the likely explanation then would be that there is
> just general corruption coming from somewhere. Even using your config
> and applying a patch to make linux-next boot (already in Tejun's tree)
> I was unable to reproduce the problem after running for several hours. I
> had to run trinity on tmpfs as ext4 and xfs blew up almost immediately
> so I have a few questions.

I agree it could be a case of random corruption somewhere else, it's just
that the amount of times this exact issue reproduced

> 1. What filesystem are you using?

virtio-9p. I'm willing to try something more "common" if you feel this could
be related, but I haven't seen any issues coming out of 9p in a while now.

> 2. What compiler in case it's an experimental compiler? I ask because I
>think I saw a patch from you adding support so that the kernel would
>build with gcc 5

Right, I've been testing with gcc 5 as well as Debian's gcc 4.7.2, it
reproduces with both compilers.

> 3. Does your hardware support TSX or anything similarly funky that would
>potentially affect locking?

Not that I know of, here are the cpu flags for reference:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush 
dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc 
arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt 
lahf_lm ida epb dtherm tpr_shadow vnmi flexpriority ept vpid

> 4. How many sockets are on your test machine in case reproducing it
>depends in a machine large enough to open a timing race?

128 sockets.

> As I'm drawing a blank on what would trigger the bug I'm hoping I can
> reproduce this locally and experiement a bit.

I was thinking about sneaking in something like the following (untested) patch
to see if it's really memory corruption that is wiping out stuff:
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0f9724c..0205655 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -25,6 +25,7 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_IOMAP		_PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
+#define _PAGE_BIT_SANITY	_PAGE_BIT_SOFTW3 /* Memory corruption canary */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */

@@ -66,6 +67,8 @@
 #define _PAGE_HIDDEN   (_AT(pteval_t, 0))
 #endif

+#define _PAGE_SANITY   (_AT(pteval_t, 1) << _PAGE_BIT_SANITY)
+
 /*
  * The same hidden bit is used by kmemcheck, but since kmemcheck
  * works on kernel pages while soft-dirty engine on user space,
@@ -312,7 +315,7 @@ static inline pmdval_t pmd_flags(pmd_t pmd)

 static inline pte_t native_make_pte(pteval_t val)
 {
-   return (pte_t) { .pte = val };
+   return (pte_t) { .pte = val | _PAGE_SANITY };
 }

 static inline pteval_t native_pte_val(pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ffea570..bc897a1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -720,6 +720,8 @@ static inline pmd_t pmd_mknonnuma(pmd_t pmd)
 static inline pte_t pte_mknuma(pte_t pte)
 {
pteval_t val = pte_val(pte);
+
+   VM_BUG_ON(!(val & _PAGE_SANITY));

VM_BUG_ON(!(val & _PAGE_PRESENT));

Does it make sense at all?


Thanks,
Sasha



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: BUG in unmap_page_range

2014-09-09 Thread Mel Gorman
On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote:
> On 09/08/2014 01:18 PM, Mel Gorman wrote:
> > A worse possibility is that somehow the lock is getting corrupted but
> > that's also a tough sell considering that the locks should be allocated
> > from a dedicated cache. I guess I could try breaking that to allocate
> > one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
> > optimistic.
> 
> I did see ptl corruption couple days ago:
> 
>   https://lkml.org/lkml/2014/9/4/599
> 
> Could this be related?
> 

Possibly although the likely explanation then would be that there is
just general corruption coming from somewhere. Even using your config
and applying a patch to make linux-next boot (already in Tejun's tree)
I was unable to reproduce the problem after running for several hours. I
had to run trinity on tmpfs as ext4 and xfs blew up almost immediately
so I have a few questions.

1. What filesystem are you using?

2. What compiler in case it's an experimental compiler? I ask because I
   think I saw a patch from you adding support so that the kernel would
   build with gcc 5

3. Does your hardware support TSX or anything similarly funky that would
   potentially affect locking?

4. How many sockets are on your test machine in case reproducing it
   depends in a machine large enough to open a timing race?

As I'm drawing a blank on what would trigger the bug I'm hoping I can
reproduce this locally and experiement a bit.

Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: mm: BUG in unmap_page_range

2014-09-09 Thread Sasha Levin
On 09/09/2014 05:33 PM, Mel Gorman wrote:
 On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote:
 On 09/08/2014 01:18 PM, Mel Gorman wrote:
 A worse possibility is that somehow the lock is getting corrupted but
 that's also a tough sell considering that the locks should be allocated
 from a dedicated cache. I guess I could try breaking that to allocate
 one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
 optimistic.

 I did see ptl corruption couple days ago:

  https://lkml.org/lkml/2014/9/4/599

 Could this be related?

 
 Possibly although the likely explanation then would be that there is
 just general corruption coming from somewhere. Even using your config
 and applying a patch to make linux-next boot (already in Tejun's tree)
 I was unable to reproduce the problem after running for several hours. I
 had to run trinity on tmpfs as ext4 and xfs blew up almost immediately
 so I have a few questions.

I agree it could be a case of random corruption somewhere else, it's just
that the amount of times this exact issue reproduced

 1. What filesystem are you using?

virtio-9p. I'm willing to try something more common if you feel this could
be related, but I haven't seen any issues coming out of 9p in a while now.

 2. What compiler in case it's an experimental compiler? I ask because I
think I saw a patch from you adding support so that the kernel would
build with gcc 5

Right, I've been testing with gcc 5 as well as Debian's gcc 4.7.2, it
reproduces with both compilers.

 3. Does your hardware support TSX or anything similarly funky that would
potentially affect locking?

Not that I know of, here are the cpu flags for reference:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush 
dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc 
arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt 
lahf_lm ida epb dtherm tpr_shadow vnmi flexpriority ept vpid

 4. How many sockets are on your test machine in case reproducing it
depends in a machine large enough to open a timing race?

128 sockets.

 As I'm drawing a blank on what would trigger the bug I'm hoping I can
 reproduce this locally and experiement a bit.

I was thinking about sneaking in something like the following (untested) patch
to see if it's really memory corruption that is wiping out stuff:

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0f9724c..0205655 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -25,6 +25,7 @@
 #define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_IOMAP		_PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
+#define _PAGE_BIT_SANITY	_PAGE_BIT_SOFTW3 /* Memory corruption canary */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX		63 /* No execute: only valid after cpuid check */

@@ -66,6 +67,8 @@
 #define _PAGE_HIDDEN	(_AT(pteval_t, 0))
 #endif

+#define _PAGE_SANITY	(_AT(pteval_t, 1) << _PAGE_BIT_SANITY)
+
 /*
  * The same hidden bit is used by kmemcheck, but since kmemcheck
  * works on kernel pages while soft-dirty engine on user space,
@@ -312,7 +315,7 @@ static inline pmdval_t pmd_flags(pmd_t pmd)

 static inline pte_t native_make_pte(pteval_t val)
 {
-	return (pte_t) { .pte = val };
+	return (pte_t) { .pte = val | _PAGE_SANITY };
 }

 static inline pteval_t native_pte_val(pte_t pte)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ffea570..bc897a1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -720,6 +720,8 @@ static inline pmd_t pmd_mknonnuma(pmd_t pmd)
 static inline pte_t pte_mknuma(pte_t pte)
 {
 	pteval_t val = pte_val(pte);
+
+	VM_BUG_ON(!(val & _PAGE_SANITY));

 	VM_BUG_ON(!(val & _PAGE_PRESENT));

Does it make sense at all?


Thanks,
Sasha





Re: mm: BUG in unmap_page_range

2014-09-09 Thread Hugh Dickins
On Tue, 9 Sep 2014, Sasha Levin wrote:
> On 09/09/2014 05:33 PM, Mel Gorman wrote:
>> On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote:
>>> On 09/08/2014 01:18 PM, Mel Gorman wrote:
>>>> A worse possibility is that somehow the lock is getting corrupted but
>>>> that's also a tough sell considering that the locks should be allocated
>>>> from a dedicated cache. I guess I could try breaking that to allocate
>>>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
>>>> optimistic.
>>>
>>> I did see ptl corruption a couple of days ago:
>>>
>>>     https://lkml.org/lkml/2014/9/4/599
>>>
>>> Could this be related?
>>
>> Possibly, although the likely explanation then would be that there is
>> just general corruption coming from somewhere. Even using your config
>> and applying a patch to make linux-next boot (already in Tejun's tree)
>> I was unable to reproduce the problem after running for several hours. I
>> had to run trinity on tmpfs as ext4 and xfs blew up almost immediately,
>> so I have a few questions.
>
> I agree it could be a case of random corruption somewhere else; it's just
> that this exact issue has reproduced too many times for that to be likely.

Yes, I doubt it's random corruption; but I've been no more successful
than Mel in working it out (I share responsibility for that VM_BUG_ON).

Sasha, you say you're getting plenty of these now, but I've only seen
the dump for one of them, on Aug26: please post a few more dumps, so
that we can look for commonality.

And please attach a disassembly of change_protection_range() (noting
which of the dumps it corresponds to, in case it has changed around):
the Code: line just shows a cluster of ud2s for the unlikely bugs at the end
of the function, so we cannot tell at all what should be in the registers by then.

I've been rather assuming that the 9d340902 seen in many of the
registers in that Aug26 dump is the pte val in question: that's
SOFT_DIRTY|PROTNONE|RW.

I think RW on PROTNONE is unusual but not impossible (migration entry
replacement racing with mprotect setting PROT_NONE, after it's updated
vm_page_prot, before it's reached the page table).  But exciting though
that line of thought is, I cannot actually bring it to a pte_mknuma bug,
or any bug at all.

Mel, no way can it be the cause of this bug - unless Sasha's later
traces actually show a different stack - but I don't see the call
to change_prot_numa() from queue_pages_range() sharing the same
avoidance of PROT_NONE that task_numa_work() has (though it does
have an outdated comment about PROT_NONE which should be removed).
So I think that site probably does need PROT_NONE checking added.

Hugh


Re: mm: BUG in unmap_page_range

2014-09-08 Thread Sasha Levin
On 09/08/2014 01:18 PM, Mel Gorman wrote:
> A worse possibility is that somehow the lock is getting corrupted but
> that's also a tough sell considering that the locks should be allocated
> from a dedicated cache. I guess I could try breaking that to allocate
> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
> optimistic.

I did see ptl corruption couple days ago:

https://lkml.org/lkml/2014/9/4/599

Could this be related?


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-09-08 Thread Mel Gorman
On Thu, Sep 04, 2014 at 05:04:37AM -0400, Sasha Levin wrote:
> On 08/29/2014 09:23 PM, Sasha Levin wrote:
> > On 08/27/2014 11:26 AM, Mel Gorman wrote:
> >> > diff --git a/include/asm-generic/pgtable.h 
> >> > b/include/asm-generic/pgtable.h
> >> > index 281870f..ffea570 100644
> >> > --- a/include/asm-generic/pgtable.h
> >> > +++ b/include/asm-generic/pgtable.h
> >> > @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
> >> >  
> >> >  VM_BUG_ON(!(val & _PAGE_PRESENT));
> >> >  
> >> > +/* debugging only, specific to x86 */
> >> > +VM_BUG_ON(val & _PAGE_PROTNONE);
> >> > +
> >> >  val &= ~_PAGE_PRESENT;
> >> >  val |= _PAGE_NUMA;
> > Triggered again, the first VM_BUG_ON got hit, the second one never did.
> 
> Okay, this bug has reproduced so many times since then that I no longer
> suspect it's random memory corruption. I'd be happy to try out more debug
> patches if you have any leads.
> 

The fact that the second one doesn't trigger makes me think that this is not
related to how the helpers are called and instead relates to timing.
I tried reproducing this but got nothing after 3 hours. How long does it
typically take to reproduce in a given run? You mentioned that it takes a
few weeks to hit but maybe the frequency has changed since. I tried today's
linux-next kernel but it didn't even boot, so I used next-20140826 to match
your original report, but got nothing. Can you also send me the config you
used in case that's a factor?

I had one hunch that this may somehow be related to a collision between
pagetable teardown during exit and the scanner but I could not find a
way that could actually happen. During teardown there should be only one
user of the mm and it can't race with itself.

A worse possibility is that somehow the lock is getting corrupted but
that's also a tough sell considering that the locks should be allocated
from a dedicated cache. I guess I could try breaking that to allocate
one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
optimistic.

-- 
Mel Gorman
SUSE Labs


Re: mm: BUG in unmap_page_range

2014-09-04 Thread Sasha Levin
On 08/29/2014 09:23 PM, Sasha Levin wrote:
> On 08/27/2014 11:26 AM, Mel Gorman wrote:
>> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>> > index 281870f..ffea570 100644
>> > --- a/include/asm-generic/pgtable.h
>> > +++ b/include/asm-generic/pgtable.h
>> > @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
>> >  
>> >VM_BUG_ON(!(val & _PAGE_PRESENT));
>> >  
>> > +  /* debugging only, specific to x86 */
>> > +  VM_BUG_ON(val & _PAGE_PROTNONE);
>> > +
>> >val &= ~_PAGE_PRESENT;
>> >val |= _PAGE_NUMA;
> Triggered again, the first VM_BUG_ON got hit, the second one never did.

Okay, this bug has reproduced so many times since then that I no longer
suspect it's random memory corruption. I'd be happy to try out more debug
patches if you have any leads.


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-08-29 Thread Sasha Levin
On 08/27/2014 11:26 AM, Mel Gorman wrote:
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 281870f..ffea570 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
>  
>   VM_BUG_ON(!(val & _PAGE_PRESENT));
>  
> + /* debugging only, specific to x86 */
> + VM_BUG_ON(val & _PAGE_PROTNONE);
> +
>   val &= ~_PAGE_PRESENT;
>   val |= _PAGE_NUMA;

Triggered again, the first VM_BUG_ON got hit, the second one never did.


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-08-27 Thread Sasha Levin
On 08/27/2014 11:26 AM, Mel Gorman wrote:
> Sasha, how long does it typically take to trigger this? Are you
> using any particular switches for trinity that would trigger the bug
> faster?

It took a couple of weeks (I've been running with it since the beginning
of August). I don't have any special trinity options, just the default
fuzzing. Do you think that focusing on any of the mm syscalls would
increase the odds of hitting it?

There's always the chance that this is a fluke due to corruption somewhere
else. I'll keep running it with the new debug patch, and if it doesn't
reproduce any time soon we can probably safely assume that.


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-08-27 Thread Mel Gorman
On Tue, Aug 26, 2014 at 11:16:47PM -0400, Sasha Levin wrote:
> On 08/11/2014 11:28 PM, Sasha Levin wrote:
> > On 08/05/2014 09:04 PM, Sasha Levin wrote:
> >> > Thanks Hugh, Mel. I've added both patches to my local tree and will 
> >> > update tomorrow
> >> > with the weather.
> >> > 
> >> > Also:
> >> > 
> >> > On 08/05/2014 08:42 PM, Hugh Dickins wrote:
> >>> >> One thing I did wonder, though: at first I was reassured by the
> >>> >> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> >>> >> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> >>> >> - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> >>> >> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
> >> > 
> >> > I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll
> >> > update how that one looks as well.
> > Sorry for the rather long delay.
> > 
> > The patch looks fine, the issue didn't reproduce.
> > 
> > The added VM_BUG_ON didn't trigger either, so maybe we should consider 
> > adding
> > it in.
> 
> It took a while, but I've managed to hit that VM_BUG_ON:
> 
> [  707.975456] kernel BUG at include/asm-generic/pgtable.h:724!
> [  707.977147] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  707.978974] Dumping ftrace buffer:
> [  707.980110](ftrace buffer empty)
> [  707.981221] Modules linked in:
> [  707.982312] CPU: 18 PID: 9488 Comm: trinity-c538 Not tainted 
> 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079
> [  707.982801] task: 880165e28000 ti: 880165e3 task.ti: 
> 880165e3
> [  707.982801] RIP: 0010:[]  [] 
> change_protection_range+0x94a/0x970
> [  707.982801] RSP: 0018:880165e33d98  EFLAGS: 00010246
> [  707.982801] RAX: 9d340902 RBX: 880511204a08 RCX: 
> 0100
> [  707.982801] RDX: 9d340902 RSI: 41741000 RDI: 
> 9d340902
> [  707.982801] RBP: 880165e33e88 R08: 880708a23c00 R09: 
> 00b52000
> [  707.982801] R10: 1e01 R11: 0008 R12: 
> 41751000
> [  707.982801] R13: 00f7 R14: 9d340902 R15: 
> 41741000
> [  707.982801] FS:  7f358a9aa700() GS:88071c60() 
> knlGS:
> [  707.982801] CS:  0010 DS:  ES:  CR0: 8005003b
> [  707.982801] CR2: 7f3586b69490 CR3: 000165d88000 CR4: 
> 06a0
> [  707.982801] Stack:
> [  707.982801]  8804db88d058  88070fb17cf0 
> 
> [  707.982801]  880165d88000  8801686a5000 
> 4163e000
> [  707.982801]  8801686a5000 0001 0025 
> 41750fff
> [  707.982801] Call Trace:
> [  707.982801]  [] change_protection+0x14/0x30
> [  707.982801]  [] change_prot_numa+0x1b/0x40
> [  707.982801]  [] task_numa_work+0x1f6/0x330
> [  707.982801]  [] task_work_run+0xc4/0xf0
> [  707.982801]  [] do_notify_resume+0x97/0xb0
> [  707.982801]  [] int_signal+0x12/0x17
> [  707.982801] Code: e8 2c 84 21 03 e9 72 ff ff ff 0f 1f 80 00 00 00 00 0f 0b 
> 48 8b 7d a8 4c 89 f2 4c 89 fe e8 9f 7b 03 00 e9 47 f9 ff ff 0f 0b 0f 0b <0f> 
> 0b 0f 0b 48 8b b5 70 ff ff ff 4c 89 ea 48 89 c7 e8 10 d5 01
> [  707.982801] RIP  [] change_protection_range+0x94a/0x970
> [  707.982801]  RSP 
> 

The tests to reach here are

pte_present    any of    _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA
pte_numa       only      _PAGE_NUMA out of _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA
VM_BUG_ON     not set   _PAGE_PRESENT

To trigger the bug the PTE bits must then be _PAGE_PROTNONE | _PAGE_NUMA. The
NUMA PTE scanner is skipping PROT_NONE VMAs so it should be "impossible"
for it to be set there. The mmap_sem is held for read during scans so
the protections should not be altering underneath us and the PTL is held
against parallel faults.

That leaves setting PROT_NONE leaving _PAGE_NUMA behind. Potentially
that's an issue due to

/* Set of bits not changed in pte_modify */
#define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
 _PAGE_SOFT_DIRTY | _PAGE_NUMA)

The _PAGE_NUMA bit is not cleared as removing it potentially leaves the
PTE in an unexpected state due to a "present" PTE marked for NUMA hinting
fault becoming non-present. Instead there is this check in change_pte_range()
to move PTEs to a known state before changing protections

	if (pte_numa(ptent))
		ptent = pte_mknonnuma(ptent);
	ptent = pte_modify(ptent, newprot);

So right now, I'm not seeing what path gets us to this inconsistent
state. Sasha, how long does it typically take to trigger this? Are you
using any particular switches for trinity that would trigger the bug
faster?

This untested patch might help pinpoint the source of the corruption
early though it's 


Re: mm: BUG in unmap_page_range

2014-08-26 Thread Sasha Levin
On 08/11/2014 11:28 PM, Sasha Levin wrote:
> On 08/05/2014 09:04 PM, Sasha Levin wrote:
>> > Thanks Hugh, Mel. I've added both patches to my local tree and will update 
>> > tomorrow
>> > with the weather.
>> > 
>> > Also:
>> > 
>> > On 08/05/2014 08:42 PM, Hugh Dickins wrote:
>>> >> One thing I did wonder, though: at first I was reassured by the
>>> >> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
>>> >> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
>>> >> - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
>>> >> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
>> > 
>> > I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll
>> > update how that one looks as well.
> Sorry for the rather long delay.
> 
> The patch looks fine, the issue didn't reproduce.
> 
> The added VM_BUG_ON didn't trigger either, so maybe we should consider adding
> it in.

It took a while, but I've managed to hit that VM_BUG_ON:

[  707.975456] kernel BUG at include/asm-generic/pgtable.h:724!
[  707.977147] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  707.978974] Dumping ftrace buffer:
[  707.980110](ftrace buffer empty)
[  707.981221] Modules linked in:
[  707.982312] CPU: 18 PID: 9488 Comm: trinity-c538 Not tainted 
3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079
[  707.982801] task: 880165e28000 ti: 880165e3 task.ti: 
880165e3
[  707.982801] RIP: 0010:[]  [] 
change_protection_range+0x94a/0x970
[  707.982801] RSP: 0018:880165e33d98  EFLAGS: 00010246
[  707.982801] RAX: 9d340902 RBX: 880511204a08 RCX: 0100
[  707.982801] RDX: 9d340902 RSI: 41741000 RDI: 9d340902
[  707.982801] RBP: 880165e33e88 R08: 880708a23c00 R09: 00b52000
[  707.982801] R10: 1e01 R11: 0008 R12: 41751000
[  707.982801] R13: 00f7 R14: 9d340902 R15: 41741000
[  707.982801] FS:  7f358a9aa700() GS:88071c60() 
knlGS:
[  707.982801] CS:  0010 DS:  ES:  CR0: 8005003b
[  707.982801] CR2: 7f3586b69490 CR3: 000165d88000 CR4: 06a0
[  707.982801] Stack:
[  707.982801]  8804db88d058  88070fb17cf0 

[  707.982801]  880165d88000  8801686a5000 
4163e000
[  707.982801]  8801686a5000 0001 0025 
41750fff
[  707.982801] Call Trace:
[  707.982801]  [] change_protection+0x14/0x30
[  707.982801]  [] change_prot_numa+0x1b/0x40
[  707.982801]  [] task_numa_work+0x1f6/0x330
[  707.982801]  [] task_work_run+0xc4/0xf0
[  707.982801]  [] do_notify_resume+0x97/0xb0
[  707.982801]  [] int_signal+0x12/0x17
[  707.982801] Code: e8 2c 84 21 03 e9 72 ff ff ff 0f 1f 80 00 00 00 00 0f 0b 
48 8b 7d a8 4c 89 f2 4c 89 fe e8 9f 7b 03 00 e9 47 f9 ff ff 0f 0b 0f 0b <0f> 0b 
0f 0b 48 8b b5 70 ff ff ff 4c 89 ea 48 89 c7 e8 10 d5 01
[  707.982801] RIP  [] change_protection_range+0x94a/0x970
[  707.982801]  RSP 


Thanks,
Sasha


Re: mm: BUG in unmap_page_range

2014-08-11 Thread Sasha Levin
On 08/05/2014 09:04 PM, Sasha Levin wrote:
> Thanks Hugh, Mel. I've added both patches to my local tree and will update 
> tomorrow
> with the weather.
> 
> Also:
> 
> On 08/05/2014 08:42 PM, Hugh Dickins wrote:
>> One thing I did wonder, though: at first I was reassured by the
>> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
>> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
>> - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
>> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
> 
> I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll
> update how that one looks as well.

Sorry for the rather long delay.

The patch looks fine, the issue didn't reproduce.

The added VM_BUG_ON didn't trigger either, so maybe we should consider adding
it in.


Thanks,
Sasha



Re: mm: BUG in unmap_page_range

2014-08-07 Thread Aneesh Kumar K.V
Mel Gorman  writes:

> On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote:
>> > -#define pmd_mknonnuma pmd_mknonnuma
>> > -static inline pmd_t pmd_mknonnuma(pmd_t pmd)
>> > +/*
>> > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
>> > + * which was inherited from x86. For the purposes of powerpc pte_basic_t 
>> > is
>> > + * equivalent
>> > + */
>> > +#define pteval_t pte_basic_t
>> > +#define pmdval_t pmd_t
>> > +static inline pteval_t pte_flags(pte_t pte)
>> >  {
>> > -  return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
>> > +  return pte_val(pte) & PAGE_PROT_BITS;
>> 
>> PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have
>> to check further to find out why the mask doesn't include
>> _PAGE_PRESENT. 
>> 
>
> Dumb of me, not sure how I managed that. For the purposes of what is required
> it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask
> that defines what bits are of interest to the generic helpers which is what
> this version attempts to do. It's not tested on powerpc at all
> unfortunately.


Boot tested on ppc64.

# grep numa /proc/vmstat 
numa_hit 156722
numa_miss 0
numa_foreign 0
numa_interleave 6365
numa_local 153457
numa_other 3265
numa_pte_updates 169
numa_huge_pte_updates 0
numa_hint_faults 150
numa_hint_faults_local 138
numa_pages_migrated 10

>
> ---8<---
> mm: Remove misleading ARCH_USES_NUMA_PROT_NONE
>
> ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
> _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
> relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
> fault scanner. This was found to be conceptually confusing with a lot of
> implicit assumptions and it was asked that an alternative be found.
>
> Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
> PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap
> PTE bits and shrunk the maximum possible swap size but it did not go far
> enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
> but the relics still exist.
>
> This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
> duplication in powerpc vs the generic implementation by defining the types
> the core NUMA helpers expected to exist from x86 with their ppc64 equivalent.
> This necessitated that a PTE bit mask be created that identified the bits
> that distinguish present from NUMA pte entries but it is expected this
> will only differ between arches based on _PAGE_PROTNONE. The naming for
> the generic helpers was taken from x86 originally but ppc64 has types that
> are equivalent for the purposes of the helper so they are mapped instead
> of duplicating code.
>
> Signed-off-by: Mel Gorman 
> ---
>  arch/powerpc/include/asm/pgtable.h| 57 
> ---
>  arch/powerpc/include/asm/pte-common.h |  5 +++
>  arch/x86/Kconfig  |  1 -
>  arch/x86/include/asm/pgtable_types.h  |  7 +
>  include/asm-generic/pgtable.h | 27 ++---
>  init/Kconfig  | 11 ---
>  6 files changed, 33 insertions(+), 75 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index d98c1ec..beeb09e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte)  { 
> return (pte_val(pte) & ~_PTE_NONE_MASK)
>  static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) 
> & PAGE_PROT_BITS); }
>
>  #ifdef CONFIG_NUMA_BALANCING
> -
>  static inline int pte_present(pte_t pte)
>  {
> - return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
> + return pte_val(pte) & _PAGE_NUMA_MASK;
>  }
>
>  #define pte_present_nonuma pte_present_nonuma
> @@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte)
>   return pte_val(pte) & (_PAGE_PRESENT);
>  }
>
> -#define pte_numa pte_numa
> -static inline int pte_numa(pte_t pte)
> -{
> - return (pte_val(pte) &
> - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> -}
> -
> -#define pte_mknonnuma pte_mknonnuma
> -static inline pte_t pte_mknonnuma(pte_t pte)
> -{
> - pte_val(pte) &= ~_PAGE_NUMA;
> - pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
> - return pte;
> -}
> -
> -#define pte_mknuma pte_mknuma
> -static inline pte_t pte_mknuma(pte_t pte)
> -{
> - /*
> -  * We should not set _PAGE_NUMA on non present ptes. Also clear the
> -  * present bit so that hash_page will return 1 and we collect this
> -  * as numa fault.
> -  */
> - if (pte_present(pte)) {
> - pte_val(pte) |= _PAGE_NUMA;
> - pte_val(pte) &= ~_PAGE_PRESENT;
> - } else
> - VM_BUG_ON(1);
> - return pte;
> -}
> -
>  #define ptep_set_numa ptep_set_numa
>  static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
>  

Re: mm: BUG in unmap_page_range

2014-08-06 Thread Mel Gorman
On Tue, Aug 05, 2014 at 05:42:03PM -0700, Hugh Dickins wrote:
> > 
> > 
> > I'm attaching a preliminary pair of patches. The first which deals with
> > ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised
> > changelog. I'm adding Aneesh to the cc to look at the powerpc portion of
> > the first patch.
> 
> Thanks a lot, Mel.
> 
> I am surprised by the ordering, but perhaps you meant nothing by it.

I didn't mean anything by it. It was based on the order I looked at the
patches in. Revisited c46a7c817, looked at ARCH_USES_NUMA_PROT_NONE issue
to see if it had any potential impact to your patch and then moved on to
your patch.

> Isn't the first one a welcome but optional cleanup, and the second one
> a fix that we need in 3.16-stable?  Or does the fix actually depend in
> some unstated way upon the cleanup, in powerpc-land perhaps?
> 

It shouldn't as powerpc can use its old helpers. I've included Aneesh in
the cc just in case.

> Aside from that, for the first patch: yes, I heartily approve of the
> disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and
> CONFIG_ARCH_USES_NUMA_PROT_NONE.  If you wish, add
> Acked-by: Hugh Dickins 
> but of course it's really Aneesh and powerpc who are the test of it.
> 

Thanks. I have a second version finished for that which I'll send once
this bug is addressed.

> One thing I did wonder, though: at first I was reassured by the
> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
> 

It shouldn't so I'll use the stronger test.

Sasha, if it's not too late would you mind testing this patch in isolation
as a -stable candidate for 3.16 please? It worked for me including within
trinity but then again I was not seeing crashes with 3.16 either so I do
not consider my trinity testing to be a reliable indicator.

---8<---
x86,mm: fix pte_special versus pte_numa

Sasha Levin has shown oopses on ea0003480048 and ea0003480008
at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
where zap_pte_range() checks page->mapping to see if PageAnon(page).

Those addresses fit struct pages for pfns d2001 and d2000, and in each
dump a register or a stack slot showed d2001730 or d2000730: pte flags
0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
a hole between cfff and 1, which would need special access.

Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
pte no longer passes the pte_special() test, so zap_pte_range() goes on
to try to access a non-existent struct page.

Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
A hint that this was a problem was that c46a7c817e66 added pte_numa()
test to vm_normal_page(), and moved its is_zero_pfn() test from slow to
fast path: This was papering over a pte_special() snag when the zero page
was encountered during zap. This patch reverts vm_normal_page() to how it
was before, relying on pte_special().

It still appears that this patch may be incomplete: aren't there other
places which need to be handling PROTNONE along with PRESENT?  For example,
pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE
area, that would make it pte_special(). This is side-stepped by the fact
that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds
where a NUMA hinting fault on a PROT_NONE VMA would be interesting.

Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the 
PMD and PTE levels")
Reported-by: Sasha Levin 
Signed-off-by: Hugh Dickins 
Signed-off-by: Mel Gorman 
Cc: sta...@vger.kernel.org [3.16]
---
 arch/x86/include/asm/pgtable.h | 9 +++--
 mm/memory.c| 7 +++
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ec0560..aa97a07 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-   return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) ==
-(_PAGE_PRESENT|_PAGE_SPECIAL);
+   /*
+* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
+* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
+* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
+*/
+   return (pte_flags(pte) & _PAGE_SPECIAL) &&
+   (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
diff --git a/mm/memory.c b/mm/memory.c
index 8b44f76..0a21f3d 100644
--- 

Re: mm: BUG in unmap_page_range

2014-08-06 Thread Mel Gorman
On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote:
> > -#define pmd_mknonnuma pmd_mknonnuma
> > -static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> > +/*
> > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
> > + * which was inherited from x86. For the purposes of powerpc pte_basic_t is
> > + * equivalent
> > + */
> > +#define pteval_t pte_basic_t
> > +#define pmdval_t pmd_t
> > +static inline pteval_t pte_flags(pte_t pte)
> >  {
> > -   return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
> > +   return pte_val(pte) & PAGE_PROT_BITS;
> 
> PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have
> to check further to find out why the mask doesn't include
> _PAGE_PRESENT. 
> 

Dumb of me, not sure how I managed that. For the purposes of what is required
it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask
that defines what bits are of interest to the generic helpers which is what
this version attempts to do. It's not tested on powerpc at all unfortunately.

---8<---
mm: Remove misleading ARCH_USES_NUMA_PROT_NONE

ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
_PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
fault scanner. This was found to be conceptually confusing with a lot of
implicit assumptions and it was asked that an alternative be found.

Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap
PTE bits and shrunk the maximum possible swap size but it did not go far
enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
but the relics still exist.

This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
duplication in powerpc vs the generic implementation by defining the types
the core NUMA helpers expected to exist from x86 with their ppc64 equivalent.
This necessitated that a PTE bit mask be created that identified the bits
that distinguish present from NUMA pte entries but it is expected this
will only differ between arches based on _PAGE_PROTNONE. The naming for
the generic helpers was taken from x86 originally but ppc64 has types that
are equivalent for the purposes of the helper so they are mapped instead
of duplicating code.

Signed-off-by: Mel Gorman 
---
 arch/powerpc/include/asm/pgtable.h| 57 ---
 arch/powerpc/include/asm/pte-common.h |  5 +++
 arch/x86/Kconfig  |  1 -
 arch/x86/include/asm/pgtable_types.h  |  7 +
 include/asm-generic/pgtable.h | 27 ++---
 init/Kconfig  | 11 ---
 6 files changed, 33 insertions(+), 75 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index d98c1ec..beeb09e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte) & ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
-
 static inline int pte_present(pte_t pte)
 {
-   return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
+   return pte_val(pte) & _PAGE_NUMA_MASK;
 }
 
 #define pte_present_nonuma pte_present_nonuma
@@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte)
return pte_val(pte) & (_PAGE_PRESENT);
 }
 
-#define pte_numa pte_numa
-static inline int pte_numa(pte_t pte)
-{
-   return (pte_val(pte) &
-   (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-
-#define pte_mknonnuma pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-   pte_val(pte) &= ~_PAGE_NUMA;
-   pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
-   return pte;
-}
-
-#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-   /*
-* We should not set _PAGE_NUMA on non present ptes. Also clear the
-* present bit so that hash_page will return 1 and we collect this
-* as numa fault.
-*/
-   if (pte_present(pte)) {
-   pte_val(pte) |= _PAGE_NUMA;
-   pte_val(pte) &= ~_PAGE_PRESENT;
-   } else
-   VM_BUG_ON(1);
-   return pte;
-}
-
 #define ptep_set_numa ptep_set_numa
 static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep)
@@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, 
unsigned long addr,
return;
 }
 
-#define pmd_numa pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-   return pte_numa(pmd_pte(pmd));
-}
-
 #define pmdp_set_numa pmdp_set_numa
 static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
 pmd_t *pmdp)
@@ -109,16 +71,21 @@ static inline void 

Re: mm: BUG in unmap_page_range

2014-08-06 Thread Aneesh Kumar K.V
Mel Gorman  writes:

> From d0c77a2b497da46c52792ead066d461e5111a594 Mon Sep 17 00:00:00 2001
> From: Mel Gorman 
> Date: Tue, 5 Aug 2014 12:06:50 +0100
> Subject: [PATCH] mm: Remove misleading ARCH_USES_NUMA_PROT_NONE
>
> ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
> _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
> relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
> fault scanner. This was found to be conceptually confusing with a lot of
> implicit assumptions and it was asked that an alternative be found.
>
> Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
> PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap
> PTE bits and shrunk the maximum possible swap size but it did not go far
> enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
> but the relics still exist.
>
> This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
> duplication in powerpc vs the generic implementation by defining the types
> the core NUMA helpers expected to exist from x86 with their ppc64 equivalent.
> The unification for ppc64 is less than ideal because types do not exist
> that the "generic" code expects to. This patch works around the problem
> but it would be preferred if the powerpc people would look at this to see
> if they have opinions on what might suit them better.
>
> Signed-off-by: Mel Gorman 
> ---
>  arch/powerpc/include/asm/pgtable.h | 55 
> --
>  arch/x86/Kconfig   |  1 -
>  include/asm-generic/pgtable.h  | 35 
>  init/Kconfig   | 11 
>  4 files changed, 29 insertions(+), 73 deletions(-)
>



> -
>  #define pmdp_set_numa pmdp_set_numa
>  static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
>pmd_t *pmdp)
> @@ -109,16 +71,21 @@ static inline void pmdp_set_numa(struct mm_struct *mm, 
> unsigned long addr,
>   return;
>  }
>  
> -#define pmd_mknonnuma pmd_mknonnuma
> -static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> +/*
> + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
> + * which was inherited from x86. For the purposes of powerpc pte_basic_t is
> + * equivalent
> + */
> +#define pteval_t pte_basic_t
> +#define pmdval_t pmd_t
> +static inline pteval_t pte_flags(pte_t pte)
>  {
> - return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
> + return pte_val(pte) & PAGE_PROT_BITS;

PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have
to check further to find out why the mask doesn't include
_PAGE_PRESENT. 


>  }
>  
> -#define pmd_mknuma pmd_mknuma
> -static inline pmd_t pmd_mknuma(pmd_t pmd)
> +static inline pteval_t pmd_flags(pte_t pte)
>  {


static inline pmdval_t ?

> - return pte_pmd(pte_mknuma(pmd_pte(pmd)));
> + return pmd_val(pte) & PAGE_PROT_BITS;
>  }
>  

-aneesh



Re: mm: BUG in unmap_page_range

2014-08-06 Thread Mel Gorman
On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote:
  -#define pmd_mknonnuma pmd_mknonnuma
  -static inline pmd_t pmd_mknonnuma(pmd_t pmd)
  +/*
  + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
  + * which was inherited from x86. For the purposes of powerpc pte_basic_t is
  + * equivalent
  + */
  +#define pteval_t pte_basic_t
  +#define pmdval_t pmd_t
  +static inline pteval_t pte_flags(pte_t pte)
   {
  -   return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
  +   return pte_val(pte)  PAGE_PROT_BITS;
 
 PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have
 to check further to find out why the mask doesn't include
 _PAGE_PRESENT. 
 

Dumb of me, not sure how I managed that. For the purposes of what is required
it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask
that defines what bits are of interest to the generic helpers which is what
this version attempts to do. It's not tested on powerpc at all unfortunately.

---8---
mm: Remove misleading ARCH_USES_NUMA_PROT_NONE

ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
_PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
fault scanner. This was found to be conceptually confusing with a lot of
implicit assumptions and it was asked that an alternative be found.

Commit c46a7c81 x86: define _PAGE_NUMA by reusing software bits on the
PMD and PTE levels redefined _PAGE_NUMA on x86 to be one of the swap
PTE bits and shrunk the maximum possible swap size but it did not go far
enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
but the relics still exist.

This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
duplication in powerpc vs the generic implementation by defining the types
the core NUMA helpers expected to exist from x86 with their ppc64 equivalent.
This necessitated that a PTE bit mask be created that identified the bits
that distinguish present from NUMA pte entries but it is expected this
will only differ between arches based on _PAGE_PROTNONE. The naming for
the generic helpers was taken from x86 originally but ppc64 has types that
are equivalent for the purposes of the helper so they are mapped instead
of duplicating code.

Signed-off-by: Mel Gorman mgor...@suse.de
---
 arch/powerpc/include/asm/pgtable.h| 57 ---
 arch/powerpc/include/asm/pte-common.h |  5 +++
 arch/x86/Kconfig  |  1 -
 arch/x86/include/asm/pgtable_types.h  |  7 +
 include/asm-generic/pgtable.h | 27 ++---
 init/Kconfig  | 11 ---
 6 files changed, 33 insertions(+), 75 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index d98c1ec..beeb09e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte){ 
return (pte_val(pte)  ~_PTE_NONE_MASK)
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
 PAGE_PROT_BITS); }
 
 #ifdef CONFIG_NUMA_BALANCING
-
 static inline int pte_present(pte_t pte)
 {
-   return pte_val(pte)  (_PAGE_PRESENT | _PAGE_NUMA);
+   return pte_val(pte)  _PAGE_NUMA_MASK;
 }
 
 #define pte_present_nonuma pte_present_nonuma
@@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte)
return pte_val(pte)  (_PAGE_PRESENT);
 }
 
-#define pte_numa pte_numa
-static inline int pte_numa(pte_t pte)
-{
-   return (pte_val(pte) 
-   (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
-}
-
-#define pte_mknonnuma pte_mknonnuma
-static inline pte_t pte_mknonnuma(pte_t pte)
-{
-   pte_val(pte) = ~_PAGE_NUMA;
-   pte_val(pte) |=  _PAGE_PRESENT | _PAGE_ACCESSED;
-   return pte;
-}
-
-#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte)
-{
-   /*
-* We should not set _PAGE_NUMA on non present ptes. Also clear the
-* present bit so that hash_page will return 1 and we collect this
-* as numa fault.
-*/
-   if (pte_present(pte)) {
-   pte_val(pte) |= _PAGE_NUMA;
-   pte_val(pte) &= ~_PAGE_PRESENT;
-   } else
-   VM_BUG_ON(1);
-   return pte;
-}
-
 #define ptep_set_numa ptep_set_numa
 static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep)
@@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, 
unsigned long addr,
return;
 }
 
-#define pmd_numa pmd_numa
-static inline int pmd_numa(pmd_t pmd)
-{
-   return pte_numa(pmd_pte(pmd));
-}
-
 #define pmdp_set_numa pmdp_set_numa
 static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr,
 pmd_t *pmdp)
@@ -109,16 +71,21 @@ static inline void pmdp_set_numa(struct 

Re: mm: BUG in unmap_page_range

2014-08-06 Thread Mel Gorman
On Tue, Aug 05, 2014 at 05:42:03PM -0700, Hugh Dickins wrote:
> <SNIP>
> 
> > I'm attaching a preliminary pair of patches. The first which deals with
> > ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised
> > changelog. I'm adding Aneesh to the cc to look at the powerpc portion of
> > the first patch.
> 
> Thanks a lot, Mel.
> 
> I am surprised by the ordering, but perhaps you meant nothing by it.

I didn't mean anything by it. It was based on the order I looked at the
patches in. Revisited c46a7c817, looked at ARCH_USES_NUMA_PROT_NONE issue
to see if it had any potential impact to your patch and then moved on to
your patch.

> Isn't the first one a welcome but optional cleanup, and the second one
> a fix that we need in 3.16-stable?  Or does the fix actually depend in
> some unstated way upon the cleanup, in powerpc-land perhaps?
> 

It shouldn't as powerpc can use its old helpers. I've included Aneesh in
the cc just in case.

> Aside from that, for the first patch: yes, I heartily approve of the
> disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and
> CONFIG_ARCH_USES_NUMA_PROT_NONE.  If you wish, add
> Acked-by: Hugh Dickins hu...@google.com
> but of course it's really Aneesh and powerpc who are the test of it.
> 

Thanks. I have a second version finished for that which I'll send once
this bug is addressed.

> One thing I did wonder, though: at first I was reassured by the
> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
> 

It shouldn't so I'll use the stronger test.

Sasha, if it's not too late would you mind testing this patch in isolation
as a -stable candidate for 3.16 please? It worked for me including within
trinity but then again I was not seeing crashes with 3.16 either so I do
not consider my trinity testing to be a reliable indicator.

---8<---
x86,mm: fix pte_special versus pte_numa

Sasha Levin has shown oopses on ea0003480048 and ea0003480008
at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
where zap_pte_range() checks page->mapping to see if PageAnon(page).

Those addresses fit struct pages for pfns d2001 and d2000, and in each
dump a register or a stack slot showed d2001730 or d2000730: pte flags
0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
a hole between cfff and 1, which would need special access.

Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
pte no longer passes the pte_special() test, so zap_pte_range() goes on
to try to access a non-existent struct page.

Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
A hint that this was a problem was that c46a7c817e66 added pte_numa()
test to vm_normal_page(), and moved its is_zero_pfn() test from slow to
fast path: This was papering over a pte_special() snag when the zero page
was encountered during zap. This patch reverts vm_normal_page() to how it
was before, relying on pte_special().

It still appears that this patch may be incomplete: aren't there other
places which need to be handling PROTNONE along with PRESENT?  For example,
pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE
area, that would make it pte_special(). This is side-stepped by the fact
that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds
where a NUMA hinting fault on a PROT_NONE VMA would be interesting.

Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the
PMD and PTE levels")
Reported-by: Sasha Levin sasha.le...@oracle.com
Signed-off-by: Hugh Dickins hu...@google.com
Signed-off-by: Mel Gorman mgor...@suse.de
Cc: sta...@vger.kernel.org [3.16]
---
 arch/x86/include/asm/pgtable.h | 9 +++--
 mm/memory.c| 7 +++
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ec0560..aa97a07 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-   return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) ==
-   (_PAGE_PRESENT|_PAGE_SPECIAL);
+   /*
+* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
+* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
+* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
+*/
+   return (pte_flags(pte) & _PAGE_SPECIAL) &&
+   (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
diff --git a/mm/memory.c b/mm/memory.c
index 

Re: mm: BUG in unmap_page_range

2014-08-05 Thread Sasha Levin
Thanks Hugh, Mel. I've added both patches to my local tree and will update 
tomorrow
with the weather.

Also:

On 08/05/2014 08:42 PM, Hugh Dickins wrote:
> One thing I did wonder, though: at first I was reassured by the
> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)

I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll
update how that one looks as well.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: BUG in unmap_page_range

2014-08-05 Thread Hugh Dickins
On Tue, 5 Aug 2014, Mel Gorman wrote:
> On Mon, Aug 04, 2014 at 04:40:38AM -0700, Hugh Dickins wrote:
> > 
> > [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa
> > 
> > Sasha Levin has shown oopses on ea0003480048 and ea0003480008
> > at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
> > where zap_pte_range() checks page->mapping to see if PageAnon(page).
> > 
> > Those addresses fit struct pages for pfns d2001 and d2000, and in each
> > dump a register or a stack slot showed d2001730 or d2000730: pte flags
> > 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
> > a hole between cfff and 1, which would need special access.
> > 
> > Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
> > the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
> > pte no longer passes the pte_special() test, so zap_pte_range() goes on
> > to try to access a non-existent struct page.
> > 
> 
> :(
> 
> > Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
> > to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> > 
> > It's unclear why c46a7c817e66 added pte_numa() test to vm_normal_page(),
> > and moved its is_zero_pfn() test from slow to fast path: I suspect both
> > were papering over PROT_NONE issues seen with inadequate pte_special().
> > Revert vm_normal_page() to how it was before, relying on pte_special().
> > 
> 
> Rather than answering directly I updated your changelog
> 
> Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
> to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> 
> A hint that this was a problem was that c46a7c817e66 added pte_numa()
> test to vm_normal_page(), and moved its is_zero_pfn() test from slow to
> fast path: This was papering over a pte_special() snag when the zero
> page was encountered during zap. This patch reverts vm_normal_page()
> to how it was before, relying on pte_special().

Thanks, that's fine.

> 
> > I find it confusing, that the only example of ARCH_USES_NUMA_PROT_NONE
> > no longer uses PROTNONE for NUMA, but SPECIAL instead: update the
> > asm-generic comment a little, but that config option remains unhelpful.
> > 
> 
> ARCH_USES_NUMA_PROT_NONE should have been sent to the farm at the same time
> as that patch and by rights unified with the powerpc helpers. With the new
> _PAGE_NUMA bit, there is no reason they should have different implementations
> of pte_numa and related functions. Unfortunately unifying them is a little
> problematic due to differences in fundamental types. It could be done with
> #defines but I'm attaching a preliminary prototype to illustrate the issue.
> 
> > But more seriously, I think this patch is incomplete: aren't there
> > other places which need to be handling PROTNONE along with PRESENT?
> > For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA,
> > but on a PROT_NONE area, I think that will now make it pte_special()?
> > So it ought to clear _PAGE_PROTNONE too.  Or maybe we can never
> > pte_mknuma() on a PROT_NONE area - there would be no point?
> > 
> 
> We are depending on the fact that inaccessible VMAs are skipped by the
> NUMA hinting scanner.

Ah, okay.  And the other way round (mprotecting to PROT_NONE an area
which already contains _PAGE_NUMA ptes) already looked safe to me.

> 
> > Around here I began to wonder if it was just a mistake to have deserted
> > the PROTNONE for NUMA model: I know Linus had a strong reaction against
> > it, and I've never delved into its drawbacks myself; but bringing yet
> > another (SPECIAL) flag into the game is not an obvious improvement.
> > Should we just revert c46a7c817e66, or would that be a mistake?
> > 
> 
> It's replacing one type of complexity with another. The downside is that
> _PAGE_NUMA == _PAGE_PROTNONE puts subtle traps all over the core for
> powerpc to fall foul of.

Okay.

> 
> I'm attaching a preliminary pair of patches. The first which deals with
> ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised
> changelog. I'm adding Aneesh to the cc to look at the powerpc portion of
> the first patch.

Thanks a lot, Mel.

I am surprised by the ordering, but perhaps you meant nothing by it.
Isn't the first one a welcome but optional cleanup, and the second one
a fix that we need in 3.16-stable?  Or does the fix actually depend in
some unstated way upon the cleanup, in powerpc-land perhaps?

Aside from that, for the first patch: yes, I heartily approve of the
disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and
CONFIG_ARCH_USES_NUMA_PROT_NONE.  If you wish, add
Acked-by: Hugh Dickins 
but of course it's really Aneesh and powerpc who are the test of it.

One thing I did wonder, though: at first I was reassured by the
VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), 

Re: mm: BUG in unmap_page_range

2014-08-05 Thread Mel Gorman
On Mon, Aug 04, 2014 at 04:40:38AM -0700, Hugh Dickins wrote:
> On Sat, 2 Aug 2014, Sasha Levin wrote:
> 
> > Hi all,
> > 
> > While fuzzing with trinity inside a KVM tools guest running the latest -next
> > kernel, I've stumbled on the following spew:
> > 
> > [ 2957.087977] BUG: unable to handle kernel paging request at 
> > ea0003480008
> > [ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
> > mm/memory.c:1277 mm/memory.c:1301)
> > [ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0
> > [ 2957.088041] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > [ 2957.088087] Dumping ftrace buffer:
> > [ 2957.088266](ftrace buffer empty)
> > [ 2957.088279] Modules linked in:
> > [ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted 
> > 3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990
> > [ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: 
> > 880739fb4000
> > [ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
> > mm/memory.c:1277 mm/memory.c:1301)
> > [ 2957.088328] RSP: 0018:880739fb7c58  EFLAGS: 00010246
> > [ 2957.088336] RAX:  RBX: 880eb2bdbed8 RCX: 
> > dfff971b4280
> > [ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: 
> > ea0003480008
> > [ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: 
> > 00b6e000
> > [ 2957.088357] R10:  R11: 0001 R12: 
> > ea000348
> > [ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: 
> > 7f00e85db000
> > [ 2957.088374] FS:  7f00e85d8700() GS:88177fa0() 
> > knlGS:
> > [ 2957.088381] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: 
> > 06a0
> > [ 2957.088406] DR0:  DR1:  DR2: 
> > 
> > [ 2957.088413] DR3:  DR6: 0ff0 DR7: 
> > 0600
> > [ 2957.088416] Stack:
> > [ 2957.088432]  88171726d570 0010 0008 
> > d2000730
> > [ 2957.088450]  19d00250 7f00e85dc000 880f9d311900 
> > 880739fb7e20
> > [ 2957.088466]  8807a8c507a0 8807a8c5 8807a75fe000 
> > 8807ceaa7a10
> > [ 2957.088469] Call Trace:
> > [ 2957.088490] unmap_single_vma (mm/memory.c:1348)
> > [ 2957.088505] unmap_vmas (mm/memory.c:1375 (discriminator 3))
> > [ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4))
> > [ 2957.088542] ? vma_rb_erase (mm/mmap.c:454 
> > include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 
> > mm/mmap.c:493)
> > [ 2957.088559] ? vmacache_update (mm/vmacache.c:61)
> > [ 2957.088572] do_munmap (mm/mmap.c:2581)
> > [ 2957.088583] vm_munmap (mm/mmap.c:2596)
> > [ 2957.088595] SyS_munmap (mm/mmap.c:2601)
> > [ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541)
> > [ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 
> > 0f 84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 
> > <41> f6 44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd
> > All code
> > 
> >0:   ff  (bad)
> >1:   ff e8   ljmpq  *
> >3:   f9  stc
> >4:   5f  pop%rdi
> >5:   07  (bad)
> >6:   00 48 8badd%cl,-0x75(%rax)
> >9:   45 90   rex.RB xchg %eax,%r8d
> >b:   80 48 18 01 orb$0x1,0x18(%rax)
> >f:   4d 85 e4test   %r12,%r12
> >   12:   0f 84 8b fe ff ff   je 0xfea3
> >   18:   45 84 edtest   %r13b,%r13b
> >   1b:   0f 85 fc 03 00 00   jne0x41d
> >   21:   49 8d 7c 24 08  lea0x8(%r12),%rdi
> >   26:   e8 b5 67 07 00  callq  0x767e0
> >   2b:*  41 f6 44 24 08 01   testb  $0x1,0x8(%r12)   <-- 
> > trapping instruction
> >   31:   0f 84 29 02 00 00   je 0x260
> >   37:   83 6d c8 01 subl   $0x1,-0x38(%rbp)
> >   3b:   4c 89 e7mov%r12,%rdi
> >   3e:   e8  .byte 0xe8
> >   3f:   bd  .byte 0xbd
> 
> This differs in which functions got inlined (unmap_page_range showing up
> in place of zap_pte_range), but this is the same "if (PageAnon(page))"
> that Sasha reported in the "hang in shmem_fallocate" thread on June 26th.
> 
> I can see what it is now, and here is most of a patch (which I don't
> expect to satisfy Trinity yet); at this point I think I had better
> hand it over to Mel, to complete or to discard.
> 
> [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa
> 
> Sasha Levin has shown oopses on ea0003480048 and ea0003480008
> at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
> where zap_pte_range() checks 

Re: mm: BUG in unmap_page_range

2014-08-04 Thread Hugh Dickins
On Sat, 2 Aug 2014, Sasha Levin wrote:

> Hi all,
> 
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following spew:
> 
> [ 2957.087977] BUG: unable to handle kernel paging request at ea0003480008
> [ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
> mm/memory.c:1277 mm/memory.c:1301)
> [ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0
> [ 2957.088041] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 2957.088087] Dumping ftrace buffer:
> [ 2957.088266](ftrace buffer empty)
> [ 2957.088279] Modules linked in:
> [ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted 
> 3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990
> [ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: 
> 880739fb4000
> [ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
> mm/memory.c:1277 mm/memory.c:1301)
> [ 2957.088328] RSP: 0018:880739fb7c58  EFLAGS: 00010246
> [ 2957.088336] RAX:  RBX: 880eb2bdbed8 RCX: 
> dfff971b4280
> [ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: 
> ea0003480008
> [ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: 
> 00b6e000
> [ 2957.088357] R10:  R11: 0001 R12: 
> ea000348
> [ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: 
> 7f00e85db000
> [ 2957.088374] FS:  7f00e85d8700() GS:88177fa0() 
> knlGS:
> [ 2957.088381] CS:  0010 DS:  ES:  CR0: 80050033
> [ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: 
> 06a0
> [ 2957.088406] DR0:  DR1:  DR2: 
> 
> [ 2957.088413] DR3:  DR6: 0ff0 DR7: 
> 0600
> [ 2957.088416] Stack:
> [ 2957.088432]  88171726d570 0010 0008 
> d2000730
> [ 2957.088450]  19d00250 7f00e85dc000 880f9d311900 
> 880739fb7e20
> [ 2957.088466]  8807a8c507a0 8807a8c5 8807a75fe000 
> 8807ceaa7a10
> [ 2957.088469] Call Trace:
> [ 2957.088490] unmap_single_vma (mm/memory.c:1348)
> [ 2957.088505] unmap_vmas (mm/memory.c:1375 (discriminator 3))
> [ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4))
> [ 2957.088542] ? vma_rb_erase (mm/mmap.c:454 
> include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 
> mm/mmap.c:493)
> [ 2957.088559] ? vmacache_update (mm/vmacache.c:61)
> [ 2957.088572] do_munmap (mm/mmap.c:2581)
> [ 2957.088583] vm_munmap (mm/mmap.c:2596)
> [ 2957.088595] SyS_munmap (mm/mmap.c:2601)
> [ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541)
> [ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 0f 
> 84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 <41> 
> f6 44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd
> All code
> 
>0: ff  (bad)
>1: ff e8   ljmpq  *
>3: f9  stc
>4: 5f  pop%rdi
>5: 07  (bad)
>6: 00 48 8badd%cl,-0x75(%rax)
>9: 45 90   rex.RB xchg %eax,%r8d
>b: 80 48 18 01 orb$0x1,0x18(%rax)
>f: 4d 85 e4test   %r12,%r12
>   12: 0f 84 8b fe ff ff   je 0xfea3
>   18: 45 84 edtest   %r13b,%r13b
>   1b: 0f 85 fc 03 00 00   jne0x41d
>   21: 49 8d 7c 24 08  lea0x8(%r12),%rdi
>   26: e8 b5 67 07 00  callq  0x767e0
>   2b:*41 f6 44 24 08 01   testb  $0x1,0x8(%r12)   <-- 
> trapping instruction
>   31: 0f 84 29 02 00 00   je 0x260
>   37: 83 6d c8 01 subl   $0x1,-0x38(%rbp)
>   3b: 4c 89 e7mov%r12,%rdi
>   3e: e8  .byte 0xe8
>   3f: bd  .byte 0xbd

This differs in which functions got inlined (unmap_page_range showing up
in place of zap_pte_range), but this is the same "if (PageAnon(page))"
that Sasha reported in the "hang in shmem_fallocate" thread on June 26th.

I can see what it is now, and here is most of a patch (which I don't
expect to satisfy Trinity yet); at this point I think I had better
hand it over to Mel, to complete or to discard.

[INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa

Sasha Levin has shown oopses on ea0003480048 and ea0003480008
at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
where zap_pte_range() checks page->mapping to see if PageAnon(page).

Those addresses fit struct pages for pfns d2001 and d2000, and in each
dump a register or a stack slot showed d2001730 or d2000730: pte flags
0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
a hole between cfff and 1, which would need special access.

Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing 

Re: mm: BUG in unmap_page_range

2014-08-04 Thread Hugh Dickins
On Sat, 2 Aug 2014, Sasha Levin wrote:

 Hi all,
 
 While fuzzing with trinity inside a KVM tools guest running the latest -next
 kernel, I've stumbled on the following spew:
 
 [ 2957.087977] BUG: unable to handle kernel paging request at ea0003480008
 [ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
 mm/memory.c:1277 mm/memory.c:1301)
 [ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0
 [ 2957.088041] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 [ 2957.088087] Dumping ftrace buffer:
 [ 2957.088266](ftrace buffer empty)
 [ 2957.088279] Modules linked in:
 [ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted 
 3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990
 [ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: 
 880739fb4000
 [ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
 mm/memory.c:1277 mm/memory.c:1301)
 [ 2957.088328] RSP: 0018:880739fb7c58  EFLAGS: 00010246
 [ 2957.088336] RAX:  RBX: 880eb2bdbed8 RCX: 
 dfff971b4280
 [ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: 
 ea0003480008
 [ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: 
 00b6e000
 [ 2957.088357] R10:  R11: 0001 R12: 
 ea000348
 [ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: 
 7f00e85db000
 [ 2957.088374] FS:  7f00e85d8700() GS:88177fa0() 
 knlGS:
 [ 2957.088381] CS:  0010 DS:  ES:  CR0: 80050033
 [ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: 
 06a0
 [ 2957.088406] DR0:  DR1:  DR2: 
 
 [ 2957.088413] DR3:  DR6: 0ff0 DR7: 
 0600
 [ 2957.088416] Stack:
 [ 2957.088432]  88171726d570 0010 0008 
 d2000730
 [ 2957.088450]  19d00250 7f00e85dc000 880f9d311900 
 880739fb7e20
 [ 2957.088466]  8807a8c507a0 8807a8c5 8807a75fe000 
 8807ceaa7a10
 [ 2957.088469] Call Trace:
 [ 2957.088490] unmap_single_vma (mm/memory.c:1348)
 [ 2957.088505] unmap_vmas (mm/memory.c:1375 (discriminator 3))
 [ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4))
 [ 2957.088542] ? vma_rb_erase (mm/mmap.c:454 
 include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 
 mm/mmap.c:493)
 [ 2957.088559] ? vmacache_update (mm/vmacache.c:61)
 [ 2957.088572] do_munmap (mm/mmap.c:2581)
 [ 2957.088583] vm_munmap (mm/mmap.c:2596)
 [ 2957.088595] SyS_munmap (mm/mmap.c:2601)
 [ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541)
 [ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 0f 
84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 <41> 
f6 44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd
 All code
 
0: ff  (bad)
1: ff e8   ljmpq  *<internal disassembler error>
3: f9  stc
4: 5f  pop%rdi
5: 07  (bad)
6: 00 48 8badd%cl,-0x75(%rax)
9: 45 90   rex.RB xchg %eax,%r8d
b: 80 48 18 01 orb$0x1,0x18(%rax)
f: 4d 85 e4test   %r12,%r12
   12: 0f 84 8b fe ff ff   je 0xfea3
   18: 45 84 edtest   %r13b,%r13b
   1b: 0f 85 fc 03 00 00   jne0x41d
   21: 49 8d 7c 24 08  lea0x8(%r12),%rdi
   26: e8 b5 67 07 00  callq  0x767e0
   2b:*41 f6 44 24 08 01   testb  $0x1,0x8(%r12)   <-- 
 trapping instruction
   31: 0f 84 29 02 00 00   je 0x260
   37: 83 6d c8 01 subl   $0x1,-0x38(%rbp)
   3b: 4c 89 e7mov%r12,%rdi
   3e: e8  .byte 0xe8
   3f: bd  .byte 0xbd

This differs in which functions got inlined (unmap_page_range showing up
in place of zap_pte_range), but this is the same if (PageAnon(page))
that Sasha reported in the hang in shmem_fallocate thread on June 26th.
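For readers following along: the faulting PageAnon(page) test reads page->mapping, which sits at offset 8 into struct page on 64-bit, matching the ...008 tail of the fault address. A minimal user-space sketch of that access pattern, with simplified stand-ins for the kernel's struct page and PAGE_MAPPING_ANON (not the kernel definitions verbatim):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct page: flags first,
 * then mapping at offset 8. A bogus struct page pointer plus this
 * offset is consistent with the oops address ending in ...008. */
struct page {
    unsigned long flags;   /* offset 0 */
    void *mapping;         /* offset 8 on 64-bit */
};

#define PAGE_MAPPING_ANON 0x1UL

/* PageAnon() boils down to reading page->mapping and testing the
 * low bit; if "page" points into an unpopulated vmemmap region,
 * this load is the access that faults. */
static int page_anon(const struct page *page)
{
    return ((uintptr_t)page->mapping & PAGE_MAPPING_ANON) != 0;
}
```

The point of the sketch is only the load: the crash happens before the anon/file distinction matters, because the struct page itself is not backed by memory.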

I can see what it is now, and here is most of a patch (which I don't
expect to satisfy Trinity yet); at this point I think I had better
hand it over to Mel, to complete or to discard.

[INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa

Sasha Levin has shown oopses on ea0003480048 and ea0003480008
at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
where zap_pte_range() checks page->mapping to see if PageAnon(page).

Those addresses fit struct pages for pfns d2001 and d2000, and in each
dump a register or a stack slot showed d2001730 or d2000730: pte flags
0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
a hole between cfff and 1, which would need special access.
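To make the flag decode above concrete, here is a sketch using the x86 pte bit values implied by that decode (assumed constants, not the kernel headers verbatim): flags 0x730 have _PAGE_PROTNONE and _PAGE_SPECIAL set with _PAGE_PRESENT clear, so a NUMA-hinting test that masks only PRESENT and PROTNONE would misread this special IOMAP pte as a NUMA pte, while adding _PAGE_SPECIAL to the mask rejects it:

```c
#include <assert.h>

/* x86 pte flag bits as decoded above (assumed values) */
#define _PAGE_PRESENT   0x001UL
#define _PAGE_PCD       0x010UL
#define _PAGE_ACCESSED  0x020UL
#define _PAGE_PROTNONE  0x100UL  /* bit reused for NUMA hinting */
#define _PAGE_SPECIAL   0x200UL
#define _PAGE_IOMAP     0x400UL

#define _PAGE_NUMA      _PAGE_PROTNONE

/* Naive check: "not present but protnone" => NUMA hinting pte.
 * A special IOMAP pte (0x730) satisfies this by accident. */
static int pte_numa_naive(unsigned long flags)
{
    return (flags & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}

/* Hardened check: a pte marked special is never a NUMA pte. */
static int pte_numa_fixed(unsigned long flags)
{
    return (flags & (_PAGE_NUMA | _PAGE_PRESENT | _PAGE_SPECIAL))
            == _PAGE_NUMA;
}
```

Under these assumptions, a misclassified special pte would later be treated as a present page needing a struct page lookup, which lines up with the bogus vmemmap dereference in the oops.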

Commit c46a7c817e66 (x86: define _PAGE_NUMA by reusing software bits on
the PMD and PTE levels) has broken 

mm: BUG in unmap_page_range

2014-08-02 Thread Sasha Levin
Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next
kernel, I've stumbled on the following spew:

[ 2957.087977] BUG: unable to handle kernel paging request at ea0003480008
[ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
mm/memory.c:1277 mm/memory.c:1301)
[ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0
[ 2957.088041] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2957.088087] Dumping ftrace buffer:
[ 2957.088266](ftrace buffer empty)
[ 2957.088279] Modules linked in:
[ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted 
3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990
[ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: 
880739fb4000
[ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
mm/memory.c:1277 mm/memory.c:1301)
[ 2957.088328] RSP: 0018:880739fb7c58  EFLAGS: 00010246
[ 2957.088336] RAX:  RBX: 880eb2bdbed8 RCX: dfff971b4280
[ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: ea0003480008
[ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: 00b6e000
[ 2957.088357] R10:  R11: 0001 R12: ea000348
[ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: 7f00e85db000
[ 2957.088374] FS:  7f00e85d8700() GS:88177fa0() 
knlGS:
[ 2957.088381] CS:  0010 DS:  ES:  CR0: 80050033
[ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: 06a0
[ 2957.088406] DR0:  DR1:  DR2: 
[ 2957.088413] DR3:  DR6: 0ff0 DR7: 0600
[ 2957.088416] Stack:
[ 2957.088432]  88171726d570 0010 0008 
d2000730
[ 2957.088450]  19d00250 7f00e85dc000 880f9d311900 
880739fb7e20
[ 2957.088466]  8807a8c507a0 8807a8c5 8807a75fe000 
8807ceaa7a10
[ 2957.088469] Call Trace:
[ 2957.088490] unmap_single_vma (mm/memory.c:1348)
[ 2957.088505] unmap_vmas (mm/memory.c:1375 (discriminator 3))
[ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4))
[ 2957.088542] ? vma_rb_erase (mm/mmap.c:454 
include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 
mm/mmap.c:493)
[ 2957.088559] ? vmacache_update (mm/vmacache.c:61)
[ 2957.088572] do_munmap (mm/mmap.c:2581)
[ 2957.088583] vm_munmap (mm/mmap.c:2596)
[ 2957.088595] SyS_munmap (mm/mmap.c:2601)
[ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541)
[ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 0f 
84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 <41> f6 
44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd
All code

   0:   ff  (bad)
   1:   ff e8   ljmpq  *<internal disassembler error>
   3:   f9  stc
   4:   5f  pop%rdi
   5:   07  (bad)
   6:   00 48 8badd%cl,-0x75(%rax)
   9:   45 90   rex.RB xchg %eax,%r8d
   b:   80 48 18 01 orb$0x1,0x18(%rax)
   f:   4d 85 e4test   %r12,%r12
  12:   0f 84 8b fe ff ff   je 0xfea3
  18:   45 84 edtest   %r13b,%r13b
  1b:   0f 85 fc 03 00 00   jne0x41d
  21:   49 8d 7c 24 08  lea0x8(%r12),%rdi
  26:   e8 b5 67 07 00  callq  0x767e0
  2b:*  41 f6 44 24 08 01   testb  $0x1,0x8(%r12)   <-- trapping 
instruction
  31:   0f 84 29 02 00 00   je 0x260
  37:   83 6d c8 01 subl   $0x1,-0x38(%rbp)
  3b:   4c 89 e7mov%r12,%rdi
  3e:   e8  .byte 0xe8
  3f:   bd  .byte 0xbd
...

Code starting with the faulting instruction
===
   0:   41 f6 44 24 08 01   testb  $0x1,0x8(%r12)
   6:   0f 84 29 02 00 00   je 0x235
   c:   83 6d c8 01 subl   $0x1,-0x38(%rbp)
  10:   4c 89 e7mov%r12,%rdi
  13:   e8  .byte 0xe8
  14:   bd  .byte 0xbd
...
[ 2957.088784] RIP unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 
mm/memory.c:1277 mm/memory.c:1301)
[ 2957.088789]  RSP 880739fb7c58
[ 2957.088794] CR2: ea0003480008


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

