Re: mm: BUG in unmap_page_range
On 09/11/2014 06:38 PM, Sasha Levin wrote:
> On 09/11/2014 12:28 PM, Mel Gorman wrote:
>> Agreed. If 3.17-rc4 looks stable with the VM_BUG_ON then it would be
>> really nice if you could bisect 3.17-rc4 to linux-next carrying the
>> VM_BUG_ON(!(val & _PAGE_PRESENT)) check at each bisection point. I'm not
>> 100% sure if I'm seeing the same corruption as you or some other issue and
>> do not want to conflate numerous different problems into one. I know this
>> is a pain in the ass but if 3.17-rc4 looks stable then a bisection might
>> be faster overall than my constant head scratching :(
>
> The good news is that 3.17-rc4 seems to be stable. I'll start the bisection,
> which I suspect will take several days. I'll update when I run into
> something.

I might need a bit of help here. The bisection is going sideways because I
can't reliably reproduce the issue. We don't know what's causing this issue,
but we know what the symptoms are. Is there a VM_BUG_ON we could add
somewhere so that it would be more likely to trigger?

Thanks,
Sasha

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: mm: BUG in unmap_page_range
On 09/11/2014 12:28 PM, Mel Gorman wrote:
> Agreed. If 3.17-rc4 looks stable with the VM_BUG_ON then it would be
> really nice if you could bisect 3.17-rc4 to linux-next carrying the
> VM_BUG_ON(!(val & _PAGE_PRESENT)) check at each bisection point. I'm not
> 100% sure if I'm seeing the same corruption as you or some other issue and
> do not want to conflate numerous different problems into one. I know this
> is a pain in the ass but if 3.17-rc4 looks stable then a bisection might
> be faster overall than my constant head scratching :(

The good news is that 3.17-rc4 seems to be stable. I'll start the bisection,
which I suspect will take several days. I'll update when I run into
something.

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On Thu, Sep 11, 2014 at 04:39:39AM -0700, Hugh Dickins wrote:
> On Wed, 10 Sep 2014, Sasha Levin wrote:
>> On 09/10/2014 03:36 PM, Hugh Dickins wrote:
>>> Right, and Sasha reports that that can fire, but he sees the bug
>>> with this patch in and without that firing.
>>
>> I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
>> VMA information out, and got the following:
>
> Well, thanks, but Mel and I have both failed to perceive any actual
> problem arising from that peculiarity. And Mel's warning, and the 900s
> in yesterday's dumps, have shown that it is not correlated with the
> pte_mknuma() bug we are chasing. So there isn't anything that I want to
> look up in these vmas. Or did you notice something interesting in them?
>
>> And on a maybe related note, I've started seeing the following today. It
>> may be because we fixed mbind() in trinity but it could also be related
>> to this issue (free_pgtables() is in the call chain).
>
> The fixed trinity may be counter-productive for now, since we think
> there is an understandable pte_mknuma() bug coming from that direction,
> but have not posted a patch for it yet.
>
>> If you don't think it has anything to do with it let me know and I'll
>> start a new thread:
>>
>> [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at (null)
>> [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 lib/rbtree.c:229 lib/rbtree.c:367)
>> [ 1196.001744] Call Trace:
>> [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24)
>> [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232)
>> [ 1196.001744] unlink_file_vma (mm/mmap.c:246)
>> [ 1196.001744] free_pgtables (mm/memory.c:547)
>> [ 1196.001744] exit_mmap (mm/mmap.c:2826)
>> [ 1196.001744] mmput (kernel/fork.c:654)
>> [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:461 kernel/exit.c:746)
>
> I didn't study it in any detail, but this one seems much more like the
> zeroing and vma corruption that you've been seeing in other dumps.

I didn't look through the dumps closely today because I spent the time
putting together a KVM setup similar to Sasha's (many CPUs, fake NUMA, etc.)
so I could run trinity in it in another attempt to reproduce this. I did not
encounter the same VM_BUG_ON, unfortunately. However, trinity itself crashed
after 2.5 hours complaining:

  [watchdog] pid 32188 has disappeared. Reaping.
  [watchdog] pid 32024 has disappeared. Reaping.
  [watchdog] pid 32300 has disappeared. Reaping.
  [watchdog] Sanity check failed! Found pid 0 at pidslot 35!

This did not happen when running on bare metal. This error makes me wonder
whether it is evidence that there is zeroing corruption occurring when
running inside KVM. Another possibility is that it's somehow related to fake
NUMA, although it's hard to see how. It's still possible the bug is in the
page table handling and KVM affects timing enough to cause problems, so I'm
not ruling that out.

> Though a single pte_mknuma() crash could presumably be caused by vma
> corruption (but I think not mere zeroing), the recurrent way in which
> you hit that pte_mknuma() bug in particular makes it unlikely to be
> caused by random corruption.
>
> You are generating new crashes faster than we can keep up with them.
> Would this be a suitable point for you to switch over to testing
> 3.17-rc, to see if that is as unstable for you as -next is?
>
> That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree,
> but I think you can "safely" add it to 3.17-rc. Quotes around
> "safely" meaning that we know that there's a bug to hit, at least
> in -next, but I don't think it's going to be hit for stupid obvious
> reasons.

Agreed. If 3.17-rc4 looks stable with the VM_BUG_ON then it would be
really nice if you could bisect 3.17-rc4 to linux-next carrying the
VM_BUG_ON(!(val & _PAGE_PRESENT)) check at each bisection point. I'm not
100% sure if I'm seeing the same corruption as you or some other issue and
do not want to conflate numerous different problems into one. I know this
is a pain in the ass but if 3.17-rc4 looks stable then a bisection might
be faster overall than my constant head scratching :(

--
Mel Gorman
SUSE Labs
Re: mm: BUG in unmap_page_range
On Thu, Sep 11, 2014 at 10:22:42AM -0400, Sasha Levin wrote:
>> The fixed trinity may be counter-productive for now, since we think
>> there is an understandable pte_mknuma() bug coming from that direction,
>> but have not posted a patch for it yet.
>
> I'm still seeing the bug with fixed trinity; it was a matter of adding more
> flags to mbind.

What did I miss? Anything not in the MPOL_MF_VALID mask should be -EINVAL.

	Dave
Re: mm: BUG in unmap_page_range
On 09/11/2014 07:39 AM, Hugh Dickins wrote:
> On Wed, 10 Sep 2014, Sasha Levin wrote:
>> On 09/10/2014 03:36 PM, Hugh Dickins wrote:
>>> Right, and Sasha reports that that can fire, but he sees the bug
>>> with this patch in and without that firing.
>>
>> I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
>> VMA information out, and got the following:
>
> Well, thanks, but Mel and I have both failed to perceive any actual
> problem arising from that peculiarity. And Mel's warning, and the 900s
> in yesterday's dumps, have shown that it is not correlated with the
> pte_mknuma() bug we are chasing. So there isn't anything that I want to
> look up in these vmas. Or did you notice something interesting in them?

I thought this was a separate issue that would need taking care of as well.

>> And on a maybe related note, I've started seeing the following today. It
>> may be because we fixed mbind() in trinity but it could also be related
>> to this issue (free_pgtables() is in the call chain).
>
> The fixed trinity may be counter-productive for now, since we think
> there is an understandable pte_mknuma() bug coming from that direction,
> but have not posted a patch for it yet.

I'm still seeing the bug with fixed trinity; it was a matter of adding more
flags to mbind.

>> If you don't think it has anything to do with it let me know and I'll
>> start a new thread:
>>
>> [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at (null)
>> [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 lib/rbtree.c:229 lib/rbtree.c:367)
>> [ 1196.001744] Call Trace:
>> [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24)
>> [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232)
>> [ 1196.001744] unlink_file_vma (mm/mmap.c:246)
>> [ 1196.001744] free_pgtables (mm/memory.c:547)
>> [ 1196.001744] exit_mmap (mm/mmap.c:2826)
>> [ 1196.001744] mmput (kernel/fork.c:654)
>> [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:461 kernel/exit.c:746)
>
> I didn't study it in any detail, but this one seems much more like the
> zeroing and vma corruption that you've been seeing in other dumps.
>
> Though a single pte_mknuma() crash could presumably be caused by vma
> corruption (but I think not mere zeroing), the recurrent way in which
> you hit that pte_mknuma() bug in particular makes it unlikely to be
> caused by random corruption.
>
> You are generating new crashes faster than we can keep up with them.
> Would this be a suitable point for you to switch over to testing
> 3.17-rc, to see if that is as unstable for you as -next is?
>
> That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree,
> but I think you can "safely" add it to 3.17-rc. Quotes around
> "safely" meaning that we know that there's a bug to hit, at least
> in -next, but I don't think it's going to be hit for stupid obvious
> reasons.

I'll try it; usually I just hit a bunch of issues that were already fixed
in -next, which is why I try sticking to one tree.

> And you're using a gcc 5 these days? That's another variable to
> try removing from the mix, to see if it makes a difference.

I'm seeing the BUG getting hit with 4.7.2, so I don't think it's compiler
dependent. I'll try reproducing everything I reported yesterday with 4.7.2
just in case, but I don't think that this is the issue.

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Sasha Levin wrote:
> On 09/10/2014 03:36 PM, Hugh Dickins wrote:
>> Right, and Sasha reports that that can fire, but he sees the bug
>> with this patch in and without that firing.
>
> I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
> VMA information out, and got the following:

Well, thanks, but Mel and I have both failed to perceive any actual
problem arising from that peculiarity. And Mel's warning, and the 900s
in yesterday's dumps, have shown that it is not correlated with the
pte_mknuma() bug we are chasing. So there isn't anything that I want to
look up in these vmas. Or did you notice something interesting in them?

> And on a maybe related note, I've started seeing the following today. It
> may be because we fixed mbind() in trinity but it could also be related to

The fixed trinity may be counter-productive for now, since we think
there is an understandable pte_mknuma() bug coming from that direction,
but have not posted a patch for it yet.

> this issue (free_pgtables() is in the call chain). If you don't think it
> has anything to do with it let me know and I'll start a new thread:
>
> [ 1195.996803] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 1196.001744] IP: __rb_erase_color (include/linux/rbtree_augmented.h:107 lib/rbtree.c:229 lib/rbtree.c:367)
> [ 1196.001744] Call Trace:
> [ 1196.001744] vma_interval_tree_remove (mm/interval_tree.c:24)
> [ 1196.001744] __remove_shared_vm_struct (mm/mmap.c:232)
> [ 1196.001744] unlink_file_vma (mm/mmap.c:246)
> [ 1196.001744] free_pgtables (mm/memory.c:547)
> [ 1196.001744] exit_mmap (mm/mmap.c:2826)
> [ 1196.001744] mmput (kernel/fork.c:654)
> [ 1196.001744] do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:461 kernel/exit.c:746)

I didn't study in any detail, but this one seems much more like the
zeroing and vma corruption that you've been seeing in other dumps.

Though a single pte_mknuma() crash could presumably be caused by vma
corruption (but I think not mere zeroing), the recurrent way in which
you hit that pte_mknuma() bug in particular makes it unlikely to be
caused by random corruption.

You are generating new crashes faster than we can keep up with them.
Would this be a suitable point for you to switch over to testing
3.17-rc, to see if that is as unstable for you as -next is?

That VM_BUG_ON(!(val & _PAGE_PRESENT)) is not in the 3.17-rc tree,
but I think you can "safely" add it to 3.17-rc. Quotes around
"safely" meaning that we know that there's a bug to hit, at least
in -next, but I don't think it's going to be hit for stupid obvious
reasons.

And you're using a gcc 5 these days? That's another variable to
try removing from the mix, to see if it makes a difference.

Hugh
Re: mm: BUG in unmap_page_range
On 09/10/2014 03:36 PM, Hugh Dickins wrote:
>> migrate: debug patch to try identify race between migration completion
>> and mprotect
>>
>> A migration entry is marked as write if pte_write was true at the
>> time the entry was created. The VMA protections are not double checked
>> when migration entries are being removed but mprotect itself will mark
>> write-migration-entries as read to avoid problems. It means we potentially
>> take a spurious fault to mark these ptes write again but otherwise it's
>> harmless. Still, one dump indicates that this situation can actually
>> happen so this debugging patch spits out a warning if the situation occurs
>> and hopefully the resulting warning will contain a clue as to how exactly
>> it happens
>>
>> Not-signed-off
>> ---
>>  mm/migrate.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 09d489c..631725c 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>>  	if (pte_swp_soft_dirty(*ptep))
>>  		pte = pte_mksoft_dirty(pte);
>> -	if (is_write_migration_entry(entry))
>> -		pte = pte_mkwrite(pte);
>> +	if (is_write_migration_entry(entry)) {
>> +		/*
>> +		 * This WARN_ON_ONCE is temporary for the purposes of seeing if
>> +		 * it's a case encountered by trinity in Sasha's testing
>> +		 */
>> +		if (!(vma->vm_flags & (VM_WRITE)))
>> +			WARN_ON_ONCE(1);
>> +		else
>> +			pte = pte_mkwrite(pte);
>> +	}
>>  #ifdef CONFIG_HUGETLB_PAGE
>>  	if (PageHuge(new)) {
>>  		pte = pte_mkhuge(pte);
>
> Right, and Sasha reports that that can fire, but he sees the bug
> with this patch in and without that firing.

I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful
VMA information out, and got the following:

[ 4018.870776] vma 8801a0f1e800 start 7f3fd0ca7000 end 7f3fd16a7000
[ 4018.870776] next 8804e1b89800 prev 88008cd9a000 mm 88054b17d000
[ 4018.870776] prot 120 anon_vma 880bc858a200 vm_ops (null)
[ 4018.870776] pgoff 41bc8 file (null) private_data (null)
[ 4018.879731] flags: 0x8100070(mayread|maywrite|mayexec|account)
[ 4018.881324] ------------[ cut here ]------------
[ 4018.882612] kernel BUG at mm/migrate.c:155!
[ 4018.883649] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 4018.889647] Dumping ftrace buffer:
[ 4018.890323]    (ftrace buffer empty)
[ 4018.890323] Modules linked in:
[ 4018.890323] CPU: 4 PID: 9966 Comm: trinity-main Tainted: G    W    3.17.0-rc4-next-20140910-sasha-00042-ga4bad9b-dirty #1140
[ 4018.890323] task: 880695b83000 ti: 880560c44000 task.ti: 880560c44000
[ 4018.890323] RIP: 0010:[] [] remove_migration_pte+0x3e1/0x3f0
[ 4018.890323] RSP: :880560c477c8 EFLAGS: 00010292
[ 4018.890323] RAX: 0001 RBX: 7f3fd129b000 RCX:
[ 4018.890323] RDX: 0001 RSI: 9e4ba395 RDI: 0001
[ 4018.890323] RBP: 880560c47800 R08: 0001 R09: 0001
[ 4018.890323] R10: 00045401 R11: 0001 R12: 8801a0f1e800
[ 4018.890323] R13: 88054b17d000 R14: ea000478eb40 R15: 880122bcf070
[ 4018.890323] FS: 7f3fd55bb700() GS:8803d6a0() knlGS:
[ 4018.890323] CS: 0010 DS: ES: CR0: 8005003b
[ 4018.890323] CR2: 00fcbca8 CR3: 000561bab000 CR4: 06a0
[ 4018.890323] DR0: 006f DR1: DR2:
[ 4018.890323] DR3: DR6: 0ff0 DR7: 0600
[ 4018.890323] Stack:
[ 4018.890323]  ea00046ed980 88011079c4d8 ea000478eb40 880560c47858
[ 4018.890323]  88019fde0330 000421bc 8801a0f1e800 880560c47848
[ 4018.890323]  9b2d1b0f 880bc858a200 880560c47850 ea000478eb40
[ 4018.890323] Call Trace:
[ 4018.890323] [] rmap_walk+0x22f/0x380
[ 4018.890323] [] remove_migration_ptes+0x41/0x50
[ 4018.890323] [] ? __migration_entry_wait.isra.24+0x160/0x160
[ 4018.890323] [] ? remove_migration_pte+0x3f0/0x3f0
[ 4018.890323] [] move_to_new_page+0x16b/0x230
[ 4018.890323] [] ? try_to_unmap+0x6c/0xf0
[ 4018.890323] [] ? try_to_unmap_nonlinear+0x5c0/0x5c0
[ 4018.890323] [] ? invalid_migration_vma+0x30/0x30
[ 4018.890323] [] ? page_remove_rmap+0x320/0x320
[ 4018.890323] [] migrate_pages+0x85c/0x930
[ 4018.890323] [] ? isolate_freepages_block+0x410/0x410
[ 4018.890323] [] ? arch_local_save_flags+0x30/0x30
[ 4018.890323] [] compact_zone+0x4d3/0x8a0
[ 4018.890323] [] compact_zone_order+0x5f/0xa0
[ 4018.890323] [] try_to_compact_pages+0x127/0x2f0
[
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Sasha Levin wrote: > On 09/10/2014 03:09 PM, Hugh Dickins wrote: > > Thanks for supplying, but the change in inlining means that > > change_protection_range() and change_protection() are no longer > > relevant for these traces, we now need to see change_pte_range() > > instead, to confirm that what I expect are ptes are indeed ptes. > > > > If you can include line numbers (objdump -ld) in the disassembly, so > > much the better, but should be decipherable without. (Or objdump -Sd > > for source, but I often find that harder to unscramble, can't say why.) > > Here it is. Note that the source includes both of Mel's debug patches. > For reference, here's one trace of the issue with those patches: > > [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724! > [ 3114.541857] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 3114.543112] Dumping ftrace buffer: > [ 3114.544056](ftrace buffer empty) > [ 3114.545000] Modules linked in: > [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW > 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 > [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: > 88076f584000 > [ 3114.549284] RIP: 0010:[] [] > change_pte_range+0x4ea/0x4f0 > [ 3114.550028] RSP: :88076f587d68 EFLAGS: 00010246 > [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: > 0100 > [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: > 000314625900 > [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: > 00b5 > [ 3114.550028] R10: 00032c01 R11: 0008 R12: > 8802a81070c0 > [ 3114.550028] R13: 8025 R14: 41343000 R15: > cfff > [ 3114.550028] FS: 7fabb91c8700() GS:88025ec0() > knlGS: > [ 3114.550028] CS: 0010 DS: ES: CR0: 8005003b > [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: > 06a0 > [ 3114.550028] DR0: 006f DR1: DR2: > > [ 3114.550028] DR3: DR6: 0ff0 DR7: > 00050602 > [ 3114.550028] Stack: > [ 3114.550028] 0001 000314625900 0018 > 8802685f2260 > [ 3114.550028] 1684 8802cf973600 88061684 > 41343000 > [ 
3114.550028] 880108805048 41005000 4120 > 41343000 > [ 3114.550028] Call Trace: > [ 3114.550028] [] change_protection+0x2b4/0x4e0 > [ 3114.550028] [] change_prot_numa+0x1b/0x40 > [ 3114.550028] [] task_numa_work+0x1f6/0x330 > [ 3114.550028] [] task_work_run+0xc4/0xf0 > [ 3114.550028] [] do_notify_resume+0x97/0xb0 > [ 3114.550028] [] int_signal+0x12/0x17 > [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff > ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> > 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 > [ 3114.550028] RIP [] change_pte_range+0x4ea/0x4f0 > [ 3114.550028] RSP > > And the disassembly: ... > /home/sasha/linux-next/mm/mprotect.c:105 > 31d: 48 8b 4d a8 mov-0x58(%rbp),%rcx > 321: 81 e1 01 03 00 00 and$0x301,%ecx > 327: 48 81 f9 00 02 00 00cmp$0x200,%rcx > 32e: 0f 84 0b ff ff ff je 23f > pte_val(): > /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450 > 334: 48 83 3d 00 00 00 00cmpq $0x0,0x0(%rip)# 33c > > 33b: 00 > 337: R_X86_64_PC32 pv_mmu_ops+0xe3 > ptep_set_numa(): > /home/sasha/linux-next/include/asm-generic/pgtable.h:740 > 33c: 49 8b 3c 24 mov(%r12),%rdi > pte_val(): > /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450 > 340: 0f 84 12 01 00 00 je 458 > 346: ff 14 25 00 00 00 00callq *0x0 > 349: R_X86_64_32S pv_mmu_ops+0xe8 > pte_mknuma(): > /home/sasha/linux-next/include/asm-generic/pgtable.h:724 > 34d: a8 01 test $0x1,%al > 34f: 0f 84 95 01 00 00 je 4ea ... > ptep_set_numa(): > /home/sasha/linux-next/include/asm-generic/pgtable.h:724 > 4ea: 0f 0b ud2 Thanks, yes, there is enough in there to be sure that the ...900 is indeed the oldpte. I wasn't expecting that pv_mmu_ops function call, but there's no evidence that it does anything worse than just return in %rax what it's given in %rdi; and the second long on the stack is the -0x58(%rbp) from which oldpte is retrieved for !pte_numa(oldpte) at the beginning of the extract above. 
Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mm: BUG in unmap_page_range
On 09/10/2014 03:09 PM, Hugh Dickins wrote: > Thanks for supplying, but the change in inlining means that > change_protection_range() and change_protection() are no longer > relevant for these traces, we now need to see change_pte_range() > instead, to confirm that what I expect are ptes are indeed ptes. > > If you can include line numbers (objdump -ld) in the disassembly, so > much the better, but should be decipherable without. (Or objdump -Sd > for source, but I often find that harder to unscramble, can't say why.) Here it is. Note that the source includes both of Mel's debug patches. For reference, here's one trace of the issue with those patches: [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724! [ 3114.541857] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3114.543112] Dumping ftrace buffer: [ 3114.544056](ftrace buffer empty) [ 3114.545000] Modules linked in: [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 88076f584000 [ 3114.549284] RIP: 0010:[] [] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP: :88076f587d68 EFLAGS: 00010246 [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100 [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900 [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5 [ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0 [ 3114.550028] R13: 8025 R14: 41343000 R15: cfff [ 3114.550028] FS: 7fabb91c8700() GS:88025ec0() knlGS: [ 3114.550028] CS: 0010 DS: ES: CR0: 8005003b [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0 [ 3114.550028] DR0: 006f DR1: DR2: [ 3114.550028] DR3: DR6: 0ff0 DR7: 00050602 [ 3114.550028] Stack: [ 3114.550028] 0001 000314625900 0018 8802685f2260 [ 3114.550028] 1684 8802cf973600 88061684 41343000 [ 3114.550028] 880108805048 41005000 4120 41343000 [ 3114.550028] Call Trace: [ 3114.550028] [] change_protection+0x2b4/0x4e0 [ 3114.550028] [] 
change_prot_numa+0x1b/0x40 [ 3114.550028] [] task_numa_work+0x1f6/0x330 [ 3114.550028] [] task_work_run+0xc4/0xf0 [ 3114.550028] [] do_notify_resume+0x97/0xb0 [ 3114.550028] [] int_signal+0x12/0x17 [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 [ 3114.550028] RIP [] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP And the disassembly: : change_pte_range(): /home/sasha/linux-next/mm/mprotect.c:70 0: e8 00 00 00 00 callq 5 1: R_X86_64_PC32__fentry__-0x4 5: 55 push %rbp 6: 48 89 e5mov%rsp,%rbp 9: 41 57 push %r15 b: 41 56 push %r14 d: 49 89 cemov%rcx,%r14 10: 41 55 push %r13 12: 4d 89 c5mov%r8,%r13 15: 41 54 push %r12 17: 49 89 f4mov%rsi,%r12 1a: 53 push %rbx 1b: 48 89 d3mov%rdx,%rbx 1e: 48 83 ec 38 sub$0x38,%rsp /home/sasha/linux-next/mm/mprotect.c:71 22: 48 8b 47 40 mov0x40(%rdi),%rax /home/sasha/linux-next/mm/mprotect.c:70 26: 48 89 7d c8 mov%rdi,-0x38(%rbp) lock_pte_protection(): /home/sasha/linux-next/mm/mprotect.c:53 2a: 8b 4d 10mov0x10(%rbp),%ecx change_pte_range(): /home/sasha/linux-next/mm/mprotect.c:70 2d: 44 89 4d c4 mov%r9d,-0x3c(%rbp) /home/sasha/linux-next/mm/mprotect.c:71 31: 48 89 45 d0 mov%rax,-0x30(%rbp) lock_pte_protection(): /home/sasha/linux-next/mm/mprotect.c:53 35: 85 c9 test %ecx,%ecx 37: 0f 84 6b 03 00 00 je 3a8 pmd_to_page(): /home/sasha/linux-next/include/linux/mm.h:1538 3d: 48 89 f7mov%rsi,%rdi 40: 48 81 e7 00 f0 ff ffand$0xf000,%rdi 47: e8 00 00 00 00 callq 4c 48: R_X86_64_PC32 __phys_addr-0x4 4c: 48 ba 00 00 00 00 00movabs $0xea00,%rdx 53: ea ff ff 56: 48 c1 e8 0c shr$0xc,%rax spin_lock(): /home/sasha/linux-next/include/linux/spinlock.h:309 5a: 48 89 55 b8 mov%rdx,-0x48(%rbp) 5e: 48 c1 e0 06 shl$0x6,%rax 62: 4c 8b 7c 10 30 mov
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Mel Gorman wrote: > On Tue, Sep 09, 2014 at 07:45:26PM -0700, Hugh Dickins wrote: > > > > I've been rather assuming that the 9d340902 seen in many of the > > registers in that Aug26 dump is the pte val in question: that's > > SOFT_DIRTY|PROTNONE|RW. The 900s in the latest dumps imply that that 902 was not important. (If any of them are in fact the pte val.) > > > > I think RW on PROTNONE is unusual but not impossible (migration entry > > replacement racing with mprotect setting PROT_NONE, after it's updated > > vm_page_prot, before it's reached the page table). > > At the risk of sounding thick, I need to spell this out because I'm > having trouble seeing exactly what race you are thinking of. > > Migration entry replacement is protected against parallel NUMA hinting > updates by the page table lock (either PMD or PTE level). It's taken by > remove_migration_pte on one side and lock_pte_protection on the other. > > For the mprotect case racing again migration, migration entries are not > present so change_pte_range() should ignore it. On migration completion > the VMA flags determine the permissions of the new PTE. Parallel faults > wait on the migration entry and see the correct value afterwards. > > When creating migration entries, try_to_unmap calls page_check_address > which takes the PTL before doing anything. On the mprotect side, > lock_pte_protection will block before seeing PROTNONE. > > I think the race you are thinking of is a migration entry created for write, > parallel mprotect(PROTNONE) and migration completion. The migration entry > was created for write but remove_migration_pte does not double check the VMA > protections and mmap_sem is not taken for write across a full migration to > protect against changes to vm_page_prot. Yes, the "if (is_write_migration_entry(entry)) pte = pte_mkwrite(pte);" arguably should take the latest value of vma->vm_page_prot into account. 
> However, change_pte_range checks > for migration entries marked for write under the PTL and marks them read if > one is encountered. The consequence is that we potentially take a spurious > fault to mark the PTE write again after migration completes but I can't > see how that causes a problem as such. Yes, once mprotect's page table walk reaches that pte, it updates it correctly along with all the others nearby (which were not migrated), removing the temporary oddity. > > I'm missing some part of your reasoning that leads to the RW|PROTNONE :( You don't appear to be missing it at all, you are seeing the possibility of an RW|PROTNONE yourself, and how it gets "corrected" afterwards ("corrected" in quotes because without the present bit, it's not an error). > > > But exciting though > > that line of thought is, I cannot actually bring it to a pte_mknuma bug, > > or any bug at all. > > And I wasn't saying that it led to this bug, just that it was an oddity worth thinking about, and worth mentioning to you, in case you could work out a way it might lead to the bug, when I had failed to do so. But we now (almost) know that 902 is irrelevant to this bug anyway. > > On x86, PROTNONE|RW translates as GLOBAL|RW which would be unexpected. It's GLOBAL once PRESENT is set, but PROTNONE so long as it is not. > wouldn't cause this bug but it's sufficiently suspicious to be worth correcting. In case this is the race you're thinking of, the patch is below. > Unfortunately, I cannot see how it would affect this problem but worth > giving a whirl anyway. > > > Mel, no way can it be the cause of this bug - unless Sasha's later > > traces actually show a different stack - but I don't see the call > > to change_prot_numa() from queue_pages_range() sharing the same > > avoidance of PROT_NONE that task_numa_work() has (though it does > > have an outdated comment about PROT_NONE which should be removed). > > So I think that site probably does need PROT_NONE checking added.
> > > > That site should have checked PROT_NONE but it can't be the same bug > that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY > according to git grep of the trinity source. Yes, queue_pages_range() is not implicated in any of Sasha's traces. Something to fix, but not relevant to this bug. > > Worth adding this to the debugging mix? It should warn if it encounters > the problem but avoid adding the problematic RW bit. > > ---8<--- > migrate: debug patch to try identify race between migration completion and > mprotect > > A migration entry is marked as write if pte_write was true at the > time the entry was created. The VMA protections are not double checked > when migration entries are being removed but mprotect itself will mark > write-migration-entries as read to avoid problems. It means we potentially > take a spurious fault to mark these ptes write again but otherwise it's > harmless. Still, one dump indicates that this situation can actually > happen so this debugging patch spits out a warning if
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Sasha Levin wrote: > On 09/09/2014 10:45 PM, Hugh Dickins wrote: > > Sasha, you say you're getting plenty of these now, but I've only seen > > the dump for one of them, on Aug26: please post a few more dumps, so > > that we can look for commonality. > > I wasn't saving older logs for this issue so I only have 2 traces from > tonight. If that's not enough please let me know and I'll try to add > a few more. Thanks, these two are useful, mainly because the register contents most likely to be ptes are in both of these ...900, with no sign of a ...902. So the RW bit I got excited about yesterday is clearly not necessary for the bug (though it's still possible that it was good for implicating page migration, and that page migration may still play a part in the story). > > And please attach a disassembly of change_protection_range() (noting > > which of the dumps it corresponds to, in case it has changed around): > > "Code" just shows a cluster of ud2s for the unlikely bugs at end of the > > function, we cannot tell at all what should be in the registers by then. > > change_protection_range() got inlined into change_protection(), it applies to > both traces above: Thanks for supplying, but the change in inlining means that change_protection_range() and change_protection() are no longer relevant for these traces, we now need to see change_pte_range() instead, to confirm that what I expect are ptes are indeed ptes. If you can include line numbers (objdump -ld) in the disassembly, so much the better, but should be decipherable without. (Or objdump -Sd for source, but I often find that harder to unscramble, can't say why.) Thanks, Hugh
Re: mm: BUG in unmap_page_range
On 09/10/2014 08:47 AM, Mel Gorman wrote:
> migrate: debug patch to try identify race between migration completion and
> mprotect
>
> A migration entry is marked as write if pte_write was true at the
> time the entry was created. The VMA protections are not double checked
> when migration entries are being removed but mprotect itself will mark
> write-migration-entries as read to avoid problems. It means we potentially
> take a spurious fault to mark these ptes write again but otherwise it's
> harmless. Still, one dump indicates that this situation can actually
> happen so this debugging patch spits out a warning if the situation occurs
> and hopefully the resulting warning will contain a clue as to how exactly
> it happens
>
> Not-signed-off
> ---
>  mm/migrate.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 09d489c..631725c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (pte_swp_soft_dirty(*ptep))
>  		pte = pte_mksoft_dirty(pte);
> -	if (is_write_migration_entry(entry))
> -		pte = pte_mkwrite(pte);
> +	if (is_write_migration_entry(entry)) {
> +		/*
> +		 * This WARN_ON_ONCE is temporary for the purposes of seeing if
> +		 * it's a case encountered by trinity in Sasha's testing
> +		 */
> +		if (!(vma->vm_flags & (VM_WRITE)))
> +			WARN_ON_ONCE(1);
> +		else
> +			pte = pte_mkwrite(pte);
> +	}
>  #ifdef CONFIG_HUGETLB_PAGE
>  	if (PageHuge(new)) {
>  		pte = pte_mkhuge(pte);

I seem to have hit this warning: [ 4782.617806] WARNING: CPU: 10 PID: 21180 at mm/migrate.c:155 remove_migration_pte+0x3f7/0x420() [ 4782.619315] Modules linked in: [ 4782.622189] [ 4782.622501] CPU: 10 PID: 21180 Comm: trinity-main Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 4782.624344] 0009 8800193eb770 a04c742a [ 4782.627801] 8800193eb7a8 9d16e55d 7f2458d89000
880120959600 [ 4782.629283] 88012b02c000 ea002abeab00 88063118da90 8800193eb7b8 [ 4782.631353] Call Trace: [ 4782.633789] [] dump_stack+0x4e/0x7a [ 4782.634314] [] warn_slowpath_common+0x7d/0xa0 [ 4782.634877] [] warn_slowpath_null+0x1a/0x20 [ 4782.635430] [] remove_migration_pte+0x3f7/0x420 [ 4782.636042] [] rmap_walk+0xef/0x380 [ 4782.636544] [] remove_migration_ptes+0x41/0x50 [ 4782.637130] [] ? __migration_entry_wait.isra.24+0x160/0x160 [ 4782.639928] [] ? remove_migration_pte+0x420/0x420 [ 4782.640616] [] move_to_new_page+0x16b/0x230 [ 4782.641251] [] ? try_to_unmap+0x6c/0xf0 [ 4782.643950] [] ? try_to_unmap_nonlinear+0x5c0/0x5c0 [ 4782.644690] [] ? invalid_migration_vma+0x30/0x30 [ 4782.645273] [] ? page_remove_rmap+0x320/0x320 [ 4782.646072] [] migrate_pages+0x85c/0x930 [ 4782.646701] [] ? isolate_freepages_block+0x410/0x410 [ 4782.647407] [] ? arch_local_save_flags+0x30/0x30 [ 4782.648114] [] compact_zone+0x4d3/0x8a0 [ 4782.650157] [] compact_zone_order+0x5f/0xa0 [ 4782.651014] [] try_to_compact_pages+0x127/0x2f0 [ 4782.651656] [] __alloc_pages_direct_compact+0x68/0x200 [ 4782.652313] [] __alloc_pages_nodemask+0x99a/0xd90 [ 4782.652916] [] alloc_pages_vma+0x13c/0x270 [ 4782.653618] [] ? do_huge_pmd_wp_page+0x494/0xc90 [ 4782.654487] [] do_huge_pmd_wp_page+0x494/0xc90 [ 4782.656045] [] ? __mem_cgroup_count_vm_event+0xd0/0x240 [ 4782.657089] [] handle_mm_fault+0x8bd/0xc50 [ 4782.660931] [] ? __lock_is_held+0x56/0x80 [ 4782.662695] [] __do_page_fault+0x1b7/0x660 [ 4782.663259] [] ? put_lock_stats.isra.13+0xe/0x30 [ 4782.663851] [] ? vtime_account_user+0x91/0xa0 [ 4782.664419] [] ? context_tracking_user_exit+0xb5/0x1b0 [ 4782.665119] [] ? __this_cpu_preempt_check+0x13/0x20 [ 4782.665969] [] ? 
trace_hardirqs_off_caller+0xe2/0x1b0 [ 4782.34] [] trace_do_page_fault+0x51/0x2b0 [ 4782.667257] [] do_async_page_fault+0x63/0xd0 [ 4782.667871] [] async_page_fault+0x28/0x30 Although it wasn't followed by anything else, and I've seen the original issue getting triggered without this WARN showing up, so it seems like a different, unrelated issue? Thanks, Sasha
Re: mm: BUG in unmap_page_range
On 09/10/2014 09:40 AM, Mel Gorman wrote: > On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote: >> >> >> I've spotted a new trace in overnight fuzzing, it could be related to this >> issue: >> >> [ 3494.324839] general protection fault: [#1] PREEMPT SMP >> DEBUG_PAGEALLOC >> [ 3494.332153] Dumping ftrace buffer: >> [ 3494.332153](ftrace buffer empty) >> [ 3494.332153] Modules linked in: >> [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted >> 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135 >> [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: >> 8804d491c000 >> [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 >> kernel/sched/fair.c:1956) >> [ 3494.332153] RSP: :8804d491feb8 EFLAGS: 00010206 >> [ 3494.332153] RAX: RBX: 8804bf4e8000 RCX: >> e8e8 >> [ 3494.343974] RDX: 000a RSI: RDI: >> 8804bd6d4da8 >> [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: >> >> [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: >> 7f53e443c000 >> [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: >> 88047e52b000 >> [ 3494.343974] FS: 7f53e463f700() GS:880277e0() >> knlGS: >> [ 3494.343974] CS: 0010 DS: ES: CR0: 8005003b >> [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: >> 06a0 >> [ 3494.369895] DR0: 006f DR1: DR2: >> >> [ 3494.369895] DR3: DR6: 0ff0 DR7: >> 0600 >> [ 3494.380081] Stack: >> [ 3494.380081] 8804bf4e80a8 0014 7f53e4437000 >> >> [ 3494.380081] 9b976e70 88047e52bbd8 88047e52b000 >> >> [ 3494.380081] 8804d491ff28 95193d84 0002 >> 8804d491ff58 >> [ 3494.380081] Call Trace: >> [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1)) >> [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 >> arch/x86/kernel/signal.c:758) >> [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918) >> [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 >> c6 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 >> <49> f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 
00 00 > > Shot in dark, can you test this please? Pagetable teardown can schedule > and I'm wondering if we are trying to add hinting faults to an address > space that is in the process of going away. The TASK_DEAD check is bogus > so replacing it. Mel, I ran today's -next with both of your patches, but the issue still remains: [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724! [ 3114.541857] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3114.543112] Dumping ftrace buffer: [ 3114.544056](ftrace buffer empty) [ 3114.545000] Modules linked in: [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 88076f584000 [ 3114.549284] RIP: 0010:[] [] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP: :88076f587d68 EFLAGS: 00010246 [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100 [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900 [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5 [ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0 [ 3114.550028] R13: 8025 R14: 41343000 R15: cfff [ 3114.550028] FS: 7fabb91c8700() GS:88025ec0() knlGS: [ 3114.550028] CS: 0010 DS: ES: CR0: 8005003b [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0 [ 3114.550028] DR0: 006f DR1: DR2: [ 3114.550028] DR3: DR6: 0ff0 DR7: 00050602 [ 3114.550028] Stack: [ 3114.550028] 0001 000314625900 0018 8802685f2260 [ 3114.550028] 1684 8802cf973600 88061684 41343000 [ 3114.550028] 880108805048 41005000 4120 41343000 [ 3114.550028] Call Trace: [ 3114.550028] [] change_protection+0x2b4/0x4e0 [ 3114.550028] [] change_prot_numa+0x1b/0x40 [ 3114.550028] [] task_numa_work+0x1f6/0x330 [ 3114.550028] [] task_work_run+0xc4/0xf0 [ 3114.550028] [] do_notify_resume+0x97/0xb0 [ 3114.550028] [] int_signal+0x12/0x17 [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 
44 00 00 0f 0b <0f> 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 [ 3114.550028] RIP []
Re: Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)
On Wed, Sep 10, 2014 at 10:24:40AM -0400, Sasha Levin wrote: > On 09/10/2014 08:47 AM, Mel Gorman wrote: > > That site should have checked PROT_NONE but it can't be the same bug > > that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY > > according to git grep of the trinity source. > > Actually, if I'm reading it correctly I think that Trinity handles mbind() > calls wrong. It passes the wrong values for mode flags and actual flags. Ugh, I think you're right. I misinterpreted the man page that mentions that flags like MPOL_F_STATIC_NODES/RELATIVE_NODES are OR'd with the mode, and instead dumped those flags into .. the flags field. So the 'flags' argument it generates is crap, because I didn't add any of the actual correct values. I'll fix it up, though if it's currently finding bugs, you might want to keep the current syscalls/mbind.c for now. Dave
Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)
On 09/10/2014 08:47 AM, Mel Gorman wrote: > That site should have checked PROT_NONE but it can't be the same bug > that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY > according to git grep of the trinity source. Actually, if I'm reading it correctly I think that Trinity handles mbind() calls wrong. It passes the wrong values for mode flags and actual flags. Thanks, Sasha
Re: mm: BUG in unmap_page_range
On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote: > > > I've spotted a new trace in overnight fuzzing, it could be related to this > issue: > > [ 3494.324839] general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 3494.332153] Dumping ftrace buffer: > [ 3494.332153](ftrace buffer empty) > [ 3494.332153] Modules linked in: > [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted > 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135 > [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: > 8804d491c000 > [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 > kernel/sched/fair.c:1956) > [ 3494.332153] RSP: :8804d491feb8 EFLAGS: 00010206 > [ 3494.332153] RAX: RBX: 8804bf4e8000 RCX: > e8e8 > [ 3494.343974] RDX: 000a RSI: RDI: > 8804bd6d4da8 > [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: > > [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: > 7f53e443c000 > [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: > 88047e52b000 > [ 3494.343974] FS: 7f53e463f700() GS:880277e0() > knlGS: > [ 3494.343974] CS: 0010 DS: ES: CR0: 8005003b > [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: > 06a0 > [ 3494.369895] DR0: 006f DR1: DR2: > > [ 3494.369895] DR3: DR6: 0ff0 DR7: > 0600 > [ 3494.380081] Stack: > [ 3494.380081] 8804bf4e80a8 0014 7f53e4437000 > > [ 3494.380081] 9b976e70 88047e52bbd8 88047e52b000 > > [ 3494.380081] 8804d491ff28 95193d84 0002 > 8804d491ff58 > [ 3494.380081] Call Trace: > [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1)) > [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 > arch/x86/kernel/signal.c:758) > [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918) > [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 c6 > 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 <49> > f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00 Shot in dark, can you test this please? 
Pagetable teardown can schedule and I'm wondering if we are trying to add hinting faults to an address space that is in the process of going away. The TASK_DEAD check is bogus so replacing it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ea6006..007fc1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1810,7 +1810,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 		return;
 
 	/* Do not worry about placement if exiting */
-	if (p->state == TASK_DEAD)
+	if (p->flags & PF_EXITING)
 		return;
 
 	/* Allocate buffer to track faults on a per-node basis */
Re: mm: BUG in unmap_page_range
On 09/09/2014 10:45 PM, Hugh Dickins wrote: > Sasha, you say you're getting plenty of these now, but I've only seen > the dump for one of them, on Aug26: please post a few more dumps, so > that we can look for commonality. I wasn't saving older logs for this issue so I only have 2 traces from tonight. If that's not enough please let me know and I'll try to add a few more. [ 1125.600123] kernel BUG at include/asm-generic/pgtable.h:724! [ 1125.600123] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 1125.600123] Dumping ftrace buffer: [ 1125.600123](ftrace buffer empty) [ 1125.600123] Modules linked in: [ 1125.600123] CPU: 16 PID: 11903 Comm: trinity-c517 Not tainted 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135 [ 1125.600123] task: 88066173 ti: 880582c2 task.ti: 880582c2 [ 1125.600123] RIP: 0010:[] [] change_pte_range+0x4ea/0x4f0 [ 1125.600123] RSP: 0018:880582c23d68 EFLAGS: 00010246 [ 1125.600123] RAX: 000936d9a900 RBX: 7ffdb17c8000 RCX: 0100 [ 1125.600123] RDX: 000936d9a900 RSI: 7ffdb17c8000 RDI: 000936d9a900 [ 1125.600123] RBP: 880582c23dc8 R08: 8802a8f2d400 R09: 00b56000 [ 1125.600123] R10: 00020201 R11: 0008 R12: 88004dd6ee40 [ 1125.600123] R13: 8025 R14: 7ffdb180 R15: cfff [ 1125.600123] FS: 7ffdb6382700() GS:88027820() knlGS: [ 1125.600123] CS: 0010 DS: ES: CR0: 80050033 [ 1125.600123] CR2: 7ffdb617e60c CR3: 00050ff12000 CR4: 06a0 [ 1125.600123] DR0: 006f DR1: DR2: [ 1125.600123] DR3: DR6: 0ff0 DR7: 0600 [ 1125.600123] Stack: [ 1125.600123] 0001 000936d9a900 0046 8804bd549f40 [ 1125.600123] 1f989000 8802a8f2d400 88051f989000 7f9f40604cfdb1ac8000 [ 1125.600123] 88032fcc3c58 7ffdb16df000 7ffdb16df000 7ffdb180 [ 1125.600123] Call Trace: [ 1125.600123] [] change_protection+0x2b4/0x4e0 [ 1125.600123] [] change_prot_numa+0x1b/0x40 [ 1125.600123] [] task_numa_work+0x1f6/0x330 [ 1125.600123] [] task_work_run+0xc4/0xf0 [ 1125.600123] [] do_notify_resume+0x97/0xb0 [ 1125.600123] [] int_signal+0x12/0x17 [ 1125.600123] Code: 66 90 48 8b 7d b8 e8 f6 75 22 03 48 8b 
45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 [ 1125.600123] RIP [] change_pte_range+0x4ea/0x4f0 [ 1125.600123] RSP [ 3131.084176] kernel BUG at include/asm-generic/pgtable.h:724! [ 3131.087358] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3131.090143] Dumping ftrace buffer: [ 3131.090143](ftrace buffer empty) [ 3131.090143] Modules linked in: [ 3131.090143] CPU: 8 PID: 20595 Comm: trinity-c34 Not tainted 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135 [ 3131.090143] task: 8801ded6 ti: 8803204ec000 task.ti: 8803204ec000 [ 3131.090143] RIP: 0010:[] [] change_pte_range+0x4ea/0x4f0 [ 3131.090143] RSP: :8803204efd68 EFLAGS: 00010246 [ 3131.090143] RAX: 000971bba900 RBX: 7ffda1d4d000 RCX: 0100 [ 3131.090143] RDX: 000971bba900 RSI: 7ffda1d4d000 RDI: 000971bba900 [ 3131.120281] RBP: 8803204efdc8 R08: 88026bed8800 R09: 00b48000 [ 3131.120281] R10: 00076501 R11: 0008 R12: 8801ca071a68 [ 3131.120281] R13: 8025 R14: 7ffda1dbf000 R15: cfff [ 3131.120281] FS: 7ffda5cd4700() GS:880277e0() knlGS: [ 3131.120281] CS: 0010 DS: ES: CR0: 80050033 [ 3131.120281] CR2: 025d6000 CR3: 0004bcde2000 CR4: 06a0 [ 3131.120281] Stack: [ 3131.120281] 0001 000971bba900 005c 8800661a7b60 [ 3131.120281] f4953000 88026bed8800 8801f4953000 7ffda1dbf000 [ 3131.120281] 8802b3319870 7ffda1c1b000 7ffda1c1b000 7ffda1dbf000 [ 3131.120281] Call Trace: [ 3131.120281] [] change_protection+0x2b4/0x4e0 [ 3131.120281] [] change_prot_numa+0x1b/0x40 [ 3131.120281] [] task_numa_work+0x1f6/0x330 [ 3131.120281] [] task_work_run+0xc4/0xf0 [ 3131.120281] [] do_notify_resume+0x97/0xb0 [ 3131.120281] [] retint_signal+0x4d/0x9f [ 3131.120281] Code: 66 90 48 8b 7d b8 e8 f6 75 22 03 48 8b 45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b <0f> 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 [ 3131.120281] RIP [] change_pte_range+0x4ea/0x4f0 [ 3131.120281] 
RSP > And please attach a disassembly of change_protection_range() (noting > which of the dumps it corresponds to, in case it has changed around): > "Code" just shows a cluster of ud2s for the unlikely bugs at end of the > function, we cannot tell at all what should be in the registers by
Re: mm: BUG in unmap_page_range
On Tue, Sep 09, 2014 at 07:45:26PM -0700, Hugh Dickins wrote: > On Tue, 9 Sep 2014, Sasha Levin wrote: > > On 09/09/2014 05:33 PM, Mel Gorman wrote: > > > On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote: > > >> On 09/08/2014 01:18 PM, Mel Gorman wrote: > > >>> A worse possibility is that somehow the lock is getting corrupted but > > >>> that's also a tough sell considering that the locks should be allocated > > >>> from a dedicated cache. I guess I could try breaking that to allocate > > >>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very > > >>> optimistic. > > >> > > >> I did see ptl corruption couple days ago: > > >> > > >> https://lkml.org/lkml/2014/9/4/599 > > >> > > >> Could this be related? > > >> > > > > > > Possibly although the likely explanation then would be that there is > > > just general corruption coming from somewhere. Even using your config > > > and applying a patch to make linux-next boot (already in Tejun's tree) > > > I was unable to reproduce the problem after running for several hours. I > > > had to run trinity on tmpfs as ext4 and xfs blew up almost immediately > > > so I have a few questions. > > > > I agree it could be a case of random corruption somewhere else, it's just > > that the amount of times this exact issue reproduced > > Yes, I doubt it's random corruption; but I've been no more successful > than Mel in working it out (I share responsibility for that VM_BUG_ON). > > Sasha, you say you're getting plenty of these now, but I've only seen > the dump for one of them, on Aug26: please post a few more dumps, so > that we can look for commonality. > It's also worth knowing that this is a test running in KVM and fake NUMA. The hint was that the filesystem used was virtio-9p. I haven't formulated a theory on how KVM could cause any damage here but it's interesting. 
> And please attach a disassembly of change_protection_range() (noting > which of the dumps it corresponds to, in case it has changed around): > "Code" just shows a cluster of ud2s for the unlikely bugs at end of the > function, we cannot tell at all what should be in the registers by then. > > I've been rather assuming that the 9d340902 seen in many of the > registers in that Aug26 dump is the pte val in question: that's > SOFT_DIRTY|PROTNONE|RW. > > I think RW on PROTNONE is unusual but not impossible (migration entry > replacement racing with mprotect setting PROT_NONE, after it's updated > vm_page_prot, before it's reached the page table). At the risk of sounding thick, I need to spell this out because I'm having trouble seeing exactly what race you are thinking of. Migration entry replacement is protected against parallel NUMA hinting updates by the page table lock (either PMD or PTE level). It's taken by remove_migration_pte on one side and lock_pte_protection on the other. For the mprotect case racing against migration, migration entries are not present so change_pte_range() should ignore it. On migration completion the VMA flags determine the permissions of the new PTE. Parallel faults wait on the migration entry and see the correct value afterwards. When creating migration entries, try_to_unmap calls page_check_address which takes the PTL before doing anything. On the mprotect side, lock_pte_protection will block before seeing PROTNONE. I think the race you are thinking of is a migration entry created for write, parallel mprotect(PROTNONE) and migration completion. The migration entry was created for write but remove_migration_pte does not double check the VMA protections and mmap_sem is not taken for write across a full migration to protect against changes to vm_page_prot. However, change_pte_range checks for migration entries marked for write under the PTL and marks them read if one is encountered. 
The consequence is that we potentially take a spurious fault to mark the PTE write again after migration completes but I can't see how that causes a problem as such. I'm missing some part of your reasoning that leads to the RW|PROTNONE :( > But exciting though > that line of thought is, I cannot actually bring it to a pte_mknuma bug, > or any bug at all. > On x86, PROTNONE|RW translates as GLOBAL|RW which would be unexpected. It wouldn't cause this bug but it's sufficiently suspicious to be worth correcting. In case this is the race you're thinking of, the patch is below. Unfortunately, I cannot see how it would affect this problem but worth giving a whirl anyway. > Mel, no way can it be the cause of this bug - unless Sasha's later > traces actually show a different stack - but I don't see the call > to change_prot_numa() from queue_pages_range() sharing the same > avoidance of PROT_NONE that task_numa_work() has (though it does > have an outdated comment about PROT_NONE which should be removed). > So I think that site probably does need PROT_NONE checking added. > That site should have checked PROT_NONE but it can't be the same bug that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY according to git grep of the trinity source. Worth adding this to the debugging mix?
Re: mm: BUG in unmap_page_range
On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote: SNIP, haven't digested the rest I've spotted a new trace in overnight fuzzing, it could be related to this issue: [ 3494.324839] general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3494.332153] Dumping ftrace buffer: [ 3494.332153](ftrace buffer empty) [ 3494.332153] Modules linked in: [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135 [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: 8804d491c000 [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 kernel/sched/fair.c:1956) [ 3494.332153] RSP: :8804d491feb8 EFLAGS: 00010206 [ 3494.332153] RAX: RBX: 8804bf4e8000 RCX: e8e8 [ 3494.343974] RDX: 000a RSI: RDI: 8804bd6d4da8 [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: 7f53e443c000 [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: 88047e52b000 [ 3494.343974] FS: 7f53e463f700() GS:880277e0() knlGS: [ 3494.343974] CS: 0010 DS: ES: CR0: 8005003b [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: 06a0 [ 3494.369895] DR0: 006f DR1: DR2: [ 3494.369895] DR3: DR6: 0ff0 DR7: 0600 [ 3494.380081] Stack: [ 3494.380081] 8804bf4e80a8 0014 7f53e4437000 [ 3494.380081] 9b976e70 88047e52bbd8 88047e52b000 [ 3494.380081] 8804d491ff28 95193d84 0002 8804d491ff58 [ 3494.380081] Call Trace: [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1)) [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 arch/x86/kernel/signal.c:758) [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918) [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 c6 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 49 f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00 Shot in dark, can you test this please? 
Pagetable teardown can schedule and I'm wondering if we are trying to add hinting faults to an address space that is in the process of going away. The TASK_DEAD check is bogus so replacing it. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7ea6006..007fc1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1810,7 +1810,7 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) return; /* Do not worry about placement if exiting */ - if (p->state == TASK_DEAD) + if (p->flags & PF_EXITING) return; /* Allocate buffer to track faults on a per-node basis */ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)
On 09/10/2014 08:47 AM, Mel Gorman wrote: That site should have checked PROT_NONE but it can't be the same bug that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY according to git grep of the trinity source. Actually, if I'm reading it correctly I think that Trinity handles mbind() calls wrong. It passes the wrong values for mode flags and actual flags. Thanks, Sasha
Re: Trinity and mbind flags (WAS: Re: mm: BUG in unmap_page_range)
On Wed, Sep 10, 2014 at 10:24:40AM -0400, Sasha Levin wrote: On 09/10/2014 08:47 AM, Mel Gorman wrote: That site should have checked PROT_NONE but it can't be the same bug that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY according to git grep of the trinity source. Actually, if I'm reading it correctly I think that Trinity handles mbind() calls wrong. It passes the wrong values for mode flags and actual flags. Ugh, I think you're right. I misinterpreted the man page that mentions that flags like MPOL_F_STATIC_NODES/RELATIVE_NODES are OR'd with the mode, and instead dumped those flags into .. the flags field. So the 'flags' argument it generates is crap, because I didn't add any of the actual correct values. I'll fix it up, though if it's currently finding bugs, you might want to keep the current syscalls/mbind.c for now. Dave
Re: mm: BUG in unmap_page_range
On 09/10/2014 09:40 AM, Mel Gorman wrote: On Wed, Sep 10, 2014 at 09:12:04AM -0400, Sasha Levin wrote: SNIP, haven't digested the rest I've spotted a new trace in overnight fuzzing, it could be related to this issue: [ 3494.324839] general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3494.332153] Dumping ftrace buffer: [ 3494.332153](ftrace buffer empty) [ 3494.332153] Modules linked in: [ 3494.332153] CPU: 8 PID: 2727 Comm: trinity-c929 Not tainted 3.17.0-rc4-next-20140909-sasha-00032-gc16d47b #1135 [ 3494.332153] task: 88047e52b000 ti: 8804d491c000 task.ti: 8804d491c000 [ 3494.332153] RIP: task_numa_work (include/linux/mempolicy.h:177 kernel/sched/fair.c:1956) [ 3494.332153] RSP: :8804d491feb8 EFLAGS: 00010206 [ 3494.332153] RAX: RBX: 8804bf4e8000 RCX: e8e8 [ 3494.343974] RDX: 000a RSI: RDI: 8804bd6d4da8 [ 3494.343974] RBP: 8804d491fef8 R08: 8804bf4e84c8 R09: [ 3494.343974] R10: 7f53e443c000 R11: 0001 R12: 7f53e443c000 [ 3494.343974] R13: dc51 R14: 006f732e61727478 R15: 88047e52b000 [ 3494.343974] FS: 7f53e463f700() GS:880277e0() knlGS: [ 3494.343974] CS: 0010 DS: ES: CR0: 8005003b [ 3494.369895] CR2: 01670fa8 CR3: 000283562000 CR4: 06a0 [ 3494.369895] DR0: 006f DR1: DR2: [ 3494.369895] DR3: DR6: 0ff0 DR7: 0600 [ 3494.380081] Stack: [ 3494.380081] 8804bf4e80a8 0014 7f53e4437000 [ 3494.380081] 9b976e70 88047e52bbd8 88047e52b000 [ 3494.380081] 8804d491ff28 95193d84 0002 8804d491ff58 [ 3494.380081] Call Trace: [ 3494.380081] task_work_run (kernel/task_work.c:125 (discriminator 1)) [ 3494.380081] do_notify_resume (include/linux/tracehook.h:190 arch/x86/kernel/signal.c:758) [ 3494.380081] retint_signal (arch/x86/kernel/entry_64.S:918) [ 3494.380081] Code: e8 1e e5 01 00 48 89 df 4c 89 e6 e8 a3 2d 13 00 49 89 c6 48 85 c0 0f 84 07 02 00 00 48 c7 45 c8 00 00 00 00 0f 1f 80 00 00 00 00 49 f7 46 50 00 44 00 00 0f 85 42 01 00 00 49 8b 86 a0 00 00 00 Shot in dark, can you test this please? 
Pagetable teardown can schedule and I'm wondering if we are trying to add hinting faults to an address space that is in the process of going away. The TASK_DEAD check is bogus so replacing it. Mel, I ran today's -next with both of your patches, but the issue still remains: [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724! [ 3114.541857] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3114.543112] Dumping ftrace buffer: [ 3114.544056](ftrace buffer empty) [ 3114.545000] Modules linked in: [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 88076f584000 [ 3114.549284] RIP: 0010:[952e527a] [952e527a] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP: :88076f587d68 EFLAGS: 00010246 [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100 [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900 [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5 [ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0 [ 3114.550028] R13: 8025 R14: 41343000 R15: cfff [ 3114.550028] FS: 7fabb91c8700() GS:88025ec0() knlGS: [ 3114.550028] CS: 0010 DS: ES: CR0: 8005003b [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0 [ 3114.550028] DR0: 006f DR1: DR2: [ 3114.550028] DR3: DR6: 0ff0 DR7: 00050602 [ 3114.550028] Stack: [ 3114.550028] 0001 000314625900 0018 8802685f2260 [ 3114.550028] 1684 8802cf973600 88061684 41343000 [ 3114.550028] 880108805048 41005000 4120 41343000 [ 3114.550028] Call Trace: [ 3114.550028] [952e5534] change_protection+0x2b4/0x4e0 [ 3114.550028] [952ff24b] change_prot_numa+0x1b/0x40 [ 3114.550028] [951adf16] task_numa_work+0x1f6/0x330 [ 3114.550028] [95193de4] task_work_run+0xc4/0xf0 [ 3114.550028] [95071477] do_notify_resume+0x97/0xb0 [ 3114.550028] [9850f06a] int_signal+0x12/0x17 [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 
66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41
Re: mm: BUG in unmap_page_range
On 09/10/2014 08:47 AM, Mel Gorman wrote: migrate: debug patch to try identify race between migration completion and mprotect A migration entry is marked as write if pte_write was true at the time the entry was created. The VMA protections are not double checked when migration entries are being removed but mprotect itself will mark write-migration-entries as read to avoid problems. It means we potentially take a spurious fault to mark these ptes write again but otherwise it's harmless. Still, one dump indicates that this situation can actually happen so this debugging patch spits out a warning if the situation occurs and hopefully the resulting warning will contain a clue as to how exactly it happens Not-signed-off --- mm/migrate.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 09d489c..631725c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma, pte = pte_mkold(mk_pte(new, vma-vm_page_prot)); if (pte_swp_soft_dirty(*ptep)) pte = pte_mksoft_dirty(pte); - if (is_write_migration_entry(entry)) - pte = pte_mkwrite(pte); + if (is_write_migration_entry(entry)) { + /* + * This WARN_ON_ONCE is temporary for the purposes of seeing if + * it's a case encountered by trinity in Sasha's testing + */ + if (!(vma-vm_flags (VM_WRITE))) + WARN_ON_ONCE(1); + else + pte = pte_mkwrite(pte); + } #ifdef CONFIG_HUGETLB_PAGE if (PageHuge(new)) { pte = pte_mkhuge(pte); I seem to have hit this warning: [ 4782.617806] WARNING: CPU: 10 PID: 21180 at mm/migrate.c:155 remove_migration_pte+0x3f7/0x420() [ 4782.619315] Modules linked in: [ 4782.622189] [ 4782.622501] CPU: 10 PID: 21180 Comm: trinity-main Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 4782.624344] 0009 8800193eb770 a04c742a [ 4782.627801] 8800193eb7a8 9d16e55d 7f2458d89000 880120959600 [ 4782.629283] 88012b02c000 ea002abeab00 88063118da90 8800193eb7b8 [ 
4782.631353] Call Trace: [ 4782.633789] [a04c742a] dump_stack+0x4e/0x7a [ 4782.634314] [9d16e55d] warn_slowpath_common+0x7d/0xa0 [ 4782.634877] [9d16e63a] warn_slowpath_null+0x1a/0x20 [ 4782.635430] [9d315487] remove_migration_pte+0x3f7/0x420 [ 4782.636042] [9d2e99cf] rmap_walk+0xef/0x380 [ 4782.636544] [9d3147f1] remove_migration_ptes+0x41/0x50 [ 4782.637130] [9d315090] ? __migration_entry_wait.isra.24+0x160/0x160 [ 4782.639928] [9d3154b0] ? remove_migration_pte+0x420/0x420 [ 4782.640616] [9d31671b] move_to_new_page+0x16b/0x230 [ 4782.641251] [9d2e9e8c] ? try_to_unmap+0x6c/0xf0 [ 4782.643950] [9d2e88a0] ? try_to_unmap_nonlinear+0x5c0/0x5c0 [ 4782.644690] [9d2e70a0] ? invalid_migration_vma+0x30/0x30 [ 4782.645273] [9d2e82e0] ? page_remove_rmap+0x320/0x320 [ 4782.646072] [9d31717c] migrate_pages+0x85c/0x930 [ 4782.646701] [9d2d0e20] ? isolate_freepages_block+0x410/0x410 [ 4782.647407] [9d2cfa60] ? arch_local_save_flags+0x30/0x30 [ 4782.648114] [9d2d1803] compact_zone+0x4d3/0x8a0 [ 4782.650157] [9d2d1c2f] compact_zone_order+0x5f/0xa0 [ 4782.651014] [9d2d1f87] try_to_compact_pages+0x127/0x2f0 [ 4782.651656] [9d2b0c98] __alloc_pages_direct_compact+0x68/0x200 [ 4782.652313] [9d2b17ca] __alloc_pages_nodemask+0x99a/0xd90 [ 4782.652916] [9d300a1c] alloc_pages_vma+0x13c/0x270 [ 4782.653618] [9d31d914] ? do_huge_pmd_wp_page+0x494/0xc90 [ 4782.654487] [9d31d914] do_huge_pmd_wp_page+0x494/0xc90 [ 4782.656045] [9d320d20] ? __mem_cgroup_count_vm_event+0xd0/0x240 [ 4782.657089] [9d2dcb7d] handle_mm_fault+0x8bd/0xc50 [ 4782.660931] [9d1d26e6] ? __lock_is_held+0x56/0x80 [ 4782.662695] [9d0c7bc7] __do_page_fault+0x1b7/0x660 [ 4782.663259] [9d1cdc5e] ? put_lock_stats.isra.13+0xe/0x30 [ 4782.663851] [9d1abf41] ? vtime_account_user+0x91/0xa0 [ 4782.664419] [9d2a2c35] ? context_tracking_user_exit+0xb5/0x1b0 [ 4782.665119] [9db6e103] ? __this_cpu_preempt_check+0x13/0x20 [ 4782.665969] [9d1ce2e2] ? 
trace_hardirqs_off_caller+0xe2/0x1b0 [ 4782.34] [9d0c8141] trace_do_page_fault+0x51/0x2b0 [ 4782.667257] [9d0bee83] do_async_page_fault+0x63/0xd0 [ 4782.667871] [a0511018] async_page_fault+0x28/0x30 Although it wasn't followed by anything else, and I've seen the original issue getting triggered without this WARN showing up, so it seems like a different, unrelated issue? Thanks, Sasha
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Sasha Levin wrote: On 09/09/2014 10:45 PM, Hugh Dickins wrote: Sasha, you say you're getting plenty of these now, but I've only seen the dump for one of them, on Aug26: please post a few more dumps, so that we can look for commonality. I wasn't saving older logs for this issue so I only have 2 traces from tonight. If that's not enough please let me know and I'll try to add a few more. Thanks, these two are useful, mainly because the register contents most likely to be ptes are in both of these ...900, with no sign of a ...902. So the RW bit I got excited about yesterday is clearly not necessary for the bug (though it's still possible that it was good for implicating page migration, and page migration still plays a part in the story). And please attach a disassembly of change_protection_range() (noting which of the dumps it corresponds to, in case it has changed around): "Code" just shows a cluster of ud2s for the unlikely bugs at end of the function, we cannot tell at all what should be in the registers by then. change_protection_range() got inlined into change_protection(), it applies to both traces above: Thanks for supplying, but the change in inlining means that change_protection_range() and change_protection() are no longer relevant for these traces, we now need to see change_pte_range() instead, to confirm that what I expect are ptes are indeed ptes. If you can include line numbers (objdump -ld) in the disassembly, so much the better, but should be decipherable without. (Or objdump -Sd for source, but I often find that harder to unscramble, can't say why.) Thanks, Hugh
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Mel Gorman wrote: On Tue, Sep 09, 2014 at 07:45:26PM -0700, Hugh Dickins wrote: I've been rather assuming that the 9d340902 seen in many of the registers in that Aug26 dump is the pte val in question: that's SOFT_DIRTY|PROTNONE|RW. The 900s in the latest dumps imply that that 902 was not important. (If any of them are in fact the pte val.) I think RW on PROTNONE is unusual but not impossible (migration entry replacement racing with mprotect setting PROT_NONE, after it's updated vm_page_prot, before it's reached the page table). At the risk of sounding thick, I need to spell this out because I'm having trouble seeing exactly what race you are thinking of. Migration entry replacement is protected against parallel NUMA hinting updates by the page table lock (either PMD or PTE level). It's taken by remove_migration_pte on one side and lock_pte_protection on the other. For the mprotect case racing against migration, migration entries are not present so change_pte_range() should ignore it. On migration completion the VMA flags determine the permissions of the new PTE. Parallel faults wait on the migration entry and see the correct value afterwards. When creating migration entries, try_to_unmap calls page_check_address which takes the PTL before doing anything. On the mprotect side, lock_pte_protection will block before seeing PROTNONE. I think the race you are thinking of is a migration entry created for write, parallel mprotect(PROTNONE) and migration completion. The migration entry was created for write but remove_migration_pte does not double check the VMA protections and mmap_sem is not taken for write across a full migration to protect against changes to vm_page_prot. Yes, the if (is_write_migration_entry(entry)) pte = pte_mkwrite(pte); arguably should take the latest value of vma->vm_page_prot into account. However, change_pte_range checks for migration entries marked for write under the PTL and marks them read if one is encountered. 
The consequence is that we potentially take a spurious fault to mark the PTE write again after migration completes but I can't see how that causes a problem as such. Yes, once mprotect's page table walk reaches that pte, it updates it correctly along with all the others nearby (which were not migrated), removing the temporary oddity. I'm missing some part of your reasoning that leads to the RW|PROTNONE :( You don't appear to be missing it at all, you are seeing the possibility of an RW|PROTNONE yourself, and how it gets corrected afterwards ("corrected" in quotes because without the present bit, it's not an error). But exciting though that line of thought is, I cannot actually bring it to a pte_mknuma bug, or any bug at all. And I wasn't saying that it led to this bug, just that it was an oddity worth thinking about, and worth mentioning to you, in case you could work out a way it might lead to the bug, when I had failed to do so. But we now (almost) know that 902 is irrelevant to this bug anyway. On x86, PROTNONE|RW translates as GLOBAL|RW which would be unexpected. It's GLOBAL once PRESENT is set, but PROTNONE so long as it is not. wouldn't cause this bug but it's sufficiently suspicious to be worth correcting. In case this is the race you're thinking of, the patch is below. Unfortunately, I cannot see how it would affect this problem but worth giving a whirl anyway. Mel, no way can it be the cause of this bug - unless Sasha's later traces actually show a different stack - but I don't see the call to change_prot_numa() from queue_pages_range() sharing the same avoidance of PROT_NONE that task_numa_work() has (though it does have an outdated comment about PROT_NONE which should be removed). So I think that site probably does need PROT_NONE checking added. That site should have checked PROT_NONE but it can't be the same bug that trinity is seeing. Minimally trinity is unaware of MPOL_MF_LAZY according to git grep of the trinity source. 
Yes, queue_pages_range() is not implicated in any of Sasha's traces. Something to fix, but not relevant to this bug. Worth adding this to the debugging mix? It should warn if it encounters the problem but avoid adding the problematic RW bit. ---8<--- migrate: debug patch to try identify race between migration completion and mprotect A migration entry is marked as write if pte_write was true at the time the entry was created. The VMA protections are not double checked when migration entries are being removed but mprotect itself will mark write-migration-entries as read to avoid problems. It means we potentially take a spurious fault to mark these ptes write again but otherwise it's harmless. Still, one dump indicates that this situation can actually happen so this debugging patch spits out a warning if the situation occurs and hopefully the resulting warning will contain a clue as to how exactly it happens
Re: mm: BUG in unmap_page_range
On 09/10/2014 03:09 PM, Hugh Dickins wrote: Thanks for supplying, but the change in inlining means that change_protection_range() and change_protection() are no longer relevant for these traces, we now need to see change_pte_range() instead, to confirm that what I expect are ptes are indeed ptes. If you can include line numbers (objdump -ld) in the disassembly, so much the better, but should be decipherable without. (Or objdump -Sd for source, but I often find that harder to unscramble, can't say why.) Here it is. Note that the source includes both of Mel's debug patches. For reference, here's one trace of the issue with those patches: [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724! [ 3114.541857] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3114.543112] Dumping ftrace buffer: [ 3114.544056](ftrace buffer empty) [ 3114.545000] Modules linked in: [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 88076f584000 [ 3114.549284] RIP: 0010:[952e527a] [952e527a] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP: :88076f587d68 EFLAGS: 00010246 [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100 [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900 [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5 [ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0 [ 3114.550028] R13: 8025 R14: 41343000 R15: cfff [ 3114.550028] FS: 7fabb91c8700() GS:88025ec0() knlGS: [ 3114.550028] CS: 0010 DS: ES: CR0: 8005003b [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0 [ 3114.550028] DR0: 006f DR1: DR2: [ 3114.550028] DR3: DR6: 0ff0 DR7: 00050602 [ 3114.550028] Stack: [ 3114.550028] 0001 000314625900 0018 8802685f2260 [ 3114.550028] 1684 8802cf973600 88061684 41343000 [ 3114.550028] 880108805048 41005000 4120 41343000 [ 3114.550028] Call Trace: [ 3114.550028] [952e5534] change_protection+0x2b4/0x4e0 [ 
3114.550028] [952ff24b] change_prot_numa+0x1b/0x40 [ 3114.550028] [951adf16] task_numa_work+0x1f6/0x330 [ 3114.550028] [95193de4] task_work_run+0xc4/0xf0 [ 3114.550028] [95071477] do_notify_resume+0x97/0xb0 [ 3114.550028] [9850f06a] int_signal+0x12/0x17 [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 [ 3114.550028] RIP [952e527a] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP 88076f587d68 And the disassembly: change_pte_range: change_pte_range(): /home/sasha/linux-next/mm/mprotect.c:70 0: e8 00 00 00 00 callq 5 change_pte_range+0x5 1: R_X86_64_PC32__fentry__-0x4 5: 55 push %rbp 6: 48 89 e5mov%rsp,%rbp 9: 41 57 push %r15 b: 41 56 push %r14 d: 49 89 cemov%rcx,%r14 10: 41 55 push %r13 12: 4d 89 c5mov%r8,%r13 15: 41 54 push %r12 17: 49 89 f4mov%rsi,%r12 1a: 53 push %rbx 1b: 48 89 d3mov%rdx,%rbx 1e: 48 83 ec 38 sub$0x38,%rsp /home/sasha/linux-next/mm/mprotect.c:71 22: 48 8b 47 40 mov0x40(%rdi),%rax /home/sasha/linux-next/mm/mprotect.c:70 26: 48 89 7d c8 mov%rdi,-0x38(%rbp) lock_pte_protection(): /home/sasha/linux-next/mm/mprotect.c:53 2a: 8b 4d 10mov0x10(%rbp),%ecx change_pte_range(): /home/sasha/linux-next/mm/mprotect.c:70 2d: 44 89 4d c4 mov%r9d,-0x3c(%rbp) /home/sasha/linux-next/mm/mprotect.c:71 31: 48 89 45 d0 mov%rax,-0x30(%rbp) lock_pte_protection(): /home/sasha/linux-next/mm/mprotect.c:53 35: 85 c9 test %ecx,%ecx 37: 0f 84 6b 03 00 00 je 3a8 change_pte_range+0x3a8 pmd_to_page(): /home/sasha/linux-next/include/linux/mm.h:1538 3d: 48 89 f7mov%rsi,%rdi 40: 48 81 e7 00 f0 ff ffand$0xf000,%rdi 47: e8 00 00 00 00 callq 4c change_pte_range+0x4c 48: R_X86_64_PC32 __phys_addr-0x4 4c: 48 ba 00 00 00 00 00movabs $0xea00,%rdx 53: ea ff ff 56: 48 c1 e8 0c shr
Re: mm: BUG in unmap_page_range
On Wed, 10 Sep 2014, Sasha Levin wrote: On 09/10/2014 03:09 PM, Hugh Dickins wrote: Thanks for supplying, but the change in inlining means that change_protection_range() and change_protection() are no longer relevant for these traces, we now need to see change_pte_range() instead, to confirm that what I expect are ptes are indeed ptes. If you can include line numbers (objdump -ld) in the disassembly, so much the better, but should be decipherable without. (Or objdump -Sd for source, but I often find that harder to unscramble, can't say why.) Here it is. Note that the source includes both of Mel's debug patches. For reference, here's one trace of the issue with those patches: [ 3114.540976] kernel BUG at include/asm-generic/pgtable.h:724! [ 3114.541857] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 3114.543112] Dumping ftrace buffer: [ 3114.544056](ftrace buffer empty) [ 3114.545000] Modules linked in: [ 3114.545717] CPU: 18 PID: 30217 Comm: trinity-c617 Tainted: GW 3.17.0-rc4-next-20140910-sasha-00032-g6825fb5-dirty #1137 [ 3114.548058] task: 88041505 ti: 88076f584000 task.ti: 88076f584000 [ 3114.549284] RIP: 0010:[952e527a] [952e527a] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP: :88076f587d68 EFLAGS: 00010246 [ 3114.550028] RAX: 000314625900 RBX: 41218000 RCX: 0100 [ 3114.550028] RDX: 000314625900 RSI: 41218000 RDI: 000314625900 [ 3114.550028] RBP: 88076f587dc8 R08: 8802cf973600 R09: 00b5 [ 3114.550028] R10: 00032c01 R11: 0008 R12: 8802a81070c0 [ 3114.550028] R13: 8025 R14: 41343000 R15: cfff [ 3114.550028] FS: 7fabb91c8700() GS:88025ec0() knlGS: [ 3114.550028] CS: 0010 DS: ES: CR0: 8005003b [ 3114.550028] CR2: 7fffdb7678e8 CR3: 000713935000 CR4: 06a0 [ 3114.550028] DR0: 006f DR1: DR2: [ 3114.550028] DR3: DR6: 0ff0 DR7: 00050602 [ 3114.550028] Stack: [ 3114.550028] 0001 000314625900 0018 8802685f2260 [ 3114.550028] 1684 8802cf973600 88061684 41343000 [ 3114.550028] 880108805048 41005000 4120 41343000 [ 3114.550028] Call Trace: [ 3114.550028] [952e5534] 
change_protection+0x2b4/0x4e0 [ 3114.550028] [952ff24b] change_prot_numa+0x1b/0x40 [ 3114.550028] [951adf16] task_numa_work+0x1f6/0x330 [ 3114.550028] [95193de4] task_work_run+0xc4/0xf0 [ 3114.550028] [95071477] do_notify_resume+0x97/0xb0 [ 3114.550028] [9850f06a] int_signal+0x12/0x17 [ 3114.550028] Code: 66 90 48 8b 7d b8 e8 e6 88 22 03 48 8b 45 b0 e9 6f ff ff ff 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 90 55 48 89 e5 41 57 49 89 d7 41 [ 3114.550028] RIP [952e527a] change_pte_range+0x4ea/0x4f0 [ 3114.550028] RSP 88076f587d68 And the disassembly: ... /home/sasha/linux-next/mm/mprotect.c:105 31d: 48 8b 4d a8 mov-0x58(%rbp),%rcx 321: 81 e1 01 03 00 00 and$0x301,%ecx 327: 48 81 f9 00 02 00 00cmp$0x200,%rcx 32e: 0f 84 0b ff ff ff je 23f change_pte_range+0x23f pte_val(): /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450 334: 48 83 3d 00 00 00 00cmpq $0x0,0x0(%rip)# 33c change_pte_range+0x33c 33b: 00 337: R_X86_64_PC32 pv_mmu_ops+0xe3 ptep_set_numa(): /home/sasha/linux-next/include/asm-generic/pgtable.h:740 33c: 49 8b 3c 24 mov(%r12),%rdi pte_val(): /home/sasha/linux-next/./arch/x86/include/asm/paravirt.h:450 340: 0f 84 12 01 00 00 je 458 change_pte_range+0x458 346: ff 14 25 00 00 00 00callq *0x0 349: R_X86_64_32S pv_mmu_ops+0xe8 pte_mknuma(): /home/sasha/linux-next/include/asm-generic/pgtable.h:724 34d: a8 01 test $0x1,%al 34f: 0f 84 95 01 00 00 je 4ea change_pte_range+0x4ea ... ptep_set_numa(): /home/sasha/linux-next/include/asm-generic/pgtable.h:724 4ea: 0f 0b ud2 Thanks, yes, there is enough in there to be sure that the ...900 is indeed the oldpte. I wasn't expecting that pv_mmu_ops function call, but there's no evidence that it does anything worse than just return in %rax what it's given in %rdi; and the second long on the stack is the -0x58(%rbp) from which oldpte is retrieved for !pte_numa(oldpte) at the beginning of the extract above. 
Hugh
Re: mm: BUG in unmap_page_range
On 09/10/2014 03:36 PM, Hugh Dickins wrote: migrate: debug patch to try identify race between migration completion and mprotect A migration entry is marked as write if pte_write was true at the time the entry was created. The VMA protections are not double checked when migration entries are being removed but mprotect itself will mark write-migration-entries as read to avoid problems. It means we potentially take a spurious fault to mark these ptes write again but otherwise it's harmless. Still, one dump indicates that this situation can actually happen so this debugging patch spits out a warning if the situation occurs and hopefully the resulting warning will contain a clue as to how exactly it happens Not-signed-off --- mm/migrate.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 09d489c..631725c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -146,8 +146,16 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma, pte = pte_mkold(mk_pte(new, vma->vm_page_prot)); if (pte_swp_soft_dirty(*ptep)) pte = pte_mksoft_dirty(pte); - if (is_write_migration_entry(entry)) - pte = pte_mkwrite(pte); + if (is_write_migration_entry(entry)) { + /* + * This WARN_ON_ONCE is temporary for the purposes of seeing if + * it's a case encountered by trinity in Sasha's testing + */ + if (!(vma->vm_flags & VM_WRITE)) + WARN_ON_ONCE(1); + else + pte = pte_mkwrite(pte); + } #ifdef CONFIG_HUGETLB_PAGE if (PageHuge(new)) { pte = pte_mkhuge(pte); Right, and Sasha reports that that can fire, but he sees the bug with this patch in and without that firing. 
I've changed that WARN_ON_ONCE() to a VM_BUG_ON_VMA() to get some useful VMA information out, and got the following: [ 4018.870776] vma 8801a0f1e800 start 7f3fd0ca7000 end 7f3fd16a7000 [ 4018.870776] next 8804e1b89800 prev 88008cd9a000 mm 88054b17d000 [ 4018.870776] prot 120 anon_vma 880bc858a200 vm_ops (null) [ 4018.870776] pgoff 41bc8 file (null) private_data (null) [ 4018.879731] flags: 0x8100070(mayread|maywrite|mayexec|account) [ 4018.881324] [ cut here ] [ 4018.882612] kernel BUG at mm/migrate.c:155! [ 4018.883649] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 4018.889647] Dumping ftrace buffer: [ 4018.890323](ftrace buffer empty) [ 4018.890323] Modules linked in: [ 4018.890323] CPU: 4 PID: 9966 Comm: trinity-main Tainted: GW 3.17.0-rc4-next-20140910-sasha-00042-ga4bad9b-dirty #1140 [ 4018.890323] task: 880695b83000 ti: 880560c44000 task.ti: 880560c44000 [ 4018.890323] RIP: 0010:[9b2fd4c1] [9b2fd4c1] remove_migration_pte+0x3e1/0x3f0 [ 4018.890323] RSP: :880560c477c8 EFLAGS: 00010292 [ 4018.890323] RAX: 0001 RBX: 7f3fd129b000 RCX: [ 4018.890323] RDX: 0001 RSI: 9e4ba395 RDI: 0001 [ 4018.890323] RBP: 880560c47800 R08: 0001 R09: 0001 [ 4018.890323] R10: 00045401 R11: 0001 R12: 8801a0f1e800 [ 4018.890323] R13: 88054b17d000 R14: ea000478eb40 R15: 880122bcf070 [ 4018.890323] FS: 7f3fd55bb700() GS:8803d6a0() knlGS: [ 4018.890323] CS: 0010 DS: ES: CR0: 8005003b [ 4018.890323] CR2: 00fcbca8 CR3: 000561bab000 CR4: 06a0 [ 4018.890323] DR0: 006f DR1: DR2: [ 4018.890323] DR3: DR6: 0ff0 DR7: 0600 [ 4018.890323] Stack: [ 4018.890323] ea00046ed980 88011079c4d8 ea000478eb40 880560c47858 [ 4018.890323] 88019fde0330 000421bc 8801a0f1e800 880560c47848 [ 4018.890323] 9b2d1b0f 880bc858a200 880560c47850 ea000478eb40 [ 4018.890323] Call Trace: [ 4018.890323] [9b2d1b0f] rmap_walk+0x22f/0x380 [ 4018.890323] [9b2fc841] remove_migration_ptes+0x41/0x50 [ 4018.890323] [9b2fd0e0] ? __migration_entry_wait.isra.24+0x160/0x160 [ 4018.890323] [9b2fd4d0] ? 
remove_migration_pte+0x3f0/0x3f0 [ 4018.890323] [9b2fe73b] move_to_new_page+0x16b/0x230 [ 4018.890323] [9b2d1e8c] ? try_to_unmap+0x6c/0xf0 [ 4018.890323] [9b2d08a0] ? try_to_unmap_nonlinear+0x5c0/0x5c0 [ 4018.890323] [9b2cf0a0] ? invalid_migration_vma+0x30/0x30 [ 4018.890323] [9b2d02e0] ? page_remove_rmap+0x320/0x320 [ 4018.890323] [9b2ff19c] migrate_pages+0x85c/0x930 [ 4018.890323] [9b2b8e20] ? isolate_freepages_block+0x410/0x410 [ 4018.890323] [9b2b7a60] ? arch_local_save_flags+0x30/0x30 [ 4018.890323] [9b2b9803]
Re: mm: BUG in unmap_page_range
On Tue, 9 Sep 2014, Sasha Levin wrote: > On 09/09/2014 05:33 PM, Mel Gorman wrote: > > On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote: > >> On 09/08/2014 01:18 PM, Mel Gorman wrote: > >>> A worse possibility is that somehow the lock is getting corrupted but > >>> that's also a tough sell considering that the locks should be allocated > >>> from a dedicated cache. I guess I could try breaking that to allocate > >>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very > >>> optimistic. > >> > >> I did see ptl corruption couple days ago: > >> > >>https://lkml.org/lkml/2014/9/4/599 > >> > >> Could this be related? > >> > > > > Possibly although the likely explanation then would be that there is > > just general corruption coming from somewhere. Even using your config > > and applying a patch to make linux-next boot (already in Tejun's tree) > > I was unable to reproduce the problem after running for several hours. I > > had to run trinity on tmpfs as ext4 and xfs blew up almost immediately > > so I have a few questions. > > I agree it could be a case of random corruption somewhere else, it's just > that the amount of times this exact issue reproduced Yes, I doubt it's random corruption; but I've been no more successful than Mel in working it out (I share responsibility for that VM_BUG_ON). Sasha, you say you're getting plenty of these now, but I've only seen the dump for one of them, on Aug26: please post a few more dumps, so that we can look for commonality. And please attach a disassembly of change_protection_range() (noting which of the dumps it corresponds to, in case it has changed around): "Code" just shows a cluster of ud2s for the unlikely bugs at end of the function, we cannot tell at all what should be in the registers by then. I've been rather assuming that the 9d340902 seen in many of the registers in that Aug26 dump is the pte val in question: that's SOFT_DIRTY|PROTNONE|RW. 
I think RW on PROTNONE is unusual but not impossible (migration entry replacement racing with mprotect setting PROT_NONE, after it's updated vm_page_prot, before it's reached the page table). But exciting though that line of thought is, I cannot actually bring it to a pte_mknuma bug, or any bug at all. Mel, no way can it be the cause of this bug - unless Sasha's later traces actually show a different stack - but I don't see the call to change_prot_numa() from queue_pages_range() sharing the same avoidance of PROT_NONE that task_numa_work() has (though it does have an outdated comment about PROT_NONE which should be removed). So I think that site probably does need PROT_NONE checking added.

Hugh
Re: mm: BUG in unmap_page_range
On 09/09/2014 05:33 PM, Mel Gorman wrote: > On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote: >> On 09/08/2014 01:18 PM, Mel Gorman wrote: >>> A worse possibility is that somehow the lock is getting corrupted but >>> that's also a tough sell considering that the locks should be allocated >>> from a dedicated cache. I guess I could try breaking that to allocate >>> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very >>> optimistic. >> >> I did see ptl corruption couple days ago: >> >> https://lkml.org/lkml/2014/9/4/599 >> >> Could this be related? >> > > Possibly although the likely explanation then would be that there is > just general corruption coming from somewhere. Even using your config > and applying a patch to make linux-next boot (already in Tejun's tree) > I was unable to reproduce the problem after running for several hours. I > had to run trinity on tmpfs as ext4 and xfs blew up almost immediately > so I have a few questions. I agree it could be a case of random corruption somewhere else, it's just that the amount of times this exact issue reproduced > 1. What filesystem are you using? virtio-9p. I'm willing to try something more "common" if you feel this could be related, but I haven't seen any issues coming out of 9p in a while now. > 2. What compiler in case it's an experimental compiler? I ask because I >think I saw a patch from you adding support so that the kernel would >build with gcc 5 Right, I've been testing with gcc 5 as well as Debian's gcc 4.7.2, it reproduces with both compilers. > 3. Does your hardware support TSX or anything similarly funky that would >potentially affect locking? 
Not that I know of, here are the cpu flags for reference: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm ida epb dtherm tpr_shadow vnmi flexpriority ept vpid > 4. How many sockets are on your test machine in case reproducing it >depends in a machine large enough to open a timing race? 128 sockets. > As I'm drawing a blank on what would trigger the bug I'm hoping I can > reproduce this locally and experiement a bit. I was thinking about sneaking in something like the following (untested) patch to see if it's really memory corruption that is wiping out stuff: diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 0f9724c..0205655 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -25,6 +25,7 @@ #define _PAGE_BIT_SPLITTING_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */ #define _PAGE_BIT_IOMAP_PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */ #define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */ +#define _PAGE_BIT_SANITY _PAGE_BIT_SOFTW3 /* Memory corruption canary */ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */ @@ -66,6 +67,8 @@ #define _PAGE_HIDDEN (_AT(pteval_t, 0)) #endif +#define _PAGE_SANITY (_AT(pteval_t, 1) << _PAGE_BIT_SANITY) + /* * The same hidden bit is used by kmemcheck, but since kmemcheck * works on kernel pages while soft-dirty engine on user space, @@ -312,7 +315,7 @@ static inline pmdval_t pmd_flags(pmd_t pmd) static inline pte_t native_make_pte(pteval_t val) { - return (pte_t) { .pte = val }; + return (pte_t) { .pte = val | _PAGE_SANITY }; } static inline pteval_t 
native_pte_val(pte_t pte) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index ffea570..bc897a1 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -720,6 +720,8 @@ static inline pmd_t pmd_mknonnuma(pmd_t pmd) static inline pte_t pte_mknuma(pte_t pte) { pteval_t val = pte_val(pte); + + VM_BUG_ON(!(val & _PAGE_SANITY)); VM_BUG_ON(!(val & _PAGE_PRESENT)); Does it make sense at all? Thanks, Sasha
Re: mm: BUG in unmap_page_range
On Mon, Sep 08, 2014 at 01:56:55PM -0400, Sasha Levin wrote: > On 09/08/2014 01:18 PM, Mel Gorman wrote: > > A worse possibility is that somehow the lock is getting corrupted but > > that's also a tough sell considering that the locks should be allocated > > from a dedicated cache. I guess I could try breaking that to allocate > > one page per lock so DEBUG_PAGEALLOC triggers but I'm not very > > optimistic. > > I did see ptl corruption couple days ago: > > https://lkml.org/lkml/2014/9/4/599 > > Could this be related? > Possibly although the likely explanation then would be that there is just general corruption coming from somewhere. Even using your config and applying a patch to make linux-next boot (already in Tejun's tree) I was unable to reproduce the problem after running for several hours. I had to run trinity on tmpfs as ext4 and xfs blew up almost immediately so I have a few questions. 1. What filesystem are you using? 2. What compiler in case it's an experimental compiler? I ask because I think I saw a patch from you adding support so that the kernel would build with gcc 5 3. Does your hardware support TSX or anything similarly funky that would potentially affect locking? 4. How many sockets are on your test machine in case reproducing it depends in a machine large enough to open a timing race? As I'm drawing a blank on what would trigger the bug I'm hoping I can reproduce this locally and experiment a bit. Thanks. -- Mel Gorman SUSE Labs
Re: mm: BUG in unmap_page_range
On 09/08/2014 01:18 PM, Mel Gorman wrote: > A worse possibility is that somehow the lock is getting corrupted but > that's also a tough sell considering that the locks should be allocated > from a dedicated cache. I guess I could try breaking that to allocate > one page per lock so DEBUG_PAGEALLOC triggers but I'm not very > optimistic. I did see ptl corruption couple days ago: https://lkml.org/lkml/2014/9/4/599 Could this be related? Thanks, Sasha
Re: mm: BUG in unmap_page_range
On Thu, Sep 04, 2014 at 05:04:37AM -0400, Sasha Levin wrote: > On 08/29/2014 09:23 PM, Sasha Levin wrote: > > On 08/27/2014 11:26 AM, Mel Gorman wrote: > >> > diff --git a/include/asm-generic/pgtable.h > >> > b/include/asm-generic/pgtable.h > >> > index 281870f..ffea570 100644 > >> > --- a/include/asm-generic/pgtable.h > >> > +++ b/include/asm-generic/pgtable.h > >> > @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte) > >> > > >> > VM_BUG_ON(!(val & _PAGE_PRESENT)); > >> > > >> > +/* debugging only, specific to x86 */ > >> > +VM_BUG_ON(val & _PAGE_PROTNONE); > >> > + > >> > val &= ~_PAGE_PRESENT; > >> > val |= _PAGE_NUMA; > > Triggered again, the first VM_BUG_ON got hit, the second one never did. > > Okay, this bug has reproduced quite a few times since then that I no longer > suspect it's random memory corruption. I'd be happy to try out more debug > patches if you have any leads. > The fact the second one doesn't trigger makes me think that this is not related to how the helpers are called and is instead relating to timing. I tried reproducing this but got nothing after 3 hours. How long does it typically take to reproduce in a given run? You mentioned that it takes a few weeks to hit but maybe the frequency has changed since. I tried todays linux-next kernel but it didn't even boot so next-20140826 to match your original report but got nothing. Can you also send me the config you used in case that's a factor. I had one hunch that this may somehow be related to a collision between pagetable teardown during exit and the scanner but I could not find a way that could actually happen. During teardown there should be only one user of the mm and it can't race with itself. A worse possibility is that somehow the lock is getting corrupted but that's also a tough sell considering that the locks should be allocated from a dedicated cache. I guess I could try breaking that to allocate one page per lock so DEBUG_PAGEALLOC triggers but I'm not very optimistic. 
--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: mm: BUG in unmap_page_range
On 09/08/2014 01:18 PM, Mel Gorman wrote:
> A worse possibility is that somehow the lock is getting corrupted but
> that's also a tough sell considering that the locks should be allocated
> from a dedicated cache. I guess I could try breaking that to allocate one
> page per lock so DEBUG_PAGEALLOC triggers but I'm not very optimistic.

I did see ptl corruption a couple of days ago:

	https://lkml.org/lkml/2014/9/4/599

Could this be related?

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On 08/29/2014 09:23 PM, Sasha Levin wrote:
> On 08/27/2014 11:26 AM, Mel Gorman wrote:
> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > index 281870f..ffea570 100644
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
> >
> > 	VM_BUG_ON(!(val & _PAGE_PRESENT));
> >
> > +	/* debugging only, specific to x86 */
> > +	VM_BUG_ON(val & _PAGE_PROTNONE);
> > +
> > 	val &= ~_PAGE_PRESENT;
> > 	val |= _PAGE_NUMA;
>
> Triggered again, the first VM_BUG_ON got hit, the second one never did.

Okay, this bug has reproduced enough times since then that I no longer
suspect it's random memory corruption. I'd be happy to try out more debug
patches if you have any leads.

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On 08/27/2014 11:26 AM, Mel Gorman wrote:
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 281870f..ffea570 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
>
> 	VM_BUG_ON(!(val & _PAGE_PRESENT));
>
> +	/* debugging only, specific to x86 */
> +	VM_BUG_ON(val & _PAGE_PROTNONE);
> +
> 	val &= ~_PAGE_PRESENT;
> 	val |= _PAGE_NUMA;

Triggered again, the first VM_BUG_ON got hit, the second one never did.

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On 08/27/2014 11:26 AM, Mel Gorman wrote:
> Sasha, how long does it typically take to trigger this? Are you
> using any particular switches for trinity that would trigger the bug
> faster?

It took a couple of weeks (I've been running with it since the beginning of
August). I don't have any special trinity options, just the default fuzzing.
Do you think that focusing on any of the mm syscalls would increase the odds
of hitting it?

There's always the chance that this is a fluke due to corruption somewhere
else. I'll keep running it with the new debug patch and if it won't
reproduce any time soon we can probably safely assume that.

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On Tue, Aug 26, 2014 at 11:16:47PM -0400, Sasha Levin wrote:
> On 08/11/2014 11:28 PM, Sasha Levin wrote:
> > On 08/05/2014 09:04 PM, Sasha Levin wrote:
> > > Thanks Hugh, Mel. I've added both patches to my local tree and will
> > > update tomorrow with the weather.
> > >
> > > Also:
> > >
> > > On 08/05/2014 08:42 PM, Hugh Dickins wrote:
> > > > One thing I did wonder, though: at first I was reassured by the
> > > > VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> > > > it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> > > > - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> > > > (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
> > >
> > > I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity,
> > > I'll update how that one looks as well.
> >
> > Sorry for the rather long delay.
> >
> > The patch looks fine, the issue didn't reproduce.
> >
> > The added VM_BUG_ON didn't trigger either, so maybe we should consider
> > adding it in.
>
> It took a while, but I've managed to hit that VM_BUG_ON:
>
> [  707.975456] kernel BUG at include/asm-generic/pgtable.h:724!
> [  707.977147] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  707.978974] Dumping ftrace buffer:
> [  707.980110]    (ftrace buffer empty)
> [  707.981221] Modules linked in:
> [  707.982312] CPU: 18 PID: 9488 Comm: trinity-c538 Not tainted 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079
> [  707.982801] task: 880165e28000 ti: 880165e3 task.ti: 880165e3
> [  707.982801] RIP: 0010:[<b42e3dda>]  [<b42e3dda>] change_protection_range+0x94a/0x970
> [  707.982801] RSP: 0018:880165e33d98 EFLAGS: 00010246
> [  707.982801] RAX: 9d340902 RBX: 880511204a08 RCX: 0100
> [  707.982801] RDX: 9d340902 RSI: 41741000 RDI: 9d340902
> [  707.982801] RBP: 880165e33e88 R08: 880708a23c00 R09: 00b52000
> [  707.982801] R10: 1e01 R11: 0008 R12: 41751000
> [  707.982801] R13: 00f7 R14: 9d340902 R15: 41741000
> [  707.982801] FS: 7f358a9aa700() GS:88071c60() knlGS:
> [  707.982801] CS: 0010 DS: ES: CR0: 8005003b
> [  707.982801] CR2: 7f3586b69490 CR3: 000165d88000 CR4: 06a0
> [  707.982801] Stack:
> [  707.982801]  8804db88d058 88070fb17cf0
> [  707.982801]  880165d88000 8801686a5000 4163e000
> [  707.982801]  8801686a5000 0001 0025 41750fff
> [  707.982801] Call Trace:
> [  707.982801]  [<b42e3e14>] change_protection+0x14/0x30
> [  707.982801]  [<b42fda3b>] change_prot_numa+0x1b/0x40
> [  707.982801]  [<b41ad766>] task_numa_work+0x1f6/0x330
> [  707.982801]  [<b41937c4>] task_work_run+0xc4/0xf0
> [  707.982801]  [<b40712e7>] do_notify_resume+0x97/0xb0
> [  707.982801]  [<b74fd6ea>] int_signal+0x12/0x17
> [  707.982801] Code: e8 2c 84 21 03 e9 72 ff ff ff 0f 1f 80 00 00 00 00 0f 0b 48 8b 7d a8 4c 89 f2 4c 89 fe e8 9f 7b 03 00 e9 47 f9 ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 48 8b b5 70 ff ff ff 4c 89 ea 48 89 c7 e8 10 d5 01
> [  707.982801] RIP [<b42e3dda>] change_protection_range+0x94a/0x970
> [  707.982801] RSP <880165e33d98>

The tests to reach here are

	pte_present	any of _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA
	pte_numa	only _PAGE_NUMA out of _PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA
	VM_BUG_ON	not set _PAGE_PRESENT

To trigger the bug the PTE bits must then be _PAGE_PROTNONE | _PAGE_NUMA.

The NUMA PTE scanner is skipping PROT_NONE VMAs so it should be "impossible"
for it to be set there. The mmap_sem is held for read during scans so the
protections should not be altering underneath us and the PTL is held against
parallel faults. That leaves setting prot_none leaving _PAGE_NUMA behind.
Potentially that's an issue due to

	/* Set of bits not changed in pte_modify */
	#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
				 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
				 _PAGE_SOFT_DIRTY | _PAGE_NUMA)

The _PAGE_NUMA bit is not cleared as removing it potentially leaves the PTE
in an unexpected state due to a "present" PTE marked for a NUMA hinting
fault becoming non-present. Instead there is this check in
change_pte_range() to move PTEs to a known state before changing protections

	if (pte_numa(ptent))
		ptent = pte_mknonnuma(ptent);
	ptent = pte_modify(ptent, newprot);

So right now, I'm not seeing what path gets us to this inconsistent state.

Sasha, how long does it typically take to trigger this? Are you using any
particular switches for trinity that would trigger the bug faster?

This untested patch might help pinpoint the source of the corruption early though it's
Re: mm: BUG in unmap_page_range
On 08/11/2014 11:28 PM, Sasha Levin wrote:
> On 08/05/2014 09:04 PM, Sasha Levin wrote:
> > Thanks Hugh, Mel. I've added both patches to my local tree and will
> > update tomorrow with the weather.
> >
> > Also:
> >
> > On 08/05/2014 08:42 PM, Hugh Dickins wrote:
> > > One thing I did wonder, though: at first I was reassured by the
> > > VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> > > it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> > > - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> > > (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
> >
> > I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll
> > update how that one looks as well.
>
> Sorry for the rather long delay.
>
> The patch looks fine, the issue didn't reproduce.
>
> The added VM_BUG_ON didn't trigger either, so maybe we should consider
> adding it in.

It took a while, but I've managed to hit that VM_BUG_ON:

[  707.975456] kernel BUG at include/asm-generic/pgtable.h:724!
[  707.977147] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  707.978974] Dumping ftrace buffer:
[  707.980110]    (ftrace buffer empty)
[  707.981221] Modules linked in:
[  707.982312] CPU: 18 PID: 9488 Comm: trinity-c538 Not tainted 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079
[  707.982801] task: 880165e28000 ti: 880165e3 task.ti: 880165e3
[  707.982801] RIP: 0010:[<b42e3dda>]  [<b42e3dda>] change_protection_range+0x94a/0x970
[  707.982801] RSP: 0018:880165e33d98 EFLAGS: 00010246
[  707.982801] RAX: 9d340902 RBX: 880511204a08 RCX: 0100
[  707.982801] RDX: 9d340902 RSI: 41741000 RDI: 9d340902
[  707.982801] RBP: 880165e33e88 R08: 880708a23c00 R09: 00b52000
[  707.982801] R10: 1e01 R11: 0008 R12: 41751000
[  707.982801] R13: 00f7 R14: 9d340902 R15: 41741000
[  707.982801] FS: 7f358a9aa700() GS:88071c60() knlGS:
[  707.982801] CS: 0010 DS: ES: CR0: 8005003b
[  707.982801] CR2: 7f3586b69490 CR3: 000165d88000 CR4: 06a0
[  707.982801] Stack:
[  707.982801]  8804db88d058 88070fb17cf0
[  707.982801]  880165d88000 8801686a5000 4163e000
[  707.982801]  8801686a5000 0001 0025 41750fff
[  707.982801] Call Trace:
[  707.982801]  [<b42e3e14>] change_protection+0x14/0x30
[  707.982801]  [<b42fda3b>] change_prot_numa+0x1b/0x40
[  707.982801]  [<b41ad766>] task_numa_work+0x1f6/0x330
[  707.982801]  [<b41937c4>] task_work_run+0xc4/0xf0
[  707.982801]  [<b40712e7>] do_notify_resume+0x97/0xb0
[  707.982801]  [<b74fd6ea>] int_signal+0x12/0x17
[  707.982801] Code: e8 2c 84 21 03 e9 72 ff ff ff 0f 1f 80 00 00 00 00 0f 0b 48 8b 7d a8 4c 89 f2 4c 89 fe e8 9f 7b 03 00 e9 47 f9 ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 48 8b b5 70 ff ff ff 4c 89 ea 48 89 c7 e8 10 d5 01
[  707.982801] RIP [<b42e3dda>] change_protection_range+0x94a/0x970
[  707.982801] RSP <880165e33d98>

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
On 08/05/2014 09:04 PM, Sasha Levin wrote:
> Thanks Hugh, Mel. I've added both patches to my local tree and will update
> tomorrow with the weather.
>
> Also:
>
> On 08/05/2014 08:42 PM, Hugh Dickins wrote:
> > One thing I did wonder, though: at first I was reassured by the
> > VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> > it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger
> > - asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> > (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
>
> I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll
> update how that one looks as well.

Sorry for the rather long delay.

The patch looks fine, the issue didn't reproduce.

The added VM_BUG_ON didn't trigger either, so maybe we should consider
adding it in.

Thanks,
Sasha
Re: mm: BUG in unmap_page_range
Mel Gorman <mgor...@suse.de> writes:
> On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote:
>> > -#define pmd_mknonnuma pmd_mknonnuma
>> > -static inline pmd_t pmd_mknonnuma(pmd_t pmd)
>> > +/*
>> > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist
>> > + * which was inherited from x86. For the purposes of powerpc pte_basic_t is
>> > + * equivalent
>> > + */
>> > +#define pteval_t pte_basic_t
>> > +#define pmdval_t pmd_t
>> > +static inline pteval_t pte_flags(pte_t pte)
>> > {
>> > -	return pte_pmd(pte_mknonnuma(pmd_pte(pmd)));
>> > +	return pte_val(pte) & PAGE_PROT_BITS;
>>
>> PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have
>> to check further to find out why the mask doesn't include
>> _PAGE_PRESENT.
>>
> Dumb of me, not sure how I managed that. For the purposes of what is required
> it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask
> that defines what bits are of interest to the generic helpers which is what
> this version attempts to do. It's not tested on powerpc at all
> unfortunately.

Boot tested on ppc64.

# grep numa /proc/vmstat
numa_hit 156722
numa_miss 0
numa_foreign 0
numa_interleave 6365
numa_local 153457
numa_other 3265
numa_pte_updates 169
numa_huge_pte_updates 0
numa_hint_faults 150
numa_hint_faults_local 138
numa_pages_migrated 10

> ---8<---
> mm: Remove misleading ARCH_USES_NUMA_PROT_NONE
>
> ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented
> _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and
> relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting
> fault scanner. This was found to be conceptually confusing with a lot of
> implicit assumptions and it was asked that an alternative be found.
>
> Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the
> PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap
> PTE bits and shrunk the maximum possible swap size but it did not go far
> enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA
> but the relics still exist.
>
> This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary
> duplication in powerpc vs the generic implementation by defining the types
> the core NUMA helpers expected to exist from x86 with their ppc64 equivalent.
> This necessitated that a PTE bit mask be created that identified the bits
> that distinguish present from NUMA pte entries but it is expected this
> will only differ between arches based on _PAGE_PROTNONE. The naming for
> the generic helpers was taken from x86 originally but ppc64 has types that
> are equivalent for the purposes of the helper so they are mapped instead
> of duplicating code.
>
> Signed-off-by: Mel Gorman <mgor...@suse.de>
> ---
>  arch/powerpc/include/asm/pgtable.h    | 57 ---
>  arch/powerpc/include/asm/pte-common.h |  5 +++
>  arch/x86/Kconfig                      |  1 -
>  arch/x86/include/asm/pgtable_types.h  |  7 +
>  include/asm-generic/pgtable.h         | 27 ++---
>  init/Kconfig                          | 11 ---
>  6 files changed, 33 insertions(+), 75 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index d98c1ec..beeb09e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK)
>  static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
>
>  #ifdef CONFIG_NUMA_BALANCING
> -
>  static inline int pte_present(pte_t pte)
>  {
> -	return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA);
> +	return pte_val(pte) & _PAGE_NUMA_MASK;
>  }
>
>  #define pte_present_nonuma pte_present_nonuma
> @@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte)
>  	return pte_val(pte) & (_PAGE_PRESENT);
>  }
>
> -#define pte_numa pte_numa
> -static inline int pte_numa(pte_t pte)
> -{
> -	return (pte_val(pte) &
> -		(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> -}
> -
> -#define pte_mknonnuma pte_mknonnuma
> -static inline pte_t pte_mknonnuma(pte_t pte)
> -{
> -	pte_val(pte) &= ~_PAGE_NUMA;
> -	pte_val(pte) |= _PAGE_PRESENT | _PAGE_ACCESSED;
> -	return pte;
> -}
> -
> -#define pte_mknuma pte_mknuma
> -static inline pte_t pte_mknuma(pte_t pte)
> -{
> -	/*
> -	 * We should not set _PAGE_NUMA on non present ptes. Also clear the
> -	 * present bit so that hash_page will return 1 and we collect this
> -	 * as numa fault.
> -	 */
> -	if (pte_present(pte)) {
> -		pte_val(pte) |= _PAGE_NUMA;
> -		pte_val(pte) &= ~_PAGE_PRESENT;
> -	} else
> -		VM_BUG_ON(1);
> -	return pte;
> -}
> -
>  #define ptep_set_numa ptep_set_numa
>  static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
>  				 pte_t *ptep)
> @@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
Re: mm: BUG in unmap_page_range
Mel Gorman mgor...@suse.de writes: On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote: -#define pmd_mknonnuma pmd_mknonnuma -static inline pmd_t pmd_mknonnuma(pmd_t pmd) +/* + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist + * which was inherited from x86. For the purposes of powerpc pte_basic_t is + * equivalent + */ +#define pteval_t pte_basic_t +#define pmdval_t pmd_t +static inline pteval_t pte_flags(pte_t pte) { - return pte_pmd(pte_mknonnuma(pmd_pte(pmd))); + return pte_val(pte) & PAGE_PROT_BITS; PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have to check further to find out why the mask doesn't include _PAGE_PRESENT. Dumb of me, not sure how I managed that. For the purposes of what is required it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask that defines what bits are of interest to the generic helpers which is what this version attempts to do. It's not tested on powerpc at all unfortunately. Boot tested on ppc64. # grep numa /proc/vmstat numa_hit 156722 numa_miss 0 numa_foreign 0 numa_interleave 6365 numa_local 153457 numa_other 3265 numa_pte_updates 169 numa_huge_pte_updates 0 numa_hint_faults 150 numa_hint_faults_local 138 numa_pages_migrated 10 ---8<--- mm: Remove misleading ARCH_USES_NUMA_PROT_NONE ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting fault scanner. This was found to be conceptually confusing with a lot of implicit assumptions and it was asked that an alternative be found. Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE bits and shrunk the maximum possible swap size but it did not go far enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA but the relics still exist.
This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary duplication in powerpc vs the generic implementation by defining the types the core NUMA helpers expected to exist from x86 with their ppc64 equivalent. This necessitated that a PTE bit mask be created that identified the bits that distinguish present from NUMA pte entries but it is expected this will only differ between arches based on _PAGE_PROTNONE. The naming for the generic helpers was taken from x86 originally but ppc64 has types that are equivalent for the purposes of the helper so they are mapped instead of duplicating code. Signed-off-by: Mel Gorman mgor...@suse.de --- arch/powerpc/include/asm/pgtable.h| 57 --- arch/powerpc/include/asm/pte-common.h | 5 +++ arch/x86/Kconfig | 1 - arch/x86/include/asm/pgtable_types.h | 7 + include/asm-generic/pgtable.h | 27 ++--- init/Kconfig | 11 --- 6 files changed, 33 insertions(+), 75 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index d98c1ec..beeb09e 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte) { return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING - static inline int pte_present(pte_t pte) { - return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA); + return pte_val(pte) & _PAGE_NUMA_MASK; } #define pte_present_nonuma pte_present_nonuma @@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte) return pte_val(pte) & (_PAGE_PRESENT); } -#define pte_numa pte_numa -static inline int pte_numa(pte_t pte) -{ - return (pte_val(pte) & - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA; -} - -#define pte_mknonnuma pte_mknonnuma -static inline pte_t pte_mknonnuma(pte_t pte) -{ - pte_val(pte) &= ~_PAGE_NUMA; - pte_val(pte) |= _PAGE_PRESENT | _PAGE_ACCESSED; - return pte; -} - -#define pte_mknuma pte_mknuma
-static inline pte_t pte_mknuma(pte_t pte) -{ - /* - * We should not set _PAGE_NUMA on non present ptes. Also clear the - * present bit so that hash_page will return 1 and we collect this - * as numa fault. - */ - if (pte_present(pte)) { - pte_val(pte) |= _PAGE_NUMA; - pte_val(pte) &= ~_PAGE_PRESENT; - } else - VM_BUG_ON(1); - return pte; -} - #define ptep_set_numa ptep_set_numa static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
Re: mm: BUG in unmap_page_range
On Tue, Aug 05, 2014 at 05:42:03PM -0700, Hugh Dickins wrote: > > > > > > I'm attaching a preliminary pair of patches. The first which deals with > > ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised > > changelog. I'm adding Aneesh to the cc to look at the powerpc portion of > > the first patch. > > Thanks a lot, Mel. > > I am surprised by the ordering, but perhaps you meant nothing by it. I didn't mean anything by it. It was based on the order I looked at the patches in. Revisited c46a7c817, looked at ARCH_USES_NUMA_PROT_NONE issue to see if it had any potential impact to your patch and then moved on to your patch. > Isn't the first one a welcome but optional cleanup, and the second one > a fix that we need in 3.16-stable? Or does the fix actually depend in > some unstated way upon the cleanup, in powerpc-land perhaps? > It shouldn't as powerpc can use its old helpers. I've included Aneesh in the cc just in case. > Aside from that, for the first patch: yes, I heartily approve of the > disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and > CONFIG_ARCH_USES_NUMA_PROT_NONE. If you wish, add > Acked-by: Hugh Dickins > but of course it's really Aneesh and powerpc who are the test of it. > Thanks. I have a second version finished for that which I'll send once this bug is addressed. > One thing I did wonder, though: at first I was reassured by the > VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought > it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger > - asserting that indeed we do not put NUMA hints on PROT_NONE areas. > (But I have not tested, perhaps such a VM_BUG_ON would actually fire.) > It shouldn't so I'll use the stronger test. Sasha, if it's not too late would you mind testing this patch in isolation as a -stable candidate for 3.16 please? 
It worked for me including within trinity but then again I was not seeing crashes with 3.16 either so I do not consider my trinity testing to be a reliable indicator. ---8<--- x86,mm: fix pte_special versus pte_numa Sasha Levin has shown oopses on ea0003480048 and ea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfff and 1, which would need special access. Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page. Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A hint that this was a problem was that c46a7c817e66 added pte_numa() test to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast path: This was papering over a pte_special() snag when the zero page was encountered during zap. This patch reverts vm_normal_page() to how it was before, relying on pte_special(). It still appears that this patch may be incomplete: aren't there other places which need to be handling PROTNONE along with PRESENT? For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE area, that would make it pte_special(). This is side-stepped by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be interesting. 
Fixes: c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") Reported-by: Sasha Levin Signed-off-by: Hugh Dickins Signed-off-by: Mel Gorman Cc: sta...@vger.kernel.org [3.16] --- arch/x86/include/asm/pgtable.h | 9 +++-- mm/memory.c| 7 +++ 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 0ec0560..aa97a07 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte) static inline int pte_special(pte_t pte) { - return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) == -(_PAGE_PRESENT|_PAGE_SPECIAL); + /* +* See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h. +* On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 == +* __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL. +*/ + return (pte_flags(pte) & _PAGE_SPECIAL) && + (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE)); } static inline unsigned long pte_pfn(pte_t pte) diff --git a/mm/memory.c b/mm/memory.c index 8b44f76..0a21f3d 100644 ---
Re: mm: BUG in unmap_page_range
On Wed, Aug 06, 2014 at 12:44:45PM +0530, Aneesh Kumar K.V wrote: > > -#define pmd_mknonnuma pmd_mknonnuma > > -static inline pmd_t pmd_mknonnuma(pmd_t pmd) > > +/* > > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist > > + * which was inherited from x86. For the purposes of powerpc pte_basic_t is > > + * equivalent > > + */ > > +#define pteval_t pte_basic_t > > +#define pmdval_t pmd_t > > +static inline pteval_t pte_flags(pte_t pte) > > { > > - return pte_pmd(pte_mknonnuma(pmd_pte(pmd))); > > + return pte_val(pte) & PAGE_PROT_BITS; > > PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have > to check further to find out why the mask doesn't include > _PAGE_PRESENT. > Dumb of me, not sure how I managed that. For the purposes of what is required it doesn't matter what PAGE_PROT_BITS does. It is clearer if there is a mask that defines what bits are of interest to the generic helpers which is what this version attempts to do. It's not tested on powerpc at all unfortunately. ---8<--- mm: Remove misleading ARCH_USES_NUMA_PROT_NONE ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting fault scanner. This was found to be conceptually confusing with a lot of implicit assumptions and it was asked that an alternative be found. Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE bits and shrunk the maximum possible swap size but it did not go far enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA but the relics still exist. This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary duplication in powerpc vs the generic implementation by defining the types the core NUMA helpers expected to exist from x86 with their ppc64 equivalent. 
This necessitated that a PTE bit mask be created that identified the bits that distinguish present from NUMA pte entries but it is expected this will only differ between arches based on _PAGE_PROTNONE. The naming for the generic helpers was taken from x86 originally but ppc64 has types that are equivalent for the purposes of the helper so they are mapped instead of duplicating code. Signed-off-by: Mel Gorman --- arch/powerpc/include/asm/pgtable.h| 57 --- arch/powerpc/include/asm/pte-common.h | 5 +++ arch/x86/Kconfig | 1 - arch/x86/include/asm/pgtable_types.h | 7 + include/asm-generic/pgtable.h | 27 ++--- init/Kconfig | 11 --- 6 files changed, 33 insertions(+), 75 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index d98c1ec..beeb09e 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -38,10 +38,9 @@ static inline int pte_none(pte_t pte){ return (pte_val(pte) & ~_PTE_NONE_MASK) static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } #ifdef CONFIG_NUMA_BALANCING - static inline int pte_present(pte_t pte) { - return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA); + return pte_val(pte) & _PAGE_NUMA_MASK; } #define pte_present_nonuma pte_present_nonuma @@ -50,37 +49,6 @@ static inline int pte_present_nonuma(pte_t pte) return pte_val(pte) & (_PAGE_PRESENT); } -#define pte_numa pte_numa -static inline int pte_numa(pte_t pte) -{ - return (pte_val(pte) & - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA; -} - -#define pte_mknonnuma pte_mknonnuma -static inline pte_t pte_mknonnuma(pte_t pte) -{ - pte_val(pte) &= ~_PAGE_NUMA; - pte_val(pte) |= _PAGE_PRESENT | _PAGE_ACCESSED; - return pte; -} - -#define pte_mknuma pte_mknuma -static inline pte_t pte_mknuma(pte_t pte) -{ - /* -* We should not set _PAGE_NUMA on non present ptes. Also clear the -* present bit so that hash_page will return 1 and we collect this -* as numa fault. 
-*/ - if (pte_present(pte)) { - pte_val(pte) |= _PAGE_NUMA; - pte_val(pte) &= ~_PAGE_PRESENT; - } else - VM_BUG_ON(1); - return pte; -} - #define ptep_set_numa ptep_set_numa static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, pte_t *ptep) @@ -92,12 +60,6 @@ static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr, return; } -#define pmd_numa pmd_numa -static inline int pmd_numa(pmd_t pmd) -{ - return pte_numa(pmd_pte(pmd)); -} - #define pmdp_set_numa pmdp_set_numa static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp) @@ -109,16 +71,21 @@ static inline void
Re: mm: BUG in unmap_page_range
Mel Gorman writes: > From d0c77a2b497da46c52792ead066d461e5111a594 Mon Sep 17 00:00:00 2001 > From: Mel Gorman > Date: Tue, 5 Aug 2014 12:06:50 +0100 > Subject: [PATCH] mm: Remove misleading ARCH_USES_NUMA_PROT_NONE > > ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented > _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and > relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting > fault scanner. This was found to be conceptually confusing with a lot of > implicit assumptions and it was asked that an alternative be found. > > Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the > PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap > PTE bits and shrunk the maximum possible swap size but it did not go far > enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA > but the relics still exist. > > This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary > duplication in powerpc vs the generic implementation by defining the types > the core NUMA helpers expected to exist from x86 with their ppc64 equivalent. > The unification for ppc64 is less than ideal because types do not exist > that the "generic" code expects to. This patch works around the problem > but it would be preferred if the powerpc people would look at this to see > if they have opinions on what might suit them better. 
> > Signed-off-by: Mel Gorman > --- > arch/powerpc/include/asm/pgtable.h | 55 > -- > arch/x86/Kconfig | 1 - > include/asm-generic/pgtable.h | 35 > init/Kconfig | 11 > 4 files changed, 29 insertions(+), 73 deletions(-) > > - > #define pmdp_set_numa pmdp_set_numa > static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, >pmd_t *pmdp) > @@ -109,16 +71,21 @@ static inline void pmdp_set_numa(struct mm_struct *mm, > unsigned long addr, > return; > } > > -#define pmd_mknonnuma pmd_mknonnuma > -static inline pmd_t pmd_mknonnuma(pmd_t pmd) > +/* > + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist > + * which was inherited from x86. For the purposes of powerpc pte_basic_t is > + * equivalent > + */ > +#define pteval_t pte_basic_t > +#define pmdval_t pmd_t > +static inline pteval_t pte_flags(pte_t pte) > { > - return pte_pmd(pte_mknonnuma(pmd_pte(pmd))); > + return pte_val(pte) & PAGE_PROT_BITS; PAGE_PROT_BITS don't get the _PAGE_NUMA and _PAGE_PRESENT. I will have to check further to find out why the mask doesn't include _PAGE_PRESENT. > } > > -#define pmd_mknuma pmd_mknuma > -static inline pmd_t pmd_mknuma(pmd_t pmd) > +static inline pteval_t pmd_flags(pte_t pte) > { static inline pmdval_t ? > - return pte_pmd(pte_mknuma(pmd_pte(pmd))); > + return pmd_val(pte) & PAGE_PROT_BITS; > } > -aneesh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Re: mm: BUG in unmap_page_range
Thanks Hugh, Mel. I've added both patches to my local tree and will update tomorrow with the weather. Also: On 08/05/2014 08:42 PM, Hugh Dickins wrote: > One thing I did wonder, though: at first I was reassured by the > VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought > it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger > - asserting that indeed we do not put NUMA hints on PROT_NONE areas. > (But I have not tested, perhaps such a VM_BUG_ON would actually fire.) I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll update how that one looks as well. Thanks, Sasha
Re: mm: BUG in unmap_page_range
On Tue, 5 Aug 2014, Mel Gorman wrote:
> On Mon, Aug 04, 2014 at 04:40:38AM -0700, Hugh Dickins wrote:
> >
> > [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa
> >
> > Sasha Levin has shown oopses on ea0003480048 and ea0003480008
> > at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
> > where zap_pte_range() checks page->mapping to see if PageAnon(page).
> >
> > Those addresses fit struct pages for pfns d2001 and d2000, and in each
> > dump a register or a stack slot showed d2001730 or d2000730: pte flags
> > 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
> > a hole between cfff and 1, which would need special access.
> >
> > Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on
> > the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
> > pte no longer passes the pte_special() test, so zap_pte_range() goes on
> > to try to access a non-existent struct page.
> >
> :(
>
> > Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
> > to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
> >
> > It's unclear why c46a7c817e66 added pte_numa() test to vm_normal_page(),
> > and moved its is_zero_pfn() test from slow to fast path: I suspect both
> > were papering over PROT_NONE issues seen with inadequate pte_special().
> > Revert vm_normal_page() to how it was before, relying on pte_special().
> >
> Rather than answering directly I updated your changelog
>
>   Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE)
>   to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).
>
>   A hint that this was a problem was that c46a7c817e66 added pte_numa()
>   test to vm_normal_page(), and moved its is_zero_pfn() test from slow to
>   fast path: This was papering over a pte_special() snag when the zero
>   page was encountered during zap. This patch reverts vm_normal_page()
>   to how it was before, relying on pte_special().

Thanks, that's fine.

> > I find it confusing, that the only example of ARCH_USES_NUMA_PROT_NONE
> > no longer uses PROTNONE for NUMA, but SPECIAL instead: update the
> > asm-generic comment a little, but that config option remains unhelpful.
> >
> ARCH_USES_NUMA_PROT_NONE should have been sent to the farm at the same time
> as that patch and by rights unified with the powerpc helpers. With the new
> _PAGE_NUMA bit, there is no reason they should have different implementations
> of pte_numa and related functions. Unfortunately unifying them is a little
> problematic due to differences in fundamental types. It could be done with
> #defines but I'm attaching a preliminary prototype to illustrate the issue.
>
> > But more seriously, I think this patch is incomplete: aren't there
> > other places which need to be handling PROTNONE along with PRESENT?
> > For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA,
> > but on a PROT_NONE area, I think that will now make it pte_special()?
> > So it ought to clear _PAGE_PROTNONE too. Or maybe we can never
> > pte_mknuma() on a PROT_NONE area - there would be no point?
> >
> We are depending on the fact that inaccessible VMAs are skipped by the
> NUMA hinting scanner.

Ah, okay. And the other way round (mprotecting to PROT_NONE an area
which already contains _PAGE_NUMA ptes) already looked safe to me.

> > Around here I began to wonder if it was just a mistake to have deserted
> > the PROTNONE for NUMA model: I know Linus had a strong reaction against
> > it, and I've never delved into its drawbacks myself; but bringing yet
> > another (SPECIAL) flag into the game is not an obvious improvement.
> > Should we just revert c46a7c817e66, or would that be a mistake?
> >
> It's replacing one type of complexity with another. The downside is that
> _PAGE_NUMA == _PAGE_PROTNONE puts subtle traps all over the core for
> powerpc to fall foul of.

Okay.

> I'm attaching a preliminary pair of patches. The first which deals with
> ARCH_USES_NUMA_PROT_NONE and the second which is yours with a revised
> changelog. I'm adding Aneesh to the cc to look at the powerpc portion of
> the first patch.

Thanks a lot, Mel. I am surprised by the ordering, but perhaps you meant
nothing by it. Isn't the first one a welcome but optional cleanup, and the
second one a fix that we need in 3.16-stable? Or does the fix actually
depend in some unstated way upon the cleanup, in powerpc-land perhaps?

Aside from that, for the first patch: yes, I heartily approve of the
disappearance of CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE and
CONFIG_ARCH_USES_NUMA_PROT_NONE. If you wish, add
Acked-by: Hugh Dickins
but of course it's really Aneesh and powerpc who are the test of it.

One thing I did wonder, though: at first I was reassured by the
VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger -
asserting that indeed we do not put NUMA hints on PROT_NONE areas.
(But I have not tested, perhaps such a VM_BUG_ON would actually fire.)
Re: mm: BUG in unmap_page_range
On Mon, Aug 04, 2014 at 04:40:38AM -0700, Hugh Dickins wrote: > On Sat, 2 Aug 2014, Sasha Levin wrote: > > > Hi all, > > > > While fuzzing with trinity inside a KVM tools guest running the latest -next > > kernel, I've stumbled on the following spew: > > > > [ 2957.087977] BUG: unable to handle kernel paging request at > > ea0003480008 > > [ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 > > mm/memory.c:1277 mm/memory.c:1301) > > [ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0 > > [ 2957.088041] Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC > > [ 2957.088087] Dumping ftrace buffer: > > [ 2957.088266](ftrace buffer empty) > > [ 2957.088279] Modules linked in: > > [ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted > > 3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990 > > [ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: > > 880739fb4000 > > [ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 > > mm/memory.c:1277 mm/memory.c:1301) > > [ 2957.088328] RSP: 0018:880739fb7c58 EFLAGS: 00010246 > > [ 2957.088336] RAX: RBX: 880eb2bdbed8 RCX: > > dfff971b4280 > > [ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: > > ea0003480008 > > [ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: > > 00b6e000 > > [ 2957.088357] R10: R11: 0001 R12: > > ea000348 > > [ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: > > 7f00e85db000 > > [ 2957.088374] FS: 7f00e85d8700() GS:88177fa0() > > knlGS: > > [ 2957.088381] CS: 0010 DS: ES: CR0: 80050033 > > [ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: > > 06a0 > > [ 2957.088406] DR0: DR1: DR2: > > > > [ 2957.088413] DR3: DR6: 0ff0 DR7: > > 0600 > > [ 2957.088416] Stack: > > [ 2957.088432] 88171726d570 0010 0008 > > d2000730 > > [ 2957.088450] 19d00250 7f00e85dc000 880f9d311900 > > 880739fb7e20 > > [ 2957.088466] 8807a8c507a0 8807a8c5 8807a75fe000 > > 8807ceaa7a10 > > [ 2957.088469] Call Trace: > > [ 2957.088490] unmap_single_vma (mm/memory.c:1348) > > [ 2957.088505] unmap_vmas 
(mm/memory.c:1375 (discriminator 3)) > > [ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4)) > > [ 2957.088542] ? vma_rb_erase (mm/mmap.c:454 > > include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 > > mm/mmap.c:493) > > [ 2957.088559] ? vmacache_update (mm/vmacache.c:61) > > [ 2957.088572] do_munmap (mm/mmap.c:2581) > > [ 2957.088583] vm_munmap (mm/mmap.c:2596) > > [ 2957.088595] SyS_munmap (mm/mmap.c:2601) > > [ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541) > > [ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 > > 0f 84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 > > <41> f6 44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd > > All code > > > >0: ff (bad) > >1: ff e8 ljmpq * > >3: f9 stc > >4: 5f pop%rdi > >5: 07 (bad) > >6: 00 48 8badd%cl,-0x75(%rax) > >9: 45 90 rex.RB xchg %eax,%r8d > >b: 80 48 18 01 orb$0x1,0x18(%rax) > >f: 4d 85 e4test %r12,%r12 > > 12: 0f 84 8b fe ff ff je 0xfea3 > > 18: 45 84 edtest %r13b,%r13b > > 1b: 0f 85 fc 03 00 00 jne0x41d > > 21: 49 8d 7c 24 08 lea0x8(%r12),%rdi > > 26: e8 b5 67 07 00 callq 0x767e0 > > 2b:* 41 f6 44 24 08 01 testb $0x1,0x8(%r12) <-- > > trapping instruction > > 31: 0f 84 29 02 00 00 je 0x260 > > 37: 83 6d c8 01 subl $0x1,-0x38(%rbp) > > 3b: 4c 89 e7mov%r12,%rdi > > 3e: e8 .byte 0xe8 > > 3f: bd .byte 0xbd > > This differs in which functions got inlined (unmap_page_range showing up > in place of zap_pte_range), but this is the same "if (PageAnon(page))" > that Sasha reported in the "hang in shmem_fallocate" thread on June 26th. > > I can see what it is now, and here is most of a patch (which I don't > expect to satisfy Trinity yet); at this point I think I had better > hand it over to Mel, to complete or to discard. 
> > [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa > > Sasha Levin has shown oopses on ea0003480048 and ea0003480008 > at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: > where zap_pte_range() checks page->mapping to see if PageAnon(page).
Re: mm: BUG in unmap_page_range
Thanks Hugh, Mel. I've added both patches to my local tree and will
update tomorrow with the weather.

Also: On 08/05/2014 08:42 PM, Hugh Dickins wrote:
> One thing I did wonder, though: at first I was reassured by the
> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought
> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger -
> asserting that indeed we do not put NUMA hints on PROT_NONE areas.
> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.)

I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity,
I'll update how that one looks as well.

Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: mm: BUG in unmap_page_range
On Sat, 2 Aug 2014, Sasha Levin wrote: > Hi all, > > While fuzzing with trinity inside a KVM tools guest running the latest -next > kernel, I've stumbled on the following spew: > > [ 2957.087977] BUG: unable to handle kernel paging request at ea0003480008 > [ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 > mm/memory.c:1277 mm/memory.c:1301) > [ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0 > [ 2957.088041] Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 2957.088087] Dumping ftrace buffer: > [ 2957.088266](ftrace buffer empty) > [ 2957.088279] Modules linked in: > [ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted > 3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990 > [ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: > 880739fb4000 > [ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 > mm/memory.c:1277 mm/memory.c:1301) > [ 2957.088328] RSP: 0018:880739fb7c58 EFLAGS: 00010246 > [ 2957.088336] RAX: RBX: 880eb2bdbed8 RCX: > dfff971b4280 > [ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: > ea0003480008 > [ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: > 00b6e000 > [ 2957.088357] R10: R11: 0001 R12: > ea000348 > [ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: > 7f00e85db000 > [ 2957.088374] FS: 7f00e85d8700() GS:88177fa0() > knlGS: > [ 2957.088381] CS: 0010 DS: ES: CR0: 80050033 > [ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: > 06a0 > [ 2957.088406] DR0: DR1: DR2: > > [ 2957.088413] DR3: DR6: 0ff0 DR7: > 0600 > [ 2957.088416] Stack: > [ 2957.088432] 88171726d570 0010 0008 > d2000730 > [ 2957.088450] 19d00250 7f00e85dc000 880f9d311900 > 880739fb7e20 > [ 2957.088466] 8807a8c507a0 8807a8c5 8807a75fe000 > 8807ceaa7a10 > [ 2957.088469] Call Trace: > [ 2957.088490] unmap_single_vma (mm/memory.c:1348) > [ 2957.088505] unmap_vmas (mm/memory.c:1375 (discriminator 3)) > [ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4)) > [ 2957.088542] ? 
vma_rb_erase (mm/mmap.c:454 > include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 > mm/mmap.c:493) > [ 2957.088559] ? vmacache_update (mm/vmacache.c:61) > [ 2957.088572] do_munmap (mm/mmap.c:2581) > [ 2957.088583] vm_munmap (mm/mmap.c:2596) > [ 2957.088595] SyS_munmap (mm/mmap.c:2601) > [ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541) > [ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 0f > 84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 <41> > f6 44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd > All code > >0: ff (bad) >1: ff e8 ljmpq * >3: f9 stc >4: 5f pop%rdi >5: 07 (bad) >6: 00 48 8badd%cl,-0x75(%rax) >9: 45 90 rex.RB xchg %eax,%r8d >b: 80 48 18 01 orb$0x1,0x18(%rax) >f: 4d 85 e4test %r12,%r12 > 12: 0f 84 8b fe ff ff je 0xfea3 > 18: 45 84 edtest %r13b,%r13b > 1b: 0f 85 fc 03 00 00 jne0x41d > 21: 49 8d 7c 24 08 lea0x8(%r12),%rdi > 26: e8 b5 67 07 00 callq 0x767e0 > 2b:*41 f6 44 24 08 01 testb $0x1,0x8(%r12) <-- > trapping instruction > 31: 0f 84 29 02 00 00 je 0x260 > 37: 83 6d c8 01 subl $0x1,-0x38(%rbp) > 3b: 4c 89 e7mov%r12,%rdi > 3e: e8 .byte 0xe8 > 3f: bd .byte 0xbd This differs in which functions got inlined (unmap_page_range showing up in place of zap_pte_range), but this is the same "if (PageAnon(page))" that Sasha reported in the "hang in shmem_fallocate" thread on June 26th. I can see what it is now, and here is most of a patch (which I don't expect to satisfy Trinity yet); at this point I think I had better hand it over to Mel, to complete or to discard. [INCOMPLETE PATCH] x86,mm: fix pte_special versus pte_numa Sasha Levin has shown oopses on ea0003480048 and ea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). 
Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfff and 1, which would need special access. Commit c46a7c817e66 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page.
mm: BUG in unmap_page_range
Hi all, While fuzzing with trinity inside a KVM tools guest running the latest -next kernel, I've stumbled on the following spew: [ 2957.087977] BUG: unable to handle kernel paging request at ea0003480008 [ 2957.088008] IP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301) [ 2957.088024] PGD 7fffc6067 PUD 7fffc5067 PMD 0 [ 2957.088041] Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 2957.088087] Dumping ftrace buffer: [ 2957.088266](ftrace buffer empty) [ 2957.088279] Modules linked in: [ 2957.088293] CPU: 2 PID: 15417 Comm: trinity-c200 Not tainted 3.16.0-rc7-next-20140801-sasha-00047-gd6ce559 #990 [ 2957.088301] task: 8807a8c5 ti: 880739fb4000 task.ti: 880739fb4000 [ 2957.088320] RIP: unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301) [ 2957.088328] RSP: 0018:880739fb7c58 EFLAGS: 00010246 [ 2957.088336] RAX: RBX: 880eb2bdbed8 RCX: dfff971b4280 [ 2957.088343] RDX: 1100e73f6fc4 RSI: 7f00e85db000 RDI: ea0003480008 [ 2957.088350] RBP: 880739fb7d58 R08: 0001 R09: 00b6e000 [ 2957.088357] R10: R11: 0001 R12: ea000348 [ 2957.088365] R13: d2000700 R14: 7f00e85dc000 R15: 7f00e85db000 [ 2957.088374] FS: 7f00e85d8700() GS:88177fa0() knlGS: [ 2957.088381] CS: 0010 DS: ES: CR0: 80050033 [ 2957.088387] CR2: ea0003480008 CR3: 0007a802a000 CR4: 06a0 [ 2957.088406] DR0: DR1: DR2: [ 2957.088413] DR3: DR6: 0ff0 DR7: 0600 [ 2957.088416] Stack: [ 2957.088432] 88171726d570 0010 0008 d2000730 [ 2957.088450] 19d00250 7f00e85dc000 880f9d311900 880739fb7e20 [ 2957.088466] 8807a8c507a0 8807a8c5 8807a75fe000 8807ceaa7a10 [ 2957.088469] Call Trace: [ 2957.088490] unmap_single_vma (mm/memory.c:1348) [ 2957.088505] unmap_vmas (mm/memory.c:1375 (discriminator 3)) [ 2957.088520] unmap_region (mm/mmap.c:2386 (discriminator 4)) [ 2957.088542] ? vma_rb_erase (mm/mmap.c:454 include/linux/rbtree_augmented.h:219 include/linux/rbtree_augmented.h:227 mm/mmap.c:493) [ 2957.088559] ? 
vmacache_update (mm/vmacache.c:61) [ 2957.088572] do_munmap (mm/mmap.c:2581) [ 2957.088583] vm_munmap (mm/mmap.c:2596) [ 2957.088595] SyS_munmap (mm/mmap.c:2601) [ 2957.088616] tracesys (arch/x86/kernel/entry_64.S:541) [ 2957.088770] Code: ff ff e8 f9 5f 07 00 48 8b 45 90 80 48 18 01 4d 85 e4 0f 84 8b fe ff ff 45 84 ed 0f 85 fc 03 00 00 49 8d 7c 24 08 e8 b5 67 07 00 <41> f6 44 24 08 01 0f 84 29 02 00 00 83 6d c8 01 4c 89 e7 e8 bd All code 0: ff (bad) 1: ff e8 ljmpq * 3: f9 stc 4: 5f pop%rdi 5: 07 (bad) 6: 00 48 8badd%cl,-0x75(%rax) 9: 45 90 rex.RB xchg %eax,%r8d b: 80 48 18 01 orb$0x1,0x18(%rax) f: 4d 85 e4test %r12,%r12 12: 0f 84 8b fe ff ff je 0xfea3 18: 45 84 edtest %r13b,%r13b 1b: 0f 85 fc 03 00 00 jne0x41d 21: 49 8d 7c 24 08 lea0x8(%r12),%rdi 26: e8 b5 67 07 00 callq 0x767e0 2b:* 41 f6 44 24 08 01 testb $0x1,0x8(%r12) <-- trapping instruction 31: 0f 84 29 02 00 00 je 0x260 37: 83 6d c8 01 subl $0x1,-0x38(%rbp) 3b: 4c 89 e7mov%r12,%rdi 3e: e8 .byte 0xe8 3f: bd .byte 0xbd ... Code starting with the faulting instruction === 0: 41 f6 44 24 08 01 testb $0x1,0x8(%r12) 6: 0f 84 29 02 00 00 je 0x235 c: 83 6d c8 01 subl $0x1,-0x38(%rbp) 10: 4c 89 e7mov%r12,%rdi 13: e8 .byte 0xe8 14: bd .byte 0xbd ... [ 2957.088784] RIP unmap_page_range (mm/memory.c:1132 mm/memory.c:1256 mm/memory.c:1277 mm/memory.c:1301) [ 2957.088789] RSP [ 2957.088794] CR2: ea0003480008 Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/