Re: [PATCH 3/3] mm:fix gup_pud_range
On Sat, Sep 21, 2019 at 9:19 AM John Hubbard wrote: > > On 9/20/19 5:33 PM, Qiujun Huang wrote: > >> On 9/20/19 8:51 AM, Qiujun Huang wrote: > ... > >> It would be nice if this spelled out a little more clearly what's > >> wrong. I think you and Aneesh are saying that the entry is really > >> a swap entry, created by the MCE response to a bad page? > > do_machine_check-> > > do_memory_failure-> > > memory_failure-> > > hwpoison_user_mappings > > will updated PUD level PTE entry as a swap entry. > > > > static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > unsigned long address, void *arg) > > { > > ... > > if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) { > > pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > > OK, that helps. Let's add something approximately like this to the > commit description: > > do_machine_check() > do_memory_failure() > memory_failure() > hw_poison_user_mappings() > try_to_unmap() > pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); > > ...and now we have a swap entry that indicates that the page entry > refers to a bad (and poisoned) page of memory, but gup_fast() at this > level of the page table was ignoring swap entries, and incorrectly > assuming that "!pxd_none() == valid and present". > > And this was not just a poisoned page problem, but a generaly swap entry > problem. So, any swap entry type (device memory migration, numa migration, > or just regular swapping) could lead to the same problem. > > Fix this by checking for pxd_present(), instead of pxd_none(). > > > > ... > >> > >>> > >>> Signed-off-by: Qiujun Huang > >>> --- > >>> mm/gup.c | 2 ++ > >>> 1 file changed, 2 insertions(+) > >>> > >>> diff --git a/mm/gup.c b/mm/gup.c > >>> index 98f13ab..6157ed9 100644 > >>> --- a/mm/gup.c > >>> +++ b/mm/gup.c > >>> @@ -2230,6 +2230,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long > >>> addr, unsigned long end, > >>> next = pud_addr_end(addr, end); > >>> if (pud_none(pud)) > >>> return 0; > >>> + if (unlikely(!pud_present(pud))) > >>> + return 0; > >> > >> If the MCE hwpoison behavior puts in swap entries, then it seems like all > >> page table walkers would need to check for p*d_present(), and maybe at all > >> levels too, right? > > I think so > >> > > Should those changes be part of this fix, do you think? Yes, please.Thanks > > thanks, > -- > John Hubbard > NVIDIA
Re: [PATCH 3/3] mm:fix gup_pud_range
>On 9/20/19 8:51 AM, Qiujun Huang wrote: >> __get_user_pages_fast try to walk the page table but the >> hugepage pte is replace by hwpoison swap entry by mca path. > >I expect you mean MCE (machine check exception), rather than mca? Yeah > >> ... >> [15798.177437] mce: Uncorrected hardware memory error in >> user-access at 224f1761c0 >> [15798.180171] MCE 0x224f176: Killing pal_main:6784 due to >> hardware memory corruption >> [15798.180176] MCE 0x224f176: Killing qemu-system-x86:167336 >> due to hardware memory corruption >> ... >> [15798.180206] BUG: unable to handle kernel >> [15798.180226] paging request at 89123000 >> [15798.180236] IP: [] gup_pud_range+ >> 0x13e/0x1e0 >> ... >> >> We need to skip the hwpoison entry in gup_pud_range. > >It would be nice if this spelled out a little more clearly what's >wrong. I think you and Aneesh are saying that the entry is really >a swap entry, created by the MCE response to a bad page? do_machine_check-> do_memory_failure-> memory_failure-> hwpoison_user_mappings will updated PUD level PTE entry as a swap entry. static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { ... if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) { pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); if (PageHuge(page)) { int nr = 1 << compound_order(page); hugetlb_count_sub(nr, mm); set_huge_swap_pte_at(mm, address, pvmw.pte, pteval, vma_mmu_pagesize(vma)); } else { dec_mm_counter(mm, mm_counter(page)); set_pte_at(mm, address, pvmw.pte, pteval); } ... and, gup_pud_range will reference the pud entry. gup_pud_range->gup_pmd_range: static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { unsigned long next; pmd_t *pmdp; pmdp = pmd_offset(&pud, addr); do { pmd_t pmd = *pmdp; <--the pmdp is hwpoison swap entry. 89123000 and results in corruption ... > >> >> Signed-off-by: Qiujun Huang >> --- >> mm/gup.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index 98f13ab..6157ed9 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -2230,6 +2230,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long >> addr, unsigned long end, >> next = pud_addr_end(addr, end); >> if (pud_none(pud)) >> return 0; >> + if (unlikely(!pud_present(pud))) >> + return 0; > >If the MCE hwpoison behavior puts in swap entries, then it seems like all >page table walkers would need to check for p*d_present(), and maybe at all >levels too, right? I think so > >thanks, On Sat, Sep 21, 2019 at 3:37 AM John Hubbard wrote: > > On 9/20/19 8:51 AM, Qiujun Huang wrote: > > __get_user_pages_fast try to walk the page table but the > > hugepage pte is replace by hwpoison swap entry by mca path. > > I expect you mean MCE (machine check exception), rather than mca? > > > ... > > [15798.177437] mce: Uncorrected hardware memory error in > > user-access at 224f1761c0 > > [15798.180171] MCE 0x224f176: Killing pal_main:6784 due to > > hardware memory corruption > > [15798.180176] MCE 0x224f176: Killing qemu-system-x86:167336 > > due to hardware memory corruption > > ... > > [15798.180206] BUG: unable to handle kernel > > [15798.180226] paging request at 89123000 > > [15798.180236] IP: [] gup_pud_range+ > > 0x13e/0x1e0 > > ... > > > > We need to skip the hwpoison entry in gup_pud_range. > > It would be nice if this spelled out a little more clearly what's > wrong. I think you and Aneesh are saying that the entry is really > a swap entry, created by the MCE response to a bad page? > > > > > Signed-off-by: Qiujun Huang > > --- > > mm/gup.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index 98f13ab..6157ed9 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -2230,6 +2230,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long > > addr, unsigned long end, > > next = pud_addr_end(addr, end); > > if (pud_none(pud)) > > return 0; > > + if (unlikely(!pud_present(pud))) > > + return 0; > > If the MCE hwpoison behavior puts in swap entries, then it seems like all > page table walkers would need to check for p*d_present(), and maybe at all > levels too, right? > > thanks, > -- > John Hubbard > NVIDIA > > > > if (unlikely(pud_huge(pud))) { > > if (!gup_huge_pud(pud, pudp, addr, next, flags, > > pages, nr)) > >
Re: [PATCH 3/3] mm:fix gup_pud_range
On 9/20/19 8:51 AM, Qiujun Huang wrote: > __get_user_pages_fast try to walk the page table but the > hugepage pte is replace by hwpoison swap entry by mca path. I expect you mean MCE (machine check exception), rather than mca? > ... > [15798.177437] mce: Uncorrected hardware memory error in > user-access at 224f1761c0 > [15798.180171] MCE 0x224f176: Killing pal_main:6784 due to > hardware memory corruption > [15798.180176] MCE 0x224f176: Killing qemu-system-x86:167336 > due to hardware memory corruption > ... > [15798.180206] BUG: unable to handle kernel > [15798.180226] paging request at 89123000 > [15798.180236] IP: [] gup_pud_range+ > 0x13e/0x1e0 > ... > > We need to skip the hwpoison entry in gup_pud_range. It would be nice if this spelled out a little more clearly what's wrong. I think you and Aneesh are saying that the entry is really a swap entry, created by the MCE response to a bad page? > > Signed-off-by: Qiujun Huang > --- > mm/gup.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/gup.c b/mm/gup.c > index 98f13ab..6157ed9 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -2230,6 +2230,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, > unsigned long end, > next = pud_addr_end(addr, end); > if (pud_none(pud)) > return 0; > + if (unlikely(!pud_present(pud))) > + return 0; If the MCE hwpoison behavior puts in swap entries, then it seems like all page table walkers would need to check for p*d_present(), and maybe at all levels too, right? thanks, -- John Hubbard NVIDIA > if (unlikely(pud_huge(pud))) { > if (!gup_huge_pud(pud, pudp, addr, next, flags, > pages, nr)) >
Re: [PATCH 3/3] mm:fix gup_pud_range
On 9/20/19 9:21 PM, Qiujun Huang wrote: __get_user_pages_fast try to walk the page table but the hugepage pte is replace by hwpoison swap entry by mca path. ... Can you describe this in more details. I guess you are facing the issue with respect PUD level PTE entry that got updated by hwpoison as a swap entry. Since we don't specifically check for pud_present(), we walk the page table with wrong values and that results in corruption? [15798.177437] mce: Uncorrected hardware memory error in user-access at 224f1761c0 [15798.180171] MCE 0x224f176: Killing pal_main:6784 due to hardware memory corruption [15798.180176] MCE 0x224f176: Killing qemu-system-x86:167336 due to hardware memory corruption ... [15798.180206] BUG: unable to handle kernel [15798.180226] paging request at 89123000 [15798.180236] IP: [] gup_pud_range+ 0x13e/0x1e0 ... We need to skip the hwpoison entry in gup_pud_range. Signed-off-by: Qiujun Huang --- mm/gup.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 98f13ab..6157ed9 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2230,6 +2230,8 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, next = pud_addr_end(addr, end); if (pud_none(pud)) return 0; + if (unlikely(!pud_present(pud))) + return 0; You should be able to remove that if (pud_none(pud)) check and just keep the pud_present() check? if (unlikely(pud_huge(pud))) { if (!gup_huge_pud(pud, pudp, addr, next, flags, pages, nr))