Re: [V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
Michael Ellerman writes:

> On Tue, 2016-09-02 at 01:20:31 UTC, "Aneesh Kumar K.V" wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensures that low level hash fault handling
>> will skip this huge pte and we will handle them at upper levels.
>>
>> A recent change to pmd splitting changed the above in order to handle the
>> race between pmd split and exit_mmap. The race is explained below.
>>
>> Consider the following race:
>>
>>         CPU0                              CPU1
>> shrink_page_list()
>>   add_to_swap()
>>     split_huge_page_to_list()
>>       __split_huge_pmd_locked()
>>         pmdp_huge_clear_flush_notify()
>>         // pmd_none() == true
>>                                        exit_mmap()
>>                                          unmap_vmas()
>>                                            zap_pmd_range()
>>                                            // no action on pmd since
>>                                            // pmd_none() == true
>>         pmd_populate()
>>
>> As a result the THP will not be freed. The leak is detected by check_mm():
>>
>> BUG: Bad rss-counter state mm:880058d2e580 idx:1 val:512
>>
>> The above required us to not mark the pmd none during a pmd split.
>>
>> The fix for ppc is to clear the huge pte of _PAGE_USER, so that the low
>> level fault handling code skips this pte. At a higher level we do take the
>> ptl lock. That should serialize us against the pmd split. Once the lock is
>> acquired we check the pmd again using pmd_same. That should always
>> return false for us and hence we should retry the access. We do the
>> pmd_same check in all cases after taking the ptl with
>> THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
>> huge_pmd_set_accessed).
>>
>> Also make sure we wait for irq disable sections on other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split needs to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>
>> Acked-by: Kirill A. Shutemov
>> Signed-off-by: Aneesh Kumar K.V
>
> Applied to powerpc fixes, thanks.
>
> https://git.kernel.org/powerpc/c/9db4cd6c21535a4846b38808f3
>

Can we apply the hunk below? The reason for marking the pmd none was to
avoid clearing both _PAGE_USER and _PAGE_PRESENT on the pte. At the pmd
level, that combination used to mean a hugepd pointer. We fixed that
earlier by introducing _PAGE_PTE, but I then thought it was harmless to
mark the pmd none. Marking it none will still result in the race I
explained above, even though the window is much smaller now.

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8a00da39969..03f6e72697d0 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -694,7 +694,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                      pmd_t *pmdp)
 {
-        pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
+        pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
         /*
          * This ensures that generic code that rely on IRQ disabling
          * to prevent a parallel THP split work as expected.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)
On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> Could you check if revert of fecffad25458 helps?

I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:

[ 1851.721062] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1851.721075] failing address: TEID: 0483
[ 1851.721078] Fault in home space mode while using kernel ASCE.
[ 1851.721085] AS:00d5c007 R3:0007 S:a800 P:003d
[ 1851.721128] Oops: 0004 ilc:3 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1851.721135] Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en ib_sa ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core ib_addr ghash_s390 raid6_pq prng ecb aes_s390 mlx4_core des_s390 des_generic genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common crc_itu_t dm_mod scm_block vhost_net tun vhost eadm_sch macvtap macvlan kvm autofs4
[ 1851.721183] CPU: 7 PID: 256422 Comm: bash Not tainted 4.5.0-rc3-00058-g07923d7-dirty #178
[ 1851.721186] task: 7fbfd290 ti: 8c604000 task.ti: 8c604000
[ 1851.721189] Krnl PSW : 0704d0018000 0045d3b8 (__rb_erase_color+0x280/0x308)
[ 1851.721200]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA:3
Krnl GPRS: 0001 0020 bd07eff1
[ 1851.721205]            0027ca10 83e45898 77b61198
[ 1851.721207]            7ce1a490 bd07eff0 7ce1a548 0027ca10
[ 1851.721210]            bd07c350 bd07eff0 8c607aa8 8c607a68
[ 1851.721221] Krnl Code: 0045d3aa: e3c0d0080024   stg    %r12,8(%r13)
                          0045d3b0: b9040039       lgr    %r3,%r9
                         #0045d3b4: a53b0001       oill   %r3,1
                         >0045d3b8: e3301024       stg    %r3,0(%r1)
                          0045d3be: ec28000e007c   cgij   %r2,0,8,45d3da
                          0045d3c4: e3402004       lg     %r4,0(%r2)
                          0045d3ca: b904001c       lgr    %r1,%r12
                          0045d3ce: ec143f3f0056   rosbg  %r1,%r4,63,63,0
[ 1851.721269] Call Trace:
[ 1851.721273] ([<83e45898>] 0x83e45898)
[ 1851.721279]  [<0029342a>] unlink_anon_vmas+0x9a/0x1d8
[ 1851.721282]  [<00283f34>] free_pgtables+0xcc/0x148
[ 1851.721285]  [<0028c376>] exit_mmap+0xd6/0x300
[ 1851.721289]  [<00134db8>] mmput+0x90/0x118
[ 1851.721294]  [<002d76bc>] flush_old_exec+0x5d4/0x700
[ 1851.721298]  [<003369f4>] load_elf_binary+0x2f4/0x13e8
[ 1851.721301]  [<002d6e4a>] search_binary_handler+0x9a/0x1f8
[ 1851.721304]  [<002d8970>] do_execveat_common.isra.32+0x668/0x9a0
[ 1851.721307]  [<002d8cec>] do_execve+0x44/0x58
[ 1851.721310]  [<002d8f92>] SyS_execve+0x3a/0x48
[ 1851.721315]  [<006fb096>] system_call+0xd6/0x258
[ 1851.721317]  [<03ff997436d6>] 0x3ff997436d6
[ 1851.721319] INFO: lockdep is turned off.
[ 1851.721321] Last Breaking-Event-Address:
[ 1851.721323]  [<0045d31a>] __rb_erase_color+0x1e2/0x308
[ 1851.721327]
[ 1851.721329] ---[ end trace 0d80041ac00cfae2 ]---

>
> And could you share what the crashes look like? I haven't seen backtraces yet.
>

Sure. I didn't, because they really looked random to me. Most of the time
they were in RCU or list debugging code, but I assumed those were just the
messengers that first observed a corruption. Anyhow, here is an older one
that might look interesting:

[ 59.851421] list_del corruption. next->prev should be 6e1eb000, but was 0400
[ 59.851469] [ cut here ]
[ 59.851472] WARNING: at lib/list_debug.c:71
[ 59.851475] Modules linked in: bridge stp llc btrfs xor mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad ib_core ib_addr ghash_s390 prng raid6_pq ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 mlx4_core sha_common genwqe_card scm_block crc_itu_t vhost_net tun vhost dm_mod macvtap eadm_sch macvlan kvm autofs4
[ 59.851532] CPU: 0 PID: 5400 Comm: git Not tainted 4.4.0-07794-ga4eff16-dirty #77
[ 59.851535] task: d231 ti: d661 task.ti: d661
[ 59.851539] Krnl PSW : 0704c0018000 00487434 (__list_del_entry+0xa4/0xe0)
[ 59.851548]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 01a7a1cf d231 0054 0001
[ 59.851554]            00487430 774e6900
[ 59.851557]            03ff5300 6d4017a0 03ff52f0 03ff52f0
[ 59.851560]            03d10178 6e1eb000 00487430 d6613b00
[ 59.851571] Krnl C
Re: [PATCH V2 00/29] Book3s abstraction in preparation for new MMU model
On 2/13/16, Aneesh Kumar K.V wrote:
> Paul Mackerras writes:
>
>> On Mon, Feb 08, 2016 at 02:50:12PM +0530, Aneesh Kumar K.V wrote:
>>> Hello,
>>>
>>> This is a large series, mostly consisting of code movement. No new
>>> features are added in this series. The changes are done to accommodate
>>> the upcoming new memory model in future powerpc chips. The details of
>>> the new MMU model can be found at
>>>
>>> http://ibm.biz/power-isa3 (Needs registration). I am including a summary
>>> of the changes below.

It's not a good idea to put your changes somewhere and require people to
register to be able to download them. It just complicates testing such a
large set of changes.

>>
>> This series doesn't seem to apply against either v4.4 or Linus'
>> current master. What is this patch against?
>>
>
> The patchset has dependencies on other patchsets posted to the
> list. The best option is to pull the branch mentioned instead of trying to
> apply them individually.
>
> -aneesh