Re: Multiple potential races on vma->vm_flags
On Wed, Oct 14, 2015 at 12:33 AM, Hugh Dickins wrote:
> I think I've found the answer to that at last: we were indeed
> all looking in the wrong direction.  Your ktsan tree shows
>
> static __always_inline int atomic_add_negative(int i, atomic_t *v)
> {
> #ifndef CONFIG_KTSAN
> 	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
> #else
> 	return (ktsan_atomic32_fetch_add((void *)v, i,
> 		ktsan_memory_order_acq_rel) + i) < 0;
> #endif
> }
>
> but ktsan_atomic32_fetch_add() returns u32: so it looks like
> your implementation of atomic_add_negative() always returns 0,
> and page_remove_file_rmap() never calls clear_page_mlock(), as
> it ought when an Mlocked page has been truncated or punched out.
>
> /proc/meminfo gives you crazy AnonPages and Mapped too, yes?

Yes, you're correct, there was a bug in the KTSAN annotations.
Thank you for finding this. Fixing that bug fixes the bad page
reports. Sorry for troubling you with this.

The race reports are still there though.

>> It seems that your patch doesn't fix the race from the report below,
>> since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
>> (mlock.c:425) is being executed. (Line numbers are from kernel with
>> your patch applied.)
>
> I was not trying to "fix" that with my patch, because I couldn't find
> any problem with the way it reads vm_flags there; I can't even see any
> need for READ_ONCE or more barriers, we have sufficient locking already.
>
> Sure, try_to_unmap_one() may read vm_flags an instant before or after
> a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED;
> but the syscalls (or exit) then work their way up the address space to
> establish the final state, no problem.
> But I am glad you drew attention to the inadequacy of the
> down_read_trylock(mmap_sem) in try_to_unmap_one(), and since posting
> that patch (doing the mlock_vma_page under pt lock instead), I have
> identified one case that it would fix - though it clearly wasn't
> involved in your stacktrace (it's a race with truncating COWed pages,
> but your trace was holepunching, which leaves the COWs alone).
>
> I'll go forward with that patch, but it rather falls into a series
> I was preparing, must finish up all their comments before posting.
>
> Hugh
>
>> ===
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2:
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>>  [] mmput+0x65/0x190 kernel/fork.c:708
>>  [< inline >] exit_mm kernel/exit.c:437
>>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>>
>> Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1:
>>  [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
>>  [< inline >] rmap_walk_file mm/rmap.c:1522
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1541
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>>  [] mmput+0x65/0x190 kernel/fork.c:708
>>  [< inline >] exit_mm kernel/exit.c:437
>>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>> ===
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description.  I'd love to lose that down_read_trylock.
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> >
> > It seems that your patch doesn't fix the race from the report below,
> > since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
> > (mlock.c:425) is being executed. (Line numbers are from kernel with
> > your patch applied.)
>
> I've fired up my HZ_1 patch, and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
>
> Here's a fresh backtrace from my VMs:
>
> [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200
> [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x10(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  16445448 e5dca494 8803f7657708 a70402da
> [1935109.893504]  ea000f288000 8803f7657738 a56e522b 022f80144008
> [1935109.894947]  ea000f288020 ea000f288000 8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102] [] dump_stack+0x4e/0x84
> [1935109.899821] [] bad_page+0x17b/0x210
> [1935109.900469] [] free_pages_prepare+0xb48/0x1110
> [1935109.902127] [] __free_pages_ok+0x21/0x260
> [1935109.904435] [] free_compound_page+0x63/0x80
> [1935109.905614] [] free_transhuge_page+0x6e/0x80

free_transhuge_page belongs to Kirill's THP refcounting patchset,
it's not in 4.3-rc or 4.3.0-rc2-next-20150923 or mmotm.

Well worth testing, thank you, but please make it clear what you are
testing: I'll not spend longer on this one, not at this time.

Hugh

> [1935109.906752] [] __put_compound_page+0x76/0xa0
> [1935109.907884] [] release_pages+0x4d5/0x9f0
> [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957] [] unmap_page_range+0xe73/0x1460
> [1935109.915737] [] unmap_single_vma+0x126/0x2f0
> [1935109.916646] [] unmap_vmas+0xdd/0x190
> [1935109.917454] [] exit_mmap+0x221/0x430
> [1935109.921176] [] mmput+0xb1/0x240
> [1935109.921919] [] do_exit+0x732/0x27c0
> [1935109.928561] [] do_group_exit+0xf9/0x300
> [1935109.929786] [] SyS_exit_group+0x1d/0x20
> [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a
>
> Thanks,
> Sasha
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Andrey Konovalov wrote:
> On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> > This is totally untested, and one of you may quickly prove me wrong;
> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> > then couldn't see why it would need mmap_sem at all, given how mlock
> > and munlock first assert intention by setting or clearing VM_LOCKED
> > in vm_flags, then work their way up the vma, taking pte locks.
> >
> > Calling mlock_vma_page() under pte lock may look suspicious
> > at first: but what it does is similar to clear_page_mlock(),
> > which we regularly call under pte lock from page_remove_rmap().
> >
> > I'd rather wait to hear whether this appears to work in practice,
> > and whether you agree that it should work in theory, before writing
> > the proper description.  I'd love to lose that down_read_trylock.
>
> No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".

I think I've found the answer to that at last: we were indeed
all looking in the wrong direction.  Your ktsan tree shows

static __always_inline int atomic_add_negative(int i, atomic_t *v)
{
#ifndef CONFIG_KTSAN
	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
#else
	return (ktsan_atomic32_fetch_add((void *)v, i,
		ktsan_memory_order_acq_rel) + i) < 0;
#endif
}

but ktsan_atomic32_fetch_add() returns u32: so it looks like
your implementation of atomic_add_negative() always returns 0,
and page_remove_file_rmap() never calls clear_page_mlock(), as
it ought when an Mlocked page has been truncated or punched out.

/proc/meminfo gives you crazy AnonPages and Mapped too, yes?

> It seems that your patch doesn't fix the race from the report below,
> since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
> (mlock.c:425) is being executed. (Line numbers are from kernel with
> your patch applied.)
I was not trying to "fix" that with my patch, because I couldn't find
any problem with the way it reads vm_flags there; I can't even see any
need for READ_ONCE or more barriers, we have sufficient locking already.

Sure, try_to_unmap_one() may read vm_flags an instant before or after
a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED;
but the syscalls (or exit) then work their way up the address space to
establish the final state, no problem.

But I am glad you drew attention to the inadequacy of the
down_read_trylock(mmap_sem) in try_to_unmap_one(), and since posting
that patch (doing the mlock_vma_page under pt lock instead), I have
identified one case that it would fix - though it clearly wasn't
involved in your stacktrace (it's a race with truncating COWed pages,
but your trace was holepunching, which leaves the COWs alone).

I'll go forward with that patch, but it rather falls into a series
I was preparing, must finish up all their comments before posting.

Hugh

> ===
> ThreadSanitizer: data-race in munlock_vma_pages_range
>
> Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2:
>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [] mmput+0x65/0x190 kernel/fork.c:708
>  [< inline >] exit_mm kernel/exit.c:437
>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>
> Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1:
>  [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
>  [< inline >] rmap_walk_file mm/rmap.c:1522
>  [] rmap_walk+0x147/0x450 mm/rmap.c:1541
>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [] mmput+0x65/0x190 kernel/fork.c:708
>  [< inline >] exit_mm kernel/exit.c:437
>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
> ===
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 08:42:26PM -0400, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description.  I'd love to lose that down_read_trylock.
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> >
> > It seems that your patch doesn't fix the race from the report below,
> > since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
> > (mlock.c:425) is being executed. (Line numbers are from kernel with
> > your patch applied.)
>
> I've fired up my HZ_1 patch,

Can we get the HZ_1 thing into upstream? Under KERNEL_DEBUG, or something?

> and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
>
> Here's a fresh backtrace from my VMs:
>
> [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200
> [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x10(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  16445448 e5dca494 8803f7657708 a70402da
> [1935109.893504]  ea000f288000 8803f7657738 a56e522b 022f80144008
> [1935109.894947]  ea000f288020 ea000f288000 8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102] [] dump_stack+0x4e/0x84
> [1935109.899821] [] bad_page+0x17b/0x210
> [1935109.900469] [] free_pages_prepare+0xb48/0x1110
> [1935109.902127] [] __free_pages_ok+0x21/0x260
> [1935109.904435] [] free_compound_page+0x63/0x80
> [1935109.905614] [] free_transhuge_page+0x6e/0x80
> [1935109.906752] [] __put_compound_page+0x76/0xa0
> [1935109.907884] [] release_pages+0x4d5/0x9f0
> [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957] [] unmap_page_range+0xe73/0x1460
> [1935109.915737] [] unmap_single_vma+0x126/0x2f0
> [1935109.916646] [] unmap_vmas+0xdd/0x190
> [1935109.917454] [] exit_mmap+0x221/0x430
> [1935109.921176] [] mmput+0xb1/0x240
> [1935109.921919] [] do_exit+0x732/0x27c0
> [1935109.928561] [] do_group_exit+0xf9/0x300
> [1935109.929786] [] SyS_exit_group+0x1d/0x20
> [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a

Would it make any difference if you add mmap_sem protection in exit_mmap?

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/23, Sasha Levin wrote:
>
> Another similar trace where we see a problem during process exit:
>
> [1922964.887922] kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [1922964.890234] Modules linked in:
> [1922964.890844] CPU: 1 PID: 21477 Comm: trinity-c161 Tainted: G W 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1922964.892584] task: 880251858000 ti: 88009f258000 task.ti: 88009f258000
> [1922964.893723] RIP: acct_collect (kernel/acct.c:542)

 530	void acct_collect(long exitcode, int group_dead)
 531	{
 532		struct pacct_struct *pacct = &current->signal->pacct;
 533		cputime_t utime, stime;
 534		unsigned long vsize = 0;
 535
 536		if (group_dead && current->mm) {
 537			struct vm_area_struct *vma;
 538
 539			down_read(&current->mm->mmap_sem);
 540			vma = current->mm->mmap;
 541			while (vma) {
 542				vsize += vma->vm_end - vma->vm_start;	// <-- RIP
 543				vma = vma->vm_next;
 544			}
 545			up_read(&current->mm->mmap_sem);
 546		}

> [1922964.895105] RSP: :88009f25f908 EFLAGS: 00010207
> [1922964.895935] RAX: dc00 RBX: 2ce0 RCX:
> [1922964.897008] RDX: 2152b153 RSI: 059c2000 RDI: 2ce10007
> [1922964.898091] RBP: 88009f25f9e8 R08: 0001 R09: 03ef
> [1922964.899178] R10: ed014d7a3a01 R11: 0001 R12: 880082b485c0
> [1922964.901643] R13: 2152b153 R14: 110013e4bf24 R15: 88009f25f9c0

...

>    0:	02 00                	add    (%rax),%al
>    2:	0f 85 9d 05 00 00    	jne    0x5a5
>    8:	48 8b 1b             	mov    (%rbx),%rbx
>    b:	48 85 db             	test   %rbx,%rbx
>    e:	0f 84 7b 05 00 00    	je     0x58f

Probably "mov (%rbx),%rbx" is "vma = mm->mmap",

>   14:	48 b8 00 00 00 00 00 	movabs $0xdc00,%rax
>   1b:	fc ff df
>   1e:	31 d2                	xor    %edx,%edx
>   20:	48 8d 7b 08          	lea    0x8(%rbx),%rdi

and this loads the addr of vma->vm_end for kasan,

>   24:	48 89 fe             	mov    %rdi,%rsi
>   27:	48 c1 ee 03          	shr    $0x3,%rsi
>   2b:*	80 3c 06 00          	cmpb   $0x0,(%rsi,%rax,1)	<-- trapping instruction

which reports the error. But in this case this is not a NULL deref;
note that $rbx = 2ce0 and this is below __PAGE_OFFSET (but above
TASK_SIZE_MAX). It seems it is not even canonical. In any case this
odd value can't be valid. Again, looks like the mm->mmap pointer was
corrupted.

Perhaps you can re-test with the stupid patch below. But unlikely it
will help. If mm was freed we would probably see something else.

Oleg.
---

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -672,6 +672,7 @@ struct mm_struct *mm_alloc(void)
 void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
+	BUG_ON(atomic_read(&mm->mm_users));
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
Re: Multiple potential races on vma->vm_flags
On 09/24, Andrey Ryabinin wrote:
>
> 2015-09-24 20:26 GMT+03:00 Oleg Nesterov:
> > On 09/24, Sasha Levin wrote:
> >>
> >> void unmap_vmas(struct mmu_gather *tlb,
> >> 		struct vm_area_struct *vma, unsigned long start_addr,
> >> 		unsigned long end_addr)
> >> {
> >> 	struct mm_struct *mm = vma->vm_mm;
> >>
> >> 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> >> 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >> 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
> >> 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> >> }
> >
> > And I do not see any dereference at this line,
>
> I noticed that addr2line sometimes doesn't work reliably on
> compiler-instrumented code. I've seen it a couple of times point
> to the next line of code.

Yes, I know that we can't trust it. That is why I think (at least in this
particular case) function+offset would be more helpful. And we probably need
more asm.

> >> >>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
> >> >>  6:	85 22                   test   %esp,(%rdx)
> >> >>  8:	01 00                   add    %eax,(%rax)
> >> >>  a:	00 48 8b                add    %cl,-0x75(%rax)
> >> >>  d:	43                      rex.XB
> >> >>  e:	40                      rex
> >> >>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> >> >> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
> >> >> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
> >> >> 21:	fc ff df
> >> >> 24:	48 89 fa                mov    %rdi,%rdx
> >> >> 27:	48 c1 ea 03             shr    $0x3,%rdx
> >> >> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
> >> >> 2f:	0f 85 ee 00 00 00       jne    0x123
> >> >> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
> >> >> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> >> >> 40:	00
> >> >
> >> > And I do not see anything similar in "objdump -d". So could you at least
> >> > show mm/memory.c:1337 in your tree?
> >> >
> >> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
> >> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> >> > (I can't check, my gcc is very old). Or what?
> >>
> >> This is indeed kasan code. 0xdc00 is the shadow base, and you see
> >> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.
> >
> > and thus this asm can't help, right?
>
> I think it can.
>
> > So how can we figure out where exactly the kernel hits NULL? And what
> > exactly does it try to dereference?
>
> So we tried to dereference 0x4c8. That 0x4c8 is probably an offset in some
> struct. The only big struct here is mm_struct. So I think that we tried to
> dereference a NULL mm, and this asm:
>
> 	cmpq $0x0,0x4c8(%rax)
>
> is likely from the inlined mm_has_notifiers():
>
> 	static inline int mm_has_notifiers(struct mm_struct *mm)
> 	{
> 		return unlikely(mm->mmu_notifier_mm);
> 	}

Looks reasonable... Thanks.

I was going to say that this is impossible because the caller should have
crashed if ->mm == NULL. But unmap_vmas() uses mm = vma->vm_mm, so it looks
like this vma or mm->mmap was corrupted...

Oleg.
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 08:42:26PM -0400, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description. I'd love to lose that down_read_trylock.
> >
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> >
> > It seems that your patch doesn't fix the race from the report below, since
> > pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
> > is being executed. (Line numbers are from a kernel with your patch applied.)
>
> I've fired up my HZ_1 patch,

Can we get the HZ_1 thing into upstream? Under KERNEL_DEBUG, or something?

> and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
> Here's a fresh backtrace from my VMs:
>
> [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200
> [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x10(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  16445448 e5dca494 8803f7657708 a70402da
> [1935109.893504]  ea000f288000 8803f7657738 a56e522b 022f80144008
> [1935109.894947]  ea000f288020 ea000f288000  8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102]  [] dump_stack+0x4e/0x84
> [1935109.899821]  [] bad_page+0x17b/0x210
> [1935109.900469]  [] free_pages_prepare+0xb48/0x1110
> [1935109.902127]  [] __free_pages_ok+0x21/0x260
> [1935109.904435]  [] free_compound_page+0x63/0x80
> [1935109.905614]  [] free_transhuge_page+0x6e/0x80
> [1935109.906752]  [] __put_compound_page+0x76/0xa0
> [1935109.907884]  [] release_pages+0x4d5/0x9f0
> [1935109.913027]  [] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957]  [] unmap_page_range+0xe73/0x1460
> [1935109.915737]  [] unmap_single_vma+0x126/0x2f0
> [1935109.916646]  [] unmap_vmas+0xdd/0x190
> [1935109.917454]  [] exit_mmap+0x221/0x430
> [1935109.921176]  [] mmput+0xb1/0x240
> [1935109.921919]  [] do_exit+0x732/0x27c0
> [1935109.928561]  [] do_group_exit+0xf9/0x300
> [1935109.929786]  [] SyS_exit_group+0x1d/0x20
> [1935109.930617]  [] entry_SYSCALL_64_fastpath+0x16/0x7a

Would it make any difference if you add mmap_sem protection in exit_mmap()?

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/24/2015 02:52 PM, Andrey Ryabinin wrote:
> Sasha, could you confirm that in your kernel the mmu_notifier_mm field has
> offset 0x4c8?
> I would use gdb for that:
> 	gdb vmlinux
> 	(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)

(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
$1 = 0x4c8

Thanks,
Sasha
Re: Multiple potential races on vma->vm_flags
2015-09-24 20:26 GMT+03:00 Oleg Nesterov:
> On 09/24, Sasha Levin wrote:
>>
>> On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
>> >
>> > Well, I know absolutely nothing about kasan, to the point I can't even
>> > understand where this message comes from. grep didn't help. But this
>> > doesn't matter...
>>
>> The reason behind this message is that NULL ptr derefs when using kasan
>> are manifested as GPFs. This is because in order to validate an access
>> to a given memory address kasan would check (shadow_base + (mem_offset >> 3)),
>> so in the case of a NULL it would try to access shadow_base + 0, which
>> would GPF.
>
> OK, so this just means the kernel derefs the NULL pointer,
>
>> I'm running -next + Kirill's THP patchset.
>>
>> void unmap_vmas(struct mmu_gather *tlb,
>> 		struct vm_area_struct *vma, unsigned long start_addr,
>> 		unsigned long end_addr)
>> {
>> 	struct mm_struct *mm = vma->vm_mm;
>>
>> 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
>> 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>> 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
>> 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
>> }
>
> And I do not see any dereference at this line,

I noticed that addr2line sometimes doesn't work reliably on
compiler-instrumented code. I've seen it a couple of times point to the
next line of code.

>> >>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>> >>  6:	85 22                   test   %esp,(%rdx)
>> >>  8:	01 00                   add    %eax,(%rax)
>> >>  a:	00 48 8b                add    %cl,-0x75(%rax)
>> >>  d:	43                      rex.XB
>> >>  e:	40                      rex
>> >>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>> >> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
>> >> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
>> >> 21:	fc ff df
>> >> 24:	48 89 fa                mov    %rdi,%rdx
>> >> 27:	48 c1 ea 03             shr    $0x3,%rdx
>> >> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
>> >> 2f:	0f 85 ee 00 00 00       jne    0x123
>> >> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
>> >> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>> >> 40:	00
>> >
>> > And I do not see anything similar in "objdump -d". So could you at least
>> > show mm/memory.c:1337 in your tree?
>> >
>> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
>> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
>> > (I can't check, my gcc is very old). Or what?
>>
>> This is indeed kasan code. 0xdc00 is the shadow base, and you see
>> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.
>
> and thus this asm can't help, right?

I think it can.

> So how can we figure out where exactly the kernel hits NULL? And what
> exactly does it try to dereference?

So we tried to dereference 0x4c8. That 0x4c8 is probably an offset in some
struct. The only big struct here is mm_struct. So I think that we tried to
dereference a NULL mm, and this asm:

	cmpq $0x0,0x4c8(%rax)

is likely from the inlined mm_has_notifiers():

	static inline int mm_has_notifiers(struct mm_struct *mm)
	{
		return unlikely(mm->mmu_notifier_mm);
	}

Sasha, could you confirm that in your kernel the mmu_notifier_mm field has
offset 0x4c8? I would use gdb for that:

	gdb vmlinux
	(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
Re: Multiple potential races on vma->vm_flags
On 09/24, Sasha Levin wrote:
>
> On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
> >
> > Well, I know absolutely nothing about kasan, to the point I can't even
> > understand where this message comes from. grep didn't help. But this
> > doesn't matter...
>
> The reason behind this message is that NULL ptr derefs when using kasan are
> manifested as GPFs. This is because in order to validate an access to a given
> memory address kasan would check (shadow_base + (mem_offset >> 3)), so in the
> case of a NULL it would try to access shadow_base + 0, which would GPF.

OK, so this just means the kernel derefs the NULL pointer,

> I'm running -next + Kirill's THP patchset.
>
> void unmap_vmas(struct mmu_gather *tlb,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		unsigned long end_addr)
> {
> 	struct mm_struct *mm = vma->vm_mm;
>
> 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
> 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> }

And I do not see any dereference at this line,

> >>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
> >>  6:	85 22                   test   %esp,(%rdx)
> >>  8:	01 00                   add    %eax,(%rax)
> >>  a:	00 48 8b                add    %cl,-0x75(%rax)
> >>  d:	43                      rex.XB
> >>  e:	40                      rex
> >>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> >> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
> >> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
> >> 21:	fc ff df
> >> 24:	48 89 fa                mov    %rdi,%rdx
> >> 27:	48 c1 ea 03             shr    $0x3,%rdx
> >> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
> >> 2f:	0f 85 ee 00 00 00       jne    0x123
> >> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
> >> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> >> 40:	00
> >
> > And I do not see anything similar in "objdump -d". So could you at least
> > show mm/memory.c:1337 in your tree?
> >
> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> > (I can't check, my gcc is very old). Or what?
>
> This is indeed kasan code. 0xdc00 is the shadow base, and you see
> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.

and thus this asm can't help, right?

So how can we figure out where exactly the kernel hits NULL? And what
exactly does it try to dereference?

> I hope the information above helped, please let me know if it didn't and you
> need anything else.

Thanks a lot, it does help, but I am still confused. It looks like
"function+offset" is more useful than the line numbers; at least we could
look at mm/memory.s.

Oleg.
Re: Multiple potential races on vma->vm_flags
On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
> On 09/15, Sasha Levin wrote:
>>
>> I've modified my tests to stress the exit path of processes with many vmas,
>> and hit the following NULL ptr deref (not sure if it's related to the
>> original issue):
>
> I am shy to ask. Looks like I am the only stupid one who needs
> more info...
>
>> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>
> Well, I know absolutely nothing about kasan, to the point I can't even
> understand where this message comes from. grep didn't help. But this
> doesn't matter...

The reason behind this message is that NULL ptr derefs when using kasan are
manifested as GPFs. This is because in order to validate an access to a given
memory address kasan would check (shadow_base + (mem_offset >> 3)), so in the
case of a NULL it would try to access shadow_base + 0, which would GPF.

>> [1181047.937223] Modules linked in:
>> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
>> [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0
>> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)
>
> I do not know which tree/branch you use. In Linus's tree mm/memory.c:1337 is
>
> 	struct mm_struct *mm = vma->vm_mm;

I'm running -next + Kirill's THP patchset.

void unmap_vmas(struct mmu_gather *tlb,
		struct vm_area_struct *vma, unsigned long start_addr,
		unsigned long end_addr)
{
	struct mm_struct *mm = vma->vm_mm;

	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}

> but this doesn't match the asm below,
>
>>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>>  6:	85 22                   test   %esp,(%rdx)
>>  8:	01 00                   add    %eax,(%rax)
>>  a:	00 48 8b                add    %cl,-0x75(%rax)
>>  d:	43                      rex.XB
>>  e:	40                      rex
>>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
>> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
>> 21:	fc ff df
>> 24:	48 89 fa                mov    %rdi,%rdx
>> 27:	48 c1 ea 03             shr    $0x3,%rdx
>> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
>> 2f:	0f 85 ee 00 00 00       jne    0x123
>> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
>> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>> 40:	00
>
> And I do not see anything similar in "objdump -d". So could you at least
> show mm/memory.c:1337 in your tree?
>
> Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
> like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> (I can't check, my gcc is very old). Or what?

This is indeed kasan code. 0xdc00 is the shadow base, and you see
kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.

Looking at the assembly, the address we were trying to access was:

	RDI: 04c8

> Any chance you can tell us where exactly we hit NULL-deref in unmap_vmas?

I hope the information above helped, please let me know if it didn't and you
need anything else.

Thanks,
Sasha
Re: Multiple potential races on vma->vm_flags
On 09/15, Sasha Levin wrote:
>
> I've modified my tests to stress the exit path of processes with many vmas,
> and hit the following NULL ptr deref (not sure if it's related to the
> original issue):

I am shy to ask. Looks like I am the only stupid one who needs
more info...

> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN

Well, I know absolutely nothing about kasan, to the point I can't even
understand where this message comes from. grep didn't help. But this
doesn't matter...

> [1181047.937223] Modules linked in:
> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
> [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0
> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)

I do not know which tree/branch you use. In Linus's tree mm/memory.c:1337 is

	struct mm_struct *mm = vma->vm_mm;

but this doesn't match the asm below,

>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>  6:	85 22                   test   %esp,(%rdx)
>  8:	01 00                   add    %eax,(%rax)
>  a:	00 48 8b                add    %cl,-0x75(%rax)
>  d:	43                      rex.XB
>  e:	40                      rex
>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
> 21:	fc ff df
> 24:	48 89 fa                mov    %rdi,%rdx
> 27:	48 c1 ea 03             shr    $0x3,%rdx
> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
> 2f:	0f 85 ee 00 00 00       jne    0x123
> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> 40:	00

And I do not see anything similar in "objdump -d". So could you at least
show mm/memory.c:1337 in your tree?

Hmm. movabs $0xdc00,%rax above looks suspicious; this looks like
kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
(I can't check, my gcc is very old.) Or what?

Any chance you can tell us where exactly we hit the NULL deref in unmap_vmas?

Oleg.
Re: Multiple potential races on vma->vm_flags
On 09/24, Sasha Levin wrote: > > On 09/24/2015 09:11 AM, Oleg Nesterov wrote: > > > > Well, I know absolutely nothing about kasan, to the point I can't even > > unserstand where does this message come from. grep didn't help. But this > > doesn't matter... > > The reason behind this message is that NULL ptr derefs when using kasan are > manifested as GFPs. This is because in order to validate an access to a given > memory address kasan would check (shadow_base + (mem_offset >> 3)), so in the > case of > a NULL it would try to access shadow_base + 0, which would GFP. OK, so this just means the kernele derefs the NULL pointer, > I'm running -next + Kirill's THP patchset. > > > struct mm_struct *mm = vma->vm_mm; > > void unmap_vmas(struct mmu_gather *tlb, > struct vm_area_struct *vma, unsigned long start_addr, > unsigned long end_addr) > { > struct mm_struct *mm = vma->vm_mm; > > mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) > unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- > this > mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); > } And I do not see any dereference at this line, > >>0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) > >>6: 85 22 test %esp,(%rdx) > >>8: 01 00 add%eax,(%rax) > >>a: 00 48 8badd%cl,-0x75(%rax) > >>d: 43 rex.XB > >>e: 40 rex > >>f: 48 8d b8 c8 04 00 00lea0x4c8(%rax),%rdi > >> 16: 48 89 45 d0 mov%rax,-0x30(%rbp) > >> 1a: 48 b8 00 00 00 00 00movabs $0xdc00,%rax > >> 21: fc ff df > >> 24: 48 89 famov%rdi,%rdx > >> 27: 48 c1 ea 03 shr$0x3,%rdx > >> 2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) > >> <-- trapping instruction > >> 2f: 0f 85 ee 00 00 00 jne0x123 > >> 35: 48 8b 45 d0 mov-0x30(%rbp),%rax > >> 39: 48 83 b8 c8 04 00 00cmpq $0x0,0x4c8(%rax) > >> 40: 00 > > > > And I do not see anything similar in "objdump -d". So could you at least > > show mm/memory.c:1337 in your tree? > > > > Hmm. 
movabs $0xdc00,%rax above looks suspicious, this looks > > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan? > > (I can't check, my gcc is very old). Or what? > > This is indeed kasan code. 0xdc00 is the shadow base, and you see > kasan trying to access shadow base + (ptr >> 3), which is why we get GPF. and thus this asm can't help, right? So how can we figure out where exactly the kernel hits NULL? And what exactly does it try to dereference? > I hope the information above helped, please let me know if it didn't and you > need anything else. Thanks a lot, it does help, but I am still confused. It looks like "function+offset" is more useful than the line numbers; at least we could look at mm/memory.s. Oleg.
Re: Multiple potential races on vma->vm_flags
2015-09-24 20:26 GMT+03:00 Oleg Nesterov: > On 09/24, Sasha Levin wrote: >> >> On 09/24/2015 09:11 AM, Oleg Nesterov wrote: >> > >> > Well, I know absolutely nothing about kasan, to the point I can't even >> > understand where does this message come from. grep didn't help. But this >> > doesn't matter... >> >> The reason behind this message is that NULL ptr derefs when using kasan are >> manifested as GPFs. This is because in order to validate an access to a given >> memory address kasan would check (shadow_base + (mem_offset >> 3)), so in >> the case of >> a NULL it would try to access shadow_base + 0, which would GPF. > > OK, so this just means the kernel derefs the NULL pointer, > >> I'm running -next + Kirill's THP patchset. >> >> > struct mm_struct *mm = vma->vm_mm; >> >> void unmap_vmas(struct mmu_gather *tlb, >> struct vm_area_struct *vma, unsigned long start_addr, >> unsigned long end_addr) >> { >> struct mm_struct *mm = vma->vm_mm; >> >> mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); >> for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) >> unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- >> this >> mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); >> } > > And I do not see any dereference at this line, > I've noticed that addr2line sometimes doesn't work reliably on compiler-instrumented code. I've seen a couple of times that it points to the next line of code. 
>> >>0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) >> >>6: 85 22 test %esp,(%rdx) >> >>8: 01 00 add %eax,(%rax) >> >>a: 00 48 8b add %cl,-0x75(%rax) >> >>d: 43 rex.XB >> >>e: 40 rex >> >>f: 48 8d b8 c8 04 00 00 lea 0x4c8(%rax),%rdi >> >> 16: 48 89 45 d0 mov %rax,-0x30(%rbp) >> >> 1a: 48 b8 00 00 00 00 00 movabs $0xdc00,%rax >> >> 21: fc ff df >> >> 24: 48 89 fa mov %rdi,%rdx >> >> 27: 48 c1 ea 03 shr $0x3,%rdx >> >> 2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) >> >> <-- trapping instruction >> >> 2f: 0f 85 ee 00 00 00 jne 0x123 >> >> 35: 48 8b 45 d0 mov -0x30(%rbp),%rax >> >> 39: 48 83 b8 c8 04 00 00 cmpq $0x0,0x4c8(%rax) >> >> 40: 00 >> > >> > And I do not see anything similar in "objdump -d". So could you at least >> > show mm/memory.c:1337 in your tree? >> > >> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks >> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan? >> > (I can't check, my gcc is very old). Or what? >> >> This is indeed kasan code. 0xdc00 is the shadow base, and you see >> kasan trying to access shadow base + (ptr >> 3), which is why we get GPF. > > and thus this asm can't help, right? > I think it can. > So how can we figure out where exactly the kernel hits NULL? And what > exactly does it try to dereference? So we tried to dereference 0x4c8. That 0x4c8 is probably an offset into some struct. The only big struct here is mm_struct. So I think that we tried to dereference a NULL mm, and this asm: > cmpq $0x0,0x4c8(%rax) is likely from inlined mm_has_notifiers(): static inline int mm_has_notifiers(struct mm_struct *mm) { return unlikely(mm->mmu_notifier_mm); } Sasha, could you confirm that in your kernel mmu_notifier_mm field has 0x4c8 offset? 
I would use gdb for that: gdb vmlinux (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
Re: Multiple potential races on vma->vm_flags
On 09/24/2015 02:52 PM, Andrey Ryabinin wrote: > Sasha, could you confirm that in your kernel mmu_notifier_mm field has > 0x4c8 offset? > I would use gdb for that: > gdb vmlinux > (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm) (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm) $1 = 0x4c8 Thanks, Sasha
Re: Multiple potential races on vma->vm_flags
On 09/23/2015 09:08 AM, Andrey Konovalov wrote: > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote: >> > This is totally untested, and one of you may quickly prove me wrong; >> > but I went in to fix your "Bad page state (mlocked)" by holding pte >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), >> > then couldn't see why it would need mmap_sem at all, given how mlock >> > and munlock first assert intention by setting or clearing VM_LOCKED >> > in vm_flags, then work their way up the vma, taking pte locks. >> > >> > Calling mlock_vma_page() under pte lock may look suspicious >> > at first: but what it does is similar to clear_page_mlock(), >> > which we regularly call under pte lock from page_remove_rmap(). >> > >> > I'd rather wait to hear whether this appears to work in practice, >> > and whether you agree that it should work in theory, before writing >> > the proper description. I'd love to lose that down_read_trylock. > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". > > It seems that your patch doesn't fix the race from the report below, since pte > lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) > is being executed. (Line numbers are from kernel with your patch applied.) I've fired up my HZ_1 patch, and this seems to be a real race that is somewhat easy to reproduce under those conditions. 
Here's a fresh backtrace from my VMs: [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200 [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0 [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked) [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [1935109.888197] bad because of flags: [1935109.888759] flags: 0x10(mlocked) [1935109.889525] Modules linked in: [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569 [1935109.891876] 16445448 e5dca494 8803f7657708 a70402da [1935109.893504] ea000f288000 8803f7657738 a56e522b 022f80144008 [1935109.894947] ea000f288020 ea000f288000 8803f76577a8 [1935109.896413] Call Trace: [1935109.899102] [] dump_stack+0x4e/0x84 [1935109.899821] [] bad_page+0x17b/0x210 [1935109.900469] [] free_pages_prepare+0xb48/0x1110 [1935109.902127] [] __free_pages_ok+0x21/0x260 [1935109.904435] [] free_compound_page+0x63/0x80 [1935109.905614] [] free_transhuge_page+0x6e/0x80 [1935109.906752] [] __put_compound_page+0x76/0xa0 [1935109.907884] [] release_pages+0x4d5/0x9f0 [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120 [1935109.913957] [] unmap_page_range+0xe73/0x1460 [1935109.915737] [] unmap_single_vma+0x126/0x2f0 [1935109.916646] [] unmap_vmas+0xdd/0x190 [1935109.917454] [] exit_mmap+0x221/0x430 [1935109.921176] [] mmput+0xb1/0x240 [1935109.921919] [] do_exit+0x732/0x27c0 [1935109.928561] [] do_group_exit+0xf9/0x300 [1935109.929786] [] SyS_exit_group+0x1d/0x20 [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a Thanks, Sasha
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Kirill A. Shutemov wrote: On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: [...] I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? Yeah, I don't think this was ever resolved, but ultimately the patch got reverted[1] because it exposed issues in the form of bad pages (shmem, vmsplice) and corrupted vm_flags while in untrack_pfn(), causing, for example, vm_file to disappear. I hoped Davidlohr would come back with something after the revert, but it never happened. I think the reverted patch was responsible for most of the scalability boost from rwsem for i_mmap_lock... Actually no, the change that got reverted was something we got in at the very last minute, just because it made sense and had the blessing of some key people. The main winner of the series was migration (rmap), which later Hugh addressed more specifically for unmapped pages: https://lkml.org/lkml/2014/11/30/349 So I really didn't care about the reverted patch, and therefore it was never on my radar. [1] https://lkml.org/lkml/2014/12/22/375 Thanks, Davidlohr
Re: Multiple potential races on vma->vm_flags
On 09/15/2015 01:36 PM, Sasha Levin wrote: > On 09/11/2015 09:27 PM, Hugh Dickins wrote: >> > I'm inclined to echo Vlastimil's comment from earlier in the thread: >> > sounds like an overkill, unless we find something more serious than this. > I've modified my tests to stress the exit path of processes with many vmas, > and hit the following NULL ptr deref (not sure if it's related to the > original issue): > > [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory > access general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN > [1181047.937223] Modules linked in: > [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted > 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 > [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: > 880433f0 > [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) > [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 > [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: > > [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: > 04c8 > [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: > 15c59e01 > [1181047.944604] R10: R11: 0001 R12: > > [1181047.944833] pps pps0: PPS event at 21837.866101174 > [1181047.944838] pps pps0: capture assert seq #7188 > [1181047.946261] R13: R14: 880433f07910 R15: > 2e0d > [1181047.947005] FS: () GS:88025200() > knlGS: > [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b > [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: > 06a0 > [1181047.949085] Stack: > [1181047.949350] 880433f07910 1100867e0f1e > dc00 > [1181047.950164] 8801d825d000 2e0d 880433f079d0 > 9276c4ab > [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 > 9276c2a0 > [1181047.951906] Call Trace: > [1181047.952201] exit_mmap (mm/mmap.c:2856) > [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) > [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.954281] ? 
rcu_read_lock_sched_held (kernel/rcu/update.c:109) > [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 > mm/slub.c:2746) > [1181047.955535] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 > kernel/fork.c:702) > [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 > include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) > [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) > [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) > [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) > [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) > [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 > kernel/exit.c:859) > [1181047.965569] get_signal (kernel/signal.c:2353) > [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) > [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 > fs/read_write.c:821) > [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) > [1181047.969126] ? vfs_write (fs/read_write.c:777) > [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) > [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 > include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) > [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 > kernel/time/itimer.c:227) > [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) > [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) > [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) > [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) > [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 > c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 > <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 > All code > >0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) >6: 85 22 test %esp,(%rdx) >8: 01 00
Re: Multiple potential races on vma->vm_flags
On 09/11, Kirill A. Shutemov wrote: > > This one is tricky. I *assume* the mm cannot be generally accessible after > mm_users drops to zero, but I'm not entirely sure about it. > procfs? ptrace? Well, all I can say is that proc/ptrace look fine afaics... This is off-topic, but how about the patch below? Different threads can expand different vma's at the same time under read_lock(mmap_sem), so vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Oleg. --- x/mm/mmap.c +++ x/mm/mmap.c @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a if (security_vm_enough_memory_mm(mm, grow)) return -ENOMEM; - /* Ok, everything looks good - let it rip */ - if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); return 0; } @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; anon_vma_interval_tree_post_update_vma(vma); @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On 09/23, Oleg Nesterov wrote: > > On 09/11, Kirill A. Shutemov wrote: > > > > This one is tricky. I *assume* the mm cannot be generally accessible after > > mm_users drops to zero, but I'm not entirely sure about it. > > procfs? ptrace? > > Well, all I can say is that proc/ptrace look fine afaics... > > This is off-topic, but how about the patch below? Different threads can > expand different vma's at the same time under read_lock(mmap_sem), so > vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Perhaps vm_stat_account() should be moved too, but total_vm/etc is less important. Or did I miss something? Oleg. > --- x/mm/mmap.c > +++ x/mm/mmap.c > @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a > if (security_vm_enough_memory_mm(mm, grow)) > return -ENOMEM; > > - /* Ok, everything looks good - let it rip */ > - if (vma->vm_flags & VM_LOCKED) > - mm->locked_vm += grow; > vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); > return 0; > } > @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_end = address; > anon_vma_interval_tree_post_update_vma(vma); > @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_start = address; > vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote: > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". It seems that your patch doesn't fix the race from the report below, since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) is being executed. (Line numbers are from kernel with your patch applied.) 
=== ThreadSanitizer: data-race in munlock_vma_pages_range Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2: [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1: [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208 [< inline >] rmap_walk_file mm/rmap.c:1522 [] rmap_walk+0x147/0x450 mm/rmap.c:1541 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 ===
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins wrote: > > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > >> If anybody comes up with a patch to fix the original issue I easily > > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > > >> fuzzing with KTSAN and Trinity. > > > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > > your report of it: please repost - thanks - though on seeing it, I may > > > well end up with no ideas. > > > > The report is below. > > Thanks. > > > > > I get it after a few seconds of running Trinity on a kernel with KTSAN > > and targeting mlock, munlock and madvise syscalls. > > Sasha also observed a very similar crash a while ago > > (https://lkml.org/lkml/2014/11/6/1055). > > I didn't manage to reproduce this in a kernel build without KTSAN though. > > The idea was that data races KTSAN reports might be the explanation of > > these crashes. 
> > > > BUG: Bad page state in process trinity-c15 pfn:281999 > > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > bad because of flags: > > flags: 0x20(mlocked) > > Modules linked in: > > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > > 821c3b70 00014741 8800b857f948 > > 81e9926c 0003 ea000a066640 8800b857f978 > > 811ce045 821c3b70 ea000a066640 0001 > > Call Trace: > > [< inline >] __dump_stack lib/dump_stack.c:15 > > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > > [< inline >] free_pages_check mm/page_alloc.c:731 > > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > > [] free_hot_cold_page_list+0x5f/0x100 > > mm/page_alloc.c:1956 (discriminator 3) > > [] release_pages+0x151/0x300 mm/swap.c:967 > > [] __pagevec_release+0x43/0x60 mm/swap.c:984 > > [< inline >] pagevec_release include/linux/pagevec.h:69 > > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > > [< inline >] madvise_remove mm/madvise.c:326 > > [< inline >] madvise_vma mm/madvise.c:378 > > [< inline >] SYSC_madvise mm/madvise.c:528 > > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > > [] ? 
kt_atomic64_store+0x76/0x130 > > mm/ktsan/sync_atomic.c:161 > > [] entry_SYSCALL_64_fastpath+0x31/0x95 > > arch/x86/entry/entry_64.S:188 > > Disabling lock debugging due to kernel taint > > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). Indeed. Looks fishy. But probably will work. It was not obvious to me what makes clearing VM_LOCKED visible in try_to_unmap_one() without mmap_sem. After looking some more, it seems we would rely on page_check_address() in try_to_unmap_one() having acquire semantics and follow_page_mask() in munlock_vma_pages_range() having release semantics. But I would prefer to have something more explicit. That's a mess. > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. > > You mention how Sasha hit the "Bad page state (mlocked)" back in > November: that was one of the reasons we reverted Davidlohr's > i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), > without understanding why it was needed. Yes, it would lock out > a concurrent try_to_unmap(), whose setting of PageMlocked was not > sufficiently serialized by the down_read_trylock of mmap_sem. > > But I don't remember the other reasons for that revert (and > haven't looked very hard as yet): anyone else remember? 
I hoped Davidlohr would come back with something after the revert, but it never happened. I think the reverted patch was responsible for most of the scalability boost from rwsem for i_mmap_lock... -- Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickinswrote: > > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > >> If anybody comes up with a patch to fix the original issue I easily > > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > > >> fuzzing with KTSAN and Trinity. > > > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > > your report of it: please repost - thanks - though on seeing it, I may > > > well end up with no ideas. > > > > The report is below. > > Thanks. > > > > > I get it after a few seconds of running Trinity on a kernel with KTSAN > > and targeting mlock, munlock and madvise syscalls. > > Sasha also observed a very similar crash a while ago > > (https://lkml.org/lkml/2014/11/6/1055). > > I didn't manage to reproduce this in a kernel build without KTSAN though. > > The idea was that data races KTSAN reports might be the explanation of > > these crashes. 
> > > > BUG: Bad page state in process trinity-c15 pfn:281999 > > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > bad because of flags: > > flags: 0x20(mlocked) > > Modules linked in: > > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > > 821c3b70 00014741 8800b857f948 > > 81e9926c 0003 ea000a066640 8800b857f978 > > 811ce045 821c3b70 ea000a066640 0001 > > Call Trace: > > [< inline >] __dump_stack lib/dump_stack.c:15 > > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > > [< inline >] free_pages_check mm/page_alloc.c:731 > > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > > [] free_hot_cold_page_list+0x5f/0x100 > > mm/page_alloc.c:1956 (discriminator 3) > > [] release_pages+0x151/0x300 mm/swap.c:967 > > [] __pagevec_release+0x43/0x60 mm/swap.c:984 > > [< inline >] pagevec_release include/linux/pagevec.h:69 > > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > > [< inline >] madvise_remove mm/madvise.c:326 > > [< inline >] madvise_vma mm/madvise.c:378 > > [< inline >] SYSC_madvise mm/madvise.c:528 > > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > > [] ? 
kt_atomic64_store+0x76/0x130 > > mm/ktsan/sync_atomic.c:161 > > [] entry_SYSCALL_64_fastpath+0x31/0x95 > > arch/x86/entry/entry_64.S:188 > > Disabling lock debugging due to kernel taint > > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). Indeed. Looks fishy. But probably will work. That was not obvious for me what makes clearing VM_LOCKED visible in try_to_unmap_one() without mmap_sem. After looking some more it seems we would rely on page_check_address() in try_to_unmap_one() having acquire semantics and follow_page_mask() in munlock_vma_pages_range() having release semantics. But I would prefer to have something more explicit. That's mess. > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. > > You mention how Sasha hit the "Bad page state (mlocked)" back in > November: that was one of the reasons we reverted Davidlohr's > i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), > without understanding why it was needed. Yes, it would lock out > a concurrent try_to_unmap(), whose setting of PageMlocked was not > sufficiently serialized by the down_read_trylock of mmap_sem. > > But I don't remember the other reasons for that revert (and > haven't looked very hard as yet): anyone else remember? 
I hoped Davidlohr will come back with something after the revert, but it never happend. I think the reverted patch was responsible for most of scalability boost from rwsem for i_mmap_lock... -- Kirill A.
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickinswrote: > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". It seems that your patch doesn't fix the race from the report below, since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) is being executed. (Line numbers are from kernel with your patch applied.) 
=== ThreadSanitizer: data-race in munlock_vma_pages_range Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2: [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1: [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208 [< inline >] rmap_walk_file mm/rmap.c:1522 [] rmap_walk+0x147/0x450 mm/rmap.c:1541 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 === -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On 09/23, Oleg Nesterov wrote: > > On 09/11, Kirill A. Shutemov wrote: > > > > This one is tricky. I *assume* the mm cannot be generally accessible after > > mm_users drops to zero, but I'm not entirely sure about it. > > procfs? ptrace? > > Well, all I can say is that proc/ptrace look fine afaics... > > This is off-topic, but how about the patch below? Different threads can > expand different vma's at the same time under read_lock(mmap_sem), so > vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Perhaps vm_stat_account() should be moved too, but total_vm/etc is less important. Or did I miss something? Oleg. > --- x/mm/mmap.c > +++ x/mm/mmap.c > @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a > if (security_vm_enough_memory_mm(mm, grow)) > return -ENOMEM; > > - /* Ok, everything looks good - let it rip */ > - if (vma->vm_flags & VM_LOCKED) > - mm->locked_vm += grow; > vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); > return 0; > } > @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_end = address; > anon_vma_interval_tree_post_update_vma(vma); > @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_start = address; > vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On 09/11, Kirill A. Shutemov wrote: > > This one is tricky. I *assume* the mm cannot be generally accessible after > mm_users drops to zero, but I'm not entirely sure about it. > procfs? ptrace? Well, all I can say is that proc/ptrace look fine afaics... This is off-topic, but how about the patch below? Different threads can expand different vma's at the same time under read_lock(mmap_sem), so vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Oleg. --- x/mm/mmap.c +++ x/mm/mmap.c @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a if (security_vm_enough_memory_mm(mm, grow)) return -ENOMEM; - /* Ok, everything looks good - let it rip */ - if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); return 0; } @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; anon_vma_interval_tree_post_update_vma(vma); @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On 09/15/2015 01:36 PM, Sasha Levin wrote: > On 09/11/2015 09:27 PM, Hugh Dickins wrote: >> > I'm inclined to echo Vlastimil's comment from earlier in the thread: >> > sounds like an overkill, unless we find something more serious than this. > I've modified my tests to stress the exit path of processes with many vmas, > and hit the following NULL ptr deref (not sure if it's related to the > original issue): > > [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory > accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN > [1181047.937223] Modules linked in: > [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted > 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 > [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: > 880433f0 > [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) > [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 > [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: > > [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: > 04c8 > [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: > 15c59e01 > [1181047.944604] R10: R11: 0001 R12: > > [1181047.944833] pps pps0: PPS event at 21837.866101174 > [1181047.944838] pps pps0: capture assert seq #7188 > [1181047.946261] R13: R14: 880433f07910 R15: > 2e0d > [1181047.947005] FS: () GS:88025200() > knlGS: > [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b > [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: > 06a0 > [1181047.949085] Stack: > [1181047.949350] 880433f07910 1100867e0f1e > dc00 > [1181047.950164] 8801d825d000 2e0d 880433f079d0 > 9276c4ab > [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 > 9276c2a0 > [1181047.951906] Call Trace: > [1181047.952201] exit_mmap (mm/mmap.c:2856) > [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) > [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.954281] ? 
rcu_read_lock_sched_held (kernel/rcu/update.c:109) > [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 > mm/slub.c:2746) > [1181047.955535] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 > kernel/fork.c:702) > [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 > include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) > [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) > [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) > [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) > [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) > [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 > kernel/exit.c:859) > [1181047.965569] get_signal (kernel/signal.c:2353) > [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) > [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 > fs/read_write.c:821) > [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) > [1181047.969126] ? vfs_write (fs/read_write.c:777) > [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) > [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 > include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) > [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 > kernel/time/itimer.c:227) > [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) > [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) > [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) > [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) > [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 > c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 > <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 > All code > >0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) >6: 85 22 test %esp,(%rdx) >8: 01 00
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Kirill A. Shutemov wrote: On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: [...] I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? Yeah, I don't think this was ever resolved, but ultimately the patch got reverted[1] because it exposed issues in the form of bad pages (shmem, vmsplice) and corrupted vm_flags while in untrack_pfn() causing, for example, vm_file to disappear. I hoped Davidlohr would come back with something after the revert, but it never happened. I think the reverted patch was responsible for most of the scalability boost from rwsem for i_mmap_lock... Actually no, the change that got reverted was something we got in at the very last minute, just because it made sense and had the blessing of some key people. The main winner of the series was migration (rmap), which later Hugh addressed more specifically for unmapped pages: https://lkml.org/lkml/2014/11/30/349 So I really didn't care about the reverted patch, and therefore it was never on my radar. [1] https://lkml.org/lkml/2014/12/22/375 Thanks, Davidlohr
Re: Multiple potential races on vma->vm_flags
On 09/23/2015 09:08 AM, Andrey Konovalov wrote: > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote: >> > This is totally untested, and one of you may quickly prove me wrong; >> > but I went in to fix your "Bad page state (mlocked)" by holding pte >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), >> > then couldn't see why it would need mmap_sem at all, given how mlock >> > and munlock first assert intention by setting or clearing VM_LOCKED >> > in vm_flags, then work their way up the vma, taking pte locks. >> > >> > Calling mlock_vma_page() under pte lock may look suspicious >> > at first: but what it does is similar to clear_page_mlock(), >> > which we regularly call under pte lock from page_remove_rmap(). >> > >> > I'd rather wait to hear whether this appears to work in practice, >> > and whether you agree that it should work in theory, before writing >> > the proper description. I'd love to lose that down_read_trylock. > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". > > It seems that your patch doesn't fix the race from the report below, since pte > lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) > is being executed. (Line numbers are from a kernel with your patch applied.) I've fired up my HZ_1 patch, and this seems to be a real race that is somewhat easy to reproduce under those conditions. 
Here's a fresh backtrace from my VMs: [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200 [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0 [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked) [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [1935109.888197] bad because of flags: [1935109.888759] flags: 0x10(mlocked) [1935109.889525] Modules linked in: [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569 [1935109.891876] 16445448 e5dca494 8803f7657708 a70402da [1935109.893504] ea000f288000 8803f7657738 a56e522b 022f80144008 [1935109.894947] ea000f288020 ea000f288000 8803f76577a8 [1935109.896413] Call Trace: [1935109.899102] [] dump_stack+0x4e/0x84 [1935109.899821] [] bad_page+0x17b/0x210 [1935109.900469] [] free_pages_prepare+0xb48/0x1110 [1935109.902127] [] __free_pages_ok+0x21/0x260 [1935109.904435] [] free_compound_page+0x63/0x80 [1935109.905614] [] free_transhuge_page+0x6e/0x80 [1935109.906752] [] __put_compound_page+0x76/0xa0 [1935109.907884] [] release_pages+0x4d5/0x9f0 [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120 [1935109.913957] [] unmap_page_range+0xe73/0x1460 [1935109.915737] [] unmap_single_vma+0x126/0x2f0 [1935109.916646] [] unmap_vmas+0xdd/0x190 [1935109.917454] [] exit_mmap+0x221/0x430 [1935109.921176] [] mmput+0xb1/0x240 [1935109.921919] [] do_exit+0x732/0x27c0 [1935109.928561] [] do_group_exit+0xf9/0x300 [1935109.929786] [] SyS_exit_group+0x1d/0x20 [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a Thanks, Sasha
Re: Multiple potential races on vma->vm_flags
On Tue, 22 Sep 2015, Andrey Konovalov wrote: > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins wrote: > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > >> If anybody comes up with a patch to fix the original issue I easily > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > >> fuzzing with KTSAN and Trinity. > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > your report of it: please repost - thanks - though on seeing it, I may > > well end up with no ideas. > > The report is below. Thanks. > > I get it after a few seconds of running Trinity on a kernel with KTSAN > and targeting mlock, munlock and madvise syscalls. > Sasha also observed a very similar crash a while ago > (https://lkml.org/lkml/2014/11/6/1055). > I didn't manage to reproduce this in a kernel build without KTSAN though. > The idea was that data races KTSAN reports might be the explanation of > these crashes. > > BUG: Bad page state in process trinity-c15 pfn:281999 > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > bad because of flags: > flags: 0x20(mlocked) > Modules linked in: > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 821c3b70 00014741 8800b857f948 > 81e9926c 0003 ea000a066640 8800b857f978 > 811ce045 821c3b70 ea000a066640 0001 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > [< inline >] free_pages_check mm/page_alloc.c:731 > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > [] free_hot_cold_page_list+0x5f/0x100 > mm/page_alloc.c:1956 (discriminator 3) > [] release_pages+0x151/0x300 mm/swap.c:967 > [] __pagevec_release+0x43/0x60 
mm/swap.c:984 > [< inline >] pagevec_release include/linux/pagevec.h:69 > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > [< inline >] madvise_remove mm/madvise.c:326 > [< inline >] madvise_vma mm/madvise.c:378 > [< inline >] SYSC_madvise mm/madvise.c:528 > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > [] ? kt_atomic64_store+0x76/0x130 > mm/ktsan/sync_atomic.c:161 > [] entry_SYSCALL_64_fastpath+0x31/0x95 > arch/x86/entry/entry_64.S:188 > Disabling lock debugging due to kernel taint This is totally untested, and one of you may quickly prove me wrong; but I went in to fix your "Bad page state (mlocked)" by holding pte lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), then couldn't see why it would need mmap_sem at all, given how mlock and munlock first assert intention by setting or clearing VM_LOCKED in vm_flags, then work their way up the vma, taking pte locks. Calling mlock_vma_page() under pte lock may look suspicious at first: but what it does is similar to clear_page_mlock(), which we regularly call under pte lock from page_remove_rmap(). I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? 
Not-yet-Signed-off-by: Hugh Dickins --- mm/rmap.c | 32 +++- 1 file changed, 7 insertions(+), 25 deletions(-) --- 4.3-rc2/mm/rmap.c 2015-09-12 18:30:20.857039763 -0700 +++ linux/mm/rmap.c 2015-09-22 17:47:43.489096676 -0700 @@ -1314,9 +1314,12 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!(flags & TTU_IGNORE_MLOCK)) { - if (vma->vm_flags & VM_LOCKED) - goto out_mlock; - + if (vma->vm_flags & VM_LOCKED) { + /* Holding pte lock, we do *not* need mmap_sem here */ + mlock_vma_page(page); + ret = SWAP_MLOCK; + goto out_unmap; + } if (flags & TTU_MUNLOCK) goto
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins wrote: > On Tue, 22 Sep 2015, Andrey Konovalov wrote: >> If anybody comes up with a patch to fix the original issue I easily >> can test it, since I'm hitting "BUG: Bad page state" in a second when >> fuzzing with KTSAN and Trinity. > > This "BUG: Bad page state" sounds more serious, but I cannot track down > your report of it: please repost - thanks - though on seeing it, I may > well end up with no ideas. The report is below. I get it after a few seconds of running Trinity on a kernel with KTSAN and targeting mlock, munlock and madvise syscalls. Sasha also observed a very similar crash a while ago (https://lkml.org/lkml/2014/11/6/1055). I didn't manage to reproduce this in a kernel build without KTSAN though. The idea was that data races KTSAN reports might be the explanation of these crashes. BUG: Bad page state in process trinity-c15 pfn:281999 page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x20(mlocked) Modules linked in: CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 821c3b70 00014741 8800b857f948 81e9926c 0003 ea000a066640 8800b857f978 811ce045 821c3b70 ea000a066640 0001 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0x63/0x81 lib/dump_stack.c:50 [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 [< inline >] free_pages_check mm/page_alloc.c:731 [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 [] free_hot_cold_page_list+0x5f/0x100 mm/page_alloc.c:1956 (discriminator 3) [] release_pages+0x151/0x300 mm/swap.c:967 [] __pagevec_release+0x43/0x60 mm/swap.c:984 [< inline >] pagevec_release include/linux/pagevec.h:69 [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 [] 
shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 [< inline >] madvise_remove mm/madvise.c:326 [< inline >] madvise_vma mm/madvise.c:378 [< inline >] SYSC_madvise mm/madvise.c:528 [] SyS_madvise+0x378/0x760 mm/madvise.c:459 [] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161 [] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188 Disabling lock debugging due to kernel taint > > Hugh
Re: Multiple potential races on vma->vm_flags
On Tue, 22 Sep 2015, Andrey Konovalov wrote: > If anybody comes up with a patch to fix the original issue I easily > can test it, since I'm hitting "BUG: Bad page state" in a second when > fuzzing with KTSAN and Trinity. This "BUG: Bad page state" sounds more serious, but I cannot track down your report of it: please repost - thanks - though on seeing it, I may well end up with no ideas. Hugh
Re: Multiple potential races on vma->vm_flags
If anybody comes up with a patch to fix the original issue I easily can test it, since I'm hitting "BUG: Bad page state" in a second when fuzzing with KTSAN and Trinity. On Tue, Sep 15, 2015 at 9:01 PM, Kirill A. Shutemov wrote: > On Tue, Sep 15, 2015 at 01:36:45PM -0400, Sasha Levin wrote: >> On 09/11/2015 09:27 PM, Hugh Dickins wrote: >> > I'm inclined to echo Vlastimil's comment from earlier in the thread: >> > sounds like an overkill, unless we find something more serious than this. >> >> I've modified my tests to stress the exit path of processes with many vmas, > > Could you share the test? > >> and hit the following NULL ptr deref (not sure if it's related to the >> original issue): >> >> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory >> access general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN >> [1181047.937223] Modules linked in: >> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted >> 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 >> [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: >> 880433f0 >> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) > > Is it "struct mm_struct *mm = vma->vm_mm;"? > > -- > Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On Tue, 22 Sep 2015, Andrey Konovalov wrote: > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickinswrote: > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > >> If anybody comes up with a patch to fix the original issue I easily > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > >> fuzzing with KTSAN and Trinity. > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > your report of it: please repost - thanks - though on seeing it, I may > > well end up with no ideas. > > The report is below. Thanks. > > I get it after a few seconds of running Trinity on a kernel with KTSAN > and targeting mlock, munlock and madvise syscalls. > Sasha also observed a very similar crash a while ago > (https://lkml.org/lkml/2014/11/6/1055). > I didn't manage to reproduce this in a kernel build without KTSAN though. > The idea was that data races KTSAN reports might be the explanation of > these crashes. > > BUG: Bad page state in process trinity-c15 pfn:281999 > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > bad because of flags: > flags: 0x20(mlocked) > Modules linked in: > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 821c3b70 00014741 8800b857f948 > 81e9926c 0003 ea000a066640 8800b857f978 > 811ce045 821c3b70 ea000a066640 0001 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > [< inline >] free_pages_check mm/page_alloc.c:731 > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > [] free_hot_cold_page_list+0x5f/0x100 > mm/page_alloc.c:1956 (discriminator 3) > [] release_pages+0x151/0x300 mm/swap.c:967 > [] __pagevec_release+0x43/0x60 
mm/swap.c:984 > [< inline >] pagevec_release include/linux/pagevec.h:69 > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > [< inline >] madvise_remove mm/madvise.c:326 > [< inline >] madvise_vma mm/madvise.c:378 > [< inline >] SYSC_madvise mm/madvise.c:528 > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > [] ? kt_atomic64_store+0x76/0x130 > mm/ktsan/sync_atomic.c:161 > [] entry_SYSCALL_64_fastpath+0x31/0x95 > arch/x86/entry/entry_64.S:188 > Disabling lock debugging due to kernel taint This is totally untested, and one of you may quickly prove me wrong; but I went in to fix your "Bad page state (mlocked)" by holding pte lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), then couldn't see why it would need mmap_sem at all, given how mlock and munlock first assert intention by setting or clearing VM_LOCKED in vm_flags, then work their way up the vma, taking pte locks. Calling mlock_vma_page() under pte lock may look suspicious at first: but what it does is similar to clear_page_mlock(), which we regularly call under pte lock from page_remove_rmap(). I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? 
Not-yet-Signed-off-by: Hugh Dickins
---
 mm/rmap.c | 32 +++-
 1 file changed, 7 insertions(+), 25 deletions(-)

--- 4.3-rc2/mm/rmap.c	2015-09-12 18:30:20.857039763 -0700
+++ linux/mm/rmap.c	2015-09-22 17:47:43.489096676 -0700
@@ -1314,9 +1314,12 @@ static int try_to_unmap_one(struct page
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!(flags & TTU_IGNORE_MLOCK)) {
-		if (vma->vm_flags & VM_LOCKED)
-			goto out_mlock;
-
+		if (vma->vm_flags & VM_LOCKED) {
+			/* Holding pte lock, we do *not* need mmap_sem here */
+			mlock_vma_page(page);
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
 		if (flags &
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 15, 2015 at 01:36:45PM -0400, Sasha Levin wrote: > On 09/11/2015 09:27 PM, Hugh Dickins wrote: > > I'm inclined to echo Vlastimil's comment from earlier in the thread: > > sounds like an overkill, unless we find something more serious than this. > > I've modified my tests to stress the exit path of processes with many vmas, Could you share the test? > and hit the following NULL ptr deref (not sure if it's related to the > original issue): > > [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory > accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN > [1181047.937223] Modules linked in: > [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted > 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 > [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: > 880433f0 > [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) Is it "struct mm_struct *mm = vma->vm_mm;"? -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 09:27 PM, Hugh Dickins wrote: > I'm inclined to echo Vlastimil's comment from earlier in the thread: > sounds like an overkill, unless we find something more serious than this. I've modified my tests to stress the exit path of processes with many vmas, and hit the following NULL ptr deref (not sure if it's related to the original issue): [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [1181047.937223] Modules linked in: [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0 [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: 04c8 [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: 15c59e01 [1181047.944604] R10: R11: 0001 R12: [1181047.944833] pps pps0: PPS event at 21837.866101174 [1181047.944838] pps pps0: capture assert seq #7188 [1181047.946261] R13: R14: 880433f07910 R15: 2e0d [1181047.947005] FS: () GS:88025200() knlGS: [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: 06a0 [1181047.949085] Stack: [1181047.949350] 880433f07910 1100867e0f1e dc00 [1181047.950164] 8801d825d000 2e0d 880433f079d0 9276c4ab [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 9276c2a0 [1181047.951906] Call Trace: [1181047.952201] exit_mmap (mm/mmap.c:2856) [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.954281] ? rcu_read_lock_sched_held (kernel/rcu/update.c:109) [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 mm/slub.c:2746) [1181047.955535] ? 
__khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 kernel/fork.c:702) [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859) [1181047.965569] get_signal (kernel/signal.c:2353) [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821) [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) [1181047.969126] ? vfs_write (fs/read_write.c:777) [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227) [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 All code 0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) 6: 85 22 test %esp,(%rdx) 8: 01 00 add%eax,(%rax) a: 00 48 8badd%cl,-0x75(%rax) d: 43 rex.XB e: 40 rex f: 48 8d b8 c8 04 00 00lea0x4c8(%rax),%rdi
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 09:27 PM, Hugh Dickins wrote: > I'm inclined to echo Vlastimil's comment from earlier in the thread: > sounds like an overkill, unless we find something more serious than this. I've modified my tests to stress the exit path of processes with many vmas, and hit the following NULL ptr deref (not sure if it's related to the original issue): [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [1181047.937223] Modules linked in: [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0 [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: 04c8 [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: 15c59e01 [1181047.944604] R10: R11: 0001 R12: [1181047.944833] pps pps0: PPS event at 21837.866101174 [1181047.944838] pps pps0: capture assert seq #7188 [1181047.946261] R13: R14: 880433f07910 R15: 2e0d [1181047.947005] FS: () GS:88025200() knlGS: [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: 06a0 [1181047.949085] Stack: [1181047.949350] 880433f07910 1100867e0f1e dc00 [1181047.950164] 8801d825d000 2e0d 880433f079d0 9276c4ab [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 9276c2a0 [1181047.951906] Call Trace: [1181047.952201] exit_mmap (mm/mmap.c:2856) [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.954281] ? rcu_read_lock_sched_held (kernel/rcu/update.c:109) [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 mm/slub.c:2746) [1181047.955535] ? 
__khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 kernel/fork.c:702) [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859) [1181047.965569] get_signal (kernel/signal.c:2353) [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821) [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) [1181047.969126] ? vfs_write (fs/read_write.c:777) [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227) [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 All code 0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) 6: 85 22 test %esp,(%rdx) 8: 01 00 add%eax,(%rax) a: 00 48 8badd%cl,-0x75(%rax) d: 43 rex.XB e: 40 rex f: 48 8d b8 c8 04 00 00lea0x4c8(%rax),%rdi
Re: Multiple potential races on vma->vm_flags
On Fri, Sep 11, 2015 at 06:27:14PM -0700, Hugh Dickins wrote:
> On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> > On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > > Can a vma be shared among a few mm's?
> >
> > Define "shared".
> >
> > vma can belong only to one process (mm_struct), but it can be accessed
> > from other process like in rmap case below.
> >
> > rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> > sure that the vma will not disappear under it.
> >
> > > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
> >
> > Depends on what protection you are talking about.
> >
> > > In the first report below both T378 and T398 take
> > > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > > different locks (the addresses are different).
> >
> > See i_mmap_lock_read() in T398. It will guarantee that vma is there.
> >
> > > In the second report T309 doesn't take any locks at all, since it
> > > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > > has no other users, but then it does a write to vma.
> >
> > This one is tricky. I *assume* the mm cannot be generally accessible after
> > mm_users drops to zero, but I'm not entirely sure about it.
> > procfs? ptrace?
>
> Most of the things (including procfs and ptrace) that need to work on
> a foreign mm do take a hold on mm_users with get_task_mm(). swapoff
> uses atomic_inc_not_zero(&mm->mm_users). In KSM I managed to get away
> with just a hold on the structure itself, atomic_inc(&mm->mm_count),
> and a check for mm_users 0 wherever it down_reads mmap_sem (but Andrey
> might like to turn KSM on: it wouldn't be entirely shocking if he were
> to discover an anomaly from that).
>
> > The VMA is still accessible via rmap at this point. And I think it can be
> > a problem:
> >
> > 	CPU0				CPU1
> > 	exit_mmap()
> > 	  // mmap_sem is *not* taken
> > 	  munlock_vma_pages_all()
> > 	    munlock_vma_pages_range()
> > 					try_to_unmap_one()
> > 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> > 					  !!(vma->vm_flags & VM_LOCKED) == true
> > 	      vma->vm_flags &= ~VM_LOCKED;
> > 					  mlock_vma_page(page);
> > 					  // mlocked pages is leaked.
> >
> > The obvious solution is to take mmap_sem in exit path, but it would cause
> > performance regression.
> >
> > Any comments?
>
> I'm inclined to echo Vlastimil's comment from earlier in the thread:
> sounds like an overkill, unless we find something more serious than this.
>
> I'm not sure whether we'd actually see a regression from taking mmap_sem
> in exit path; but given that it's mmap_sem, yes, history tells us please
> not to take it any more than we have to.
>
> I do remember wishing, when working out KSM's mm handling, that exit took
> mmap_sem: it would have made it simpler, but that wasn't a change I dared
> to make.
>
> Maybe an mm_users 0 check after down_read_trylock in try_to_unmap_one()
> could fix it?

I don't see how. It would shift a picture, but doesn't fix it: exit_mmap()
can happen after down_read_trylock() and mm_users check. We would only hide
the problem.

> But if we were to make a bigger change for this VM_LOCKED issue, and
> something more serious makes it worth all the effort, I'd say that
> what needs to be done is to give mlock/munlock proper locking (haha).
>
> I have not yet looked at your mlocked THP patch (sorry), but when I
> was doing the same thing for huge tmpfs, what made it so surprisingly
> difficult was all the spongy trylocking, which concealed the rules.
>
> Maybe I'm completely wrong, but I thought a lot of awkwardness might
> disappear if they were relying on anon_vma->rwsem and i_mmap_rwsem
> throughout instead of mmap_sem.

This can be helpful. But the risk is getting scalability regression on
other front: long anon_vma chain or highly shared files.
--
Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?

Most of the things (including procfs and ptrace) that need to work on
a foreign mm do take a hold on mm_users with get_task_mm(). swapoff
uses atomic_inc_not_zero(&mm->mm_users). In KSM I managed to get away
with just a hold on the structure itself, atomic_inc(&mm->mm_count),
and a check for mm_users 0 wherever it down_reads mmap_sem (but Andrey
might like to turn KSM on: it wouldn't be entirely shocking if he were
to discover an anomaly from that).

> The VMA is still accessible via rmap at this point.
> And I think it can be a problem:
>
> 	CPU0				CPU1
> 	exit_mmap()
> 	  // mmap_sem is *not* taken
> 	  munlock_vma_pages_all()
> 	    munlock_vma_pages_range()
> 					try_to_unmap_one()
> 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> 					  !!(vma->vm_flags & VM_LOCKED) == true
> 	      vma->vm_flags &= ~VM_LOCKED;
> 					  mlock_vma_page(page);
> 					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

I'm inclined to echo Vlastimil's comment from earlier in the thread:
sounds like an overkill, unless we find something more serious than this.

I'm not sure whether we'd actually see a regression from taking mmap_sem
in exit path; but given that it's mmap_sem, yes, history tells us please
not to take it any more than we have to.

I do remember wishing, when working out KSM's mm handling, that exit took
mmap_sem: it would have made it simpler, but that wasn't a change I dared
to make.

Maybe an mm_users 0 check after down_read_trylock in try_to_unmap_one()
could fix it?

But if we were to make a bigger change for this VM_LOCKED issue, and
something more serious makes it worth all the effort, I'd say that
what needs to be done is to give mlock/munlock proper locking (haha).

I have not yet looked at your mlocked THP patch (sorry), but when I
was doing the same thing for huge tmpfs, what made it so surprisingly
difficult was all the spongy trylocking, which concealed the rules.

Maybe I'm completely wrong, but I thought a lot of awkwardness might
disappear if they were relying on anon_vma->rwsem and i_mmap_rwsem
throughout instead of mmap_sem.

Hugh
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 05:29 PM, Vlastimil Babka wrote:
> On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> > On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > > Can a vma be shared among a few mm's?
> >
> > Define "shared".
> >
> > vma can belong only to one process (mm_struct), but it can be accessed
> > from other process like in rmap case below.
> >
> > rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> > sure that the vma will not disappear under it.
> >
> > > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
> >
> > Depends on what protection you are talking about.
> >
> > > In the first report below both T378 and T398 take
> > > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > > different locks (the addresses are different).
> >
> > See i_mmap_lock_read() in T398. It will guarantee that vma is there.
> >
> > > In the second report T309 doesn't take any locks at all, since it
> > > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > > has no other users, but then it does a write to vma.
> >
> > This one is tricky. I *assume* the mm cannot be generally accessible after
> > mm_users drops to zero, but I'm not entirely sure about it.
> > procfs? ptrace?
> >
> > The VMA is still accessible via rmap at this point. And I think it can be
> > a problem:
> >
> > 	CPU0				CPU1
> > 	exit_mmap()
> > 	  // mmap_sem is *not* taken
> > 	  munlock_vma_pages_all()
> > 	    munlock_vma_pages_range()
> > 					try_to_unmap_one()
> > 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> > 					  !!(vma->vm_flags & VM_LOCKED) == true
> > 	      vma->vm_flags &= ~VM_LOCKED;
> > 					  mlock_vma_page(page);
> > 					  // mlocked pages is leaked.
> >
> > The obvious solution is to take mmap_sem in exit path, but it would cause
> > performance regression.
> >
> > Any comments?
>
> Just so others don't repeat the paths that I already looked at:
>
> - First I thought that try_to_unmap_one() has the page locked and
> munlock_vma_pages_range() will also lock it... but it doesn't.

More precisely, it does (in __munlock_pagevec()), but
TestClearPageMlocked(page) doesn't happen under that lock.

> - Then I thought that exit_mmap() will revisit the page anyway doing actual
> unmap. It would, if it's the one who has the page mapped, it will clear the
> mlock (see page_remove_rmap()). If it's not the last one, page will be left
> locked. So it won't be completely leaked, but still, it will be mlocked when
> it shouldn't.
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?
>
> The VMA is still accessible via rmap at this point. And I think it can be
> a problem:
>
> 	CPU0				CPU1
> 	exit_mmap()
> 	  // mmap_sem is *not* taken
> 	  munlock_vma_pages_all()
> 	    munlock_vma_pages_range()
> 					try_to_unmap_one()
> 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> 					  !!(vma->vm_flags & VM_LOCKED) == true
> 	      vma->vm_flags &= ~VM_LOCKED;
> 					  mlock_vma_page(page);
> 					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

Just so others don't repeat the paths that I already looked at:

- First I thought that try_to_unmap_one() has the page locked and
munlock_vma_pages_range() will also lock it... but it doesn't.

- Then I thought that exit_mmap() will revisit the page anyway doing actual
unmap. It would, if it's the one who has the page mapped, it will clear the
mlock (see page_remove_rmap()). If it's not the last one, page will be left
locked. So it won't be completely leaked, but still, it will be mlocked when
it shouldn't.
Re: Multiple potential races on vma->vm_flags
On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> Can a vma be shared among a few mm's?

Define "shared".

vma can belong only to one process (mm_struct), but it can be accessed
from other process like in rmap case below.

rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
sure that the vma will not disappear under it.

> If yes, then taking current->mm->mmap_sem to protect vma is not enough.

Depends on what protection you are talking about.

> In the first report below both T378 and T398 take
> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> different locks (the addresses are different).

See i_mmap_lock_read() in T398. It will guarantee that vma is there.

> In the second report T309 doesn't take any locks at all, since it
> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> has no other users, but then it does a write to vma.

This one is tricky. I *assume* the mm cannot be generally accessible after
mm_users drops to zero, but I'm not entirely sure about it.
procfs? ptrace?

The VMA is still accessible via rmap at this point. And I think it can be
a problem:

	CPU0				CPU1
	exit_mmap()
	  // mmap_sem is *not* taken
	  munlock_vma_pages_all()
	    munlock_vma_pages_range()
					try_to_unmap_one()
					  down_read_trylock(&vma->vm_mm->mmap_sem)
					  !!(vma->vm_flags & VM_LOCKED) == true
	      vma->vm_flags &= ~VM_LOCKED;
					  mlock_vma_page(page);
					  // mlocked pages is leaked.

The obvious solution is to take mmap_sem in exit path, but it would cause
performance regression.

Any comments?
> ==
> ThreadSanitizer: data-race in munlock_vma_pages_range
>
> Write of size 8 by thread T378 (K2633, CPU3):
>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Locks held by T378:
>  #0 Lock 25710428 taken here:
>   [< inlined>] SYSC_munlock mm/mlock.c:650
>   [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Previous read of size 8 by thread T398 (K2623, CPU2):
>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [< inlined>] rmap_walk_file mm/rmap.c:1540
>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Locks held by T398:
>  #0 Lock 21b00c68 taken here:
>   [< inlined>] SYSC_munlock mm/mlock.c:650
>   [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>  #1 Lock bac2d750 taken here:
>   [< inlined>] i_mmap_lock_read include/linux/fs.h:509
>   [< inlined>] rmap_walk_file mm/rmap.c:1533
>   [] rmap_walk+0x78/0x450 mm/rmap.c:1559
>   [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>   [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>   [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>   [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>   [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>   [] do_mlock+0x14c/0x180 mm/mlock.c:589
>   [< inlined>] SYSC_munlock mm/mlock.c:651
>   [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>  #2 Lock 0895f570 taken here:
>   [< inlined>] spin_lock include/linux/spinlock.h:312
>   [] __page_check_address+0xd9/0x210 mm/rmap.c:681
>   [< inlined>] page_check_address include/linux/rmap.h:204
>   [] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
>   [< inlined>] rmap_walk_file mm/rmap.c:1540
>   [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>   [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>   [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>   [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>   [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>   [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>   [] do_mlock+0x14c/0x180 mm/mlock.c:589
>   [< inlined>] SYSC_munlock mm/mlock.c:651
>   [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> DBG: addr: 880222610e10
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 05:29 PM, Vlastimil Babka wrote:
> On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
>> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
>>> Can a vma be shared among a few mm's?
>>
>> Define "shared".
>>
>> vma can belong only to one process (mm_struct), but it can be accessed
>> from other process like in rmap case below.
>>
>> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to
>> make sure that the vma will not disappear under it.
>>
>>> If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>>
>> Depends on what protection you are talking about.
>>
>>> In the first report below both T378 and T398 take
>>> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
>>> different locks (the addresses are different).
>>
>> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>>
>>> In the second report T309 doesn't take any locks at all, since it
>>> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
>>> has no other users, but then it does a write to vma.
>>
>> This one is tricky. I *assume* the mm cannot be generally accessible
>> after mm_users drops to zero, but I'm not entirely sure about it.
>> procfs? ptrace?
>>
>> The VMA is still accessible via rmap at this point. And I think it can
>> be a problem:
>>
>>	CPU0					CPU1
>> exit_mmap()
>>   // mmap_sem is *not* taken
>>   munlock_vma_pages_all()
>>     munlock_vma_pages_range()
>>					try_to_unmap_one()
>>					  down_read_trylock(&vma->vm_mm->mmap_sem)
>>					  !!(vma->vm_flags & VM_LOCKED) == true
>>       vma->vm_flags &= ~VM_LOCKED;
>>					  mlock_vma_page(page);
>>					  // mlocked pages is leaked.
>>
>> The obvious solution is to take mmap_sem in exit path, but it would
>> cause performance regression.
>>
>> Any comments?
>
> Just so others don't repeat the paths that I already looked at:
>
> - First I thought that try_to_unmap_one() has the page locked and
>   munlock_vma_pages_range() will also lock it... but it doesn't.

More precisely, it does (in __munlock_pagevec()), but
TestClearPageMlocked(page) doesn't happen under that lock.

> - Then I thought that exit_mmap() will revisit the page anyway doing
>   actual unmap. It would; if it's the one who has the page mapped, it
>   will clear the mlock (see page_remove_rmap()). If it's not the last
>   one, the page will be left locked.
>
> So it won't be completely leaked, but still, it will be mlocked when it
> shouldn't be.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?

Most of the things (including procfs and ptrace) that need to work on a
foreign mm do take a hold on mm_users with get_task_mm(). swapoff uses
atomic_inc_not_zero(&mm->mm_users). In KSM I managed to get away with just
a hold on the structure itself, atomic_inc(&mm->mm_count), and a check for
mm_users > 0 wherever it down_reads mmap_sem (but Andrey might like to
turn KSM on: it wouldn't be entirely shocking if he were to discover an
anomaly from that).

> The VMA is still accessible via rmap at this point. And I think it can be
> a problem:
>
>	CPU0					CPU1
> exit_mmap()
>   // mmap_sem is *not* taken
>   munlock_vma_pages_all()
>     munlock_vma_pages_range()
>					try_to_unmap_one()
>					  down_read_trylock(&vma->vm_mm->mmap_sem)
>					  !!(vma->vm_flags & VM_LOCKED) == true
>       vma->vm_flags &= ~VM_LOCKED;
>					  mlock_vma_page(page);
>					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

I'm inclined to echo Vlastimil's comment from earlier in the thread:
sounds like an overkill, unless we find something more serious than this.

I'm not sure whether we'd actually see a regression from taking mmap_sem
in the exit path; but given that it's mmap_sem, yes, history tells us
please not to take it any more than we have to. I do remember wishing,
when working out KSM's mm handling, that exit took mmap_sem: it would have
made it simpler, but that wasn't a change I dared to make.

Maybe an mm_users > 0 check after down_read_trylock in try_to_unmap_one()
could fix it?

But if we were to make a bigger change for this VM_LOCKED issue, and
something more serious makes it worth all the effort, I'd say that what
needs to be done is to give mlock/munlock proper locking (haha). I have
not yet looked at your mlocked THP patch (sorry), but when I was doing the
same thing for huge tmpfs, what made it so surprisingly difficult was all
the spongy trylocking, which concealed the rules. Maybe I'm completely
wrong, but I thought a lot of awkwardness might disappear if they were
relying on anon_vma->rwsem and i_mmap_rwsem throughout instead of
mmap_sem.

Hugh
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
>> Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to
> make sure that the vma will not disappear under it.
>
>> If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
>> In the first report below both T378 and T398 take
>> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
>> different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
>> In the second report T309 doesn't take any locks at all, since it
>> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
>> has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible
> after mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?
>
> The VMA is still accessible via rmap at this point. And I think it can
> be a problem:
>
>	CPU0					CPU1
> exit_mmap()
>   // mmap_sem is *not* taken
>   munlock_vma_pages_all()
>     munlock_vma_pages_range()
>					try_to_unmap_one()
>					  down_read_trylock(&vma->vm_mm->mmap_sem)
>					  !!(vma->vm_flags & VM_LOCKED) == true
>       vma->vm_flags &= ~VM_LOCKED;
>					  mlock_vma_page(page);
>					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would
> cause performance regression.
>
> Any comments?

Just so others don't repeat the paths that I already looked at:

- First I thought that try_to_unmap_one() has the page locked and
  munlock_vma_pages_range() will also lock it... but it doesn't.

- Then I thought that exit_mmap() will revisit the page anyway doing
  actual unmap. It would; if it's the one who has the page mapped, it
  will clear the mlock (see page_remove_rmap()). If it's not the last
  one, the page will be left locked.

So it won't be completely leaked, but still, it will be mlocked when it
shouldn't be.
Re: Multiple potential races on vma->vm_flags
Can a vma be shared among a few mm's?

If yes, then taking current->mm->mmap_sem to protect vma is not enough.

In the first report below both T378 and T398 take current->mm->mmap_sem at
mm/mlock.c:650, but they turn out to be different locks (the addresses are
different).

In the second report T309 doesn't take any locks at all, since it assumes
that after checking atomic_dec_and_test(&mm->mm_users) the mm has no other
users, but then it does a write to vma.

==
ThreadSanitizer: data-race in munlock_vma_pages_range

Write of size 8 by thread T378 (K2633, CPU3):
 [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [] do_mlock+0x14c/0x180 mm/mlock.c:589
 [< inlined>] SYSC_munlock mm/mlock.c:651
 [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Locks held by T378:
 #0 Lock 25710428 taken here:
  [< inlined>] SYSC_munlock mm/mlock.c:650
  [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Previous read of size 8 by thread T398 (K2623, CPU2):
 [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
 [< inlined>] rmap_walk_file mm/rmap.c:1540
 [] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [] do_mlock+0x14c/0x180 mm/mlock.c:589
 [< inlined>] SYSC_munlock mm/mlock.c:651
 [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Locks held by T398:
 #0 Lock 21b00c68 taken here:
  [< inlined>] SYSC_munlock mm/mlock.c:650
  [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
 #1 Lock bac2d750 taken here:
  [< inlined>] i_mmap_lock_read include/linux/fs.h:509
  [< inlined>] rmap_walk_file mm/rmap.c:1533
  [] rmap_walk+0x78/0x450 mm/rmap.c:1559
  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
  [] do_mlock+0x14c/0x180 mm/mlock.c:589
  [< inlined>] SYSC_munlock mm/mlock.c:651
  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
 #2 Lock 0895f570 taken here:
  [< inlined>] spin_lock include/linux/spinlock.h:312
  [] __page_check_address+0xd9/0x210 mm/rmap.c:681
  [< inlined>] page_check_address include/linux/rmap.h:204
  [] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
  [< inlined>] rmap_walk_file mm/rmap.c:1540
  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
  [] do_mlock+0x14c/0x180 mm/mlock.c:589
  [< inlined>] SYSC_munlock mm/mlock.c:651
  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

DBG: addr: 880222610e10
DBG: first offset: 0, second offset: 0
DBG: T378 clock: {T378: 4486533, T398: 2405850}
DBG: T398 clock: {T398: 2406009}
==

==
ThreadSanitizer: data-race in munlock_vma_pages_range

Write of size 8 by thread T309 (K2577, CPU0):
 [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [< inlined>] munlock_vma_pages_all mm/internal.h:252
 [] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [] mmput+0x65/0x190 kernel/fork.c:708
 [< inlined>] exit_mm kernel/exit.c:437
 [] do_exit+0x457/0x1420 kernel/exit.c:733
 [] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [< inlined>] SYSC_exit_group kernel/exit.c:885
 [] __wake_up_parent+0x0/0x50 kernel/exit.c:883
 [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Locks held by T309:

Previous read of size 8 by thread T293 (K2573, CPU3):
 [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
 [< inlined>] rmap_walk_file mm/rmap.c:1540
 [] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [< inlined>] munlock_vma_pages_all mm/internal.h:252
 [] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [] mmput+0x65/0x190 kernel/fork.c:708
 [< inlined>] exit_mm kernel/exit.c:437
 [] do_exit+0x457/0x1420 kernel/exit.c:733
 [] do_group_exit+0x7f/0x140 kernel/exit.c:874
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 09, 2015 at 08:58:26PM -0400, Sasha Levin wrote:
> On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> >> ==
> >> ThreadSanitizer: data-race in munlock_vma_pages_range
> >>
> >> Write of size 8 by thread T378 (K2633, CPU3):
> >>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> >>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> >>  [< inlined>] SYSC_munlock mm/mlock.c:651
> >>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > ...
> >
> >> Previous read of size 8 by thread T398 (K2623, CPU2):
> >>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> >>  [< inlined>] rmap_walk_file mm/rmap.c:1540
> >>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
> >>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> >>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> >>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> >>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> >>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> >>  [< inlined>] SYSC_munlock mm/mlock.c:651
> >>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > Okay, the detected race is mlock/munlock vs. rmap.
> >
> > On rmap side we check vma->vm_flags in few places without taking
> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold
> > i_mmap_rwsem or anon_vma_lock, but nothing prevent vma->vm_flags from
> > changing under us.
> >
> > In this particular case, speculative check in beginning of
> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later
> > in the function.
>
> So you're suggesting that this isn't the cause of the bad page flags
> error observed by Andrey and myself?
I don't see it, but who knows.

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>  [< inlined>] rmap_walk_file mm/rmap.c:1540
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On rmap side we check vma->vm_flags in few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> us.
>
> In this particular case, speculative check in beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.

So you're suggesting that this isn't the cause of the bad page flags
error observed by Andrey and myself?
Thanks,
Sasha
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 09, 2015 at 05:27:16PM +0200, Vlastimil Babka wrote:
> On 09/07/2015 01:40 PM, Kirill A. Shutemov wrote:
> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> > > ==
> > > ThreadSanitizer: data-race in munlock_vma_pages_range
> > >
> > > Write of size 8 by thread T378 (K2633, CPU3):
> > >  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> > >  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> > >  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> > >  [< inlined>] SYSC_munlock mm/mlock.c:651
> > >  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> > >  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > ...
> >
> > > Previous read of size 8 by thread T398 (K2623, CPU2):
> > >  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> > >  [< inlined>] rmap_walk_file mm/rmap.c:1540
> > >  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
> > >  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> > >  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> > >  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> > >  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> > >  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> > >  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> > >  [< inlined>] SYSC_munlock mm/mlock.c:651
> > >  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> > >  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > Okay, the detected race is mlock/munlock vs. rmap.
> >
> > On rmap side we check vma->vm_flags in few places without taking
> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold
> > i_mmap_rwsem or anon_vma_lock, but nothing prevent vma->vm_flags from
> > changing under us.
> >
> > In this particular case, speculative check in beginning of
> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later
> > in the function.
> >
> > False-negative is fine too here, since we will mlock the page in
> > __mm_populate() on mlock side after mlock_fixup().
> >
> > BUT.
> > We *must* have all speculative vm_flags accesses wrapped READ_ONCE() to
> > avoid all compiler trickery, like duplication of a vm_flags access with
> > inconsistent results.
>
> Doesn't taking a semaphore, as in try_to_unmap_one(), already imply a
> compiler barrier forcing vm_flags to be re-read?

Yes, but it doesn't prevent the compiler from generating multiple reads
from vma->vm_flags, and it may blow up if the two values don't match.

> > I looked only on VM_LOCKED checks, but there are few other flags
> > checked in rmap. All of them must be handled carefully. At least
> > READ_ONCE() is required.
> >
> > Other solution would be to introduce per-vma spinlock to protect
> > vma->vm_flags and probably other vma fields and offload this duty
> > from mmap_sem.
> > But that's much bigger project.
>
> Sounds like an overkill, unless we find something more serious than this.

May be...

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/07/2015 01:40 PM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>  [< inlined>] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On rmap side we check vma->vm_flags in few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> us.
>
> In this particular case, speculative check in beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.
>
> False-negative is fine too here, since we will mlock the page in
> __mm_populate() on mlock side after mlock_fixup().
>
> BUT.
>
> We *must* have all speculative vm_flags accesses wrapped READ_ONCE() to
> avoid all compiler trickery, like duplication vm_flags access with
> inconsistent results.

Doesn't taking a semaphore, as in try_to_unmap_one(), already imply a
compiler barrier forcing vm_flags to be re-read?

> I looked only on VM_LOCKED checks, but there are few other flags checked
> in rmap. All of them must be handled carefully. At least READ_ONCE() is
> required.
>
> Other solution would be to introduce per-vma spinlock to protect
> vma->vm_flags and probably other vma fields and offload this duty
> from mmap_sem.
> But that's much bigger project.

Sounds like an overkill, unless we find something more serious than this.
Re: Multiple potential races on vma->vm_flags
On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>  [< inlined>] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On rmap side we check vma->vm_flags in few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> us.
>
> In this particular case, speculative check in beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.

So you're suggesting that this isn't the cause of the bad page flags error
observed by Andrey and myself?
Thanks,
Sasha
Multiple potential races on vma->vm_flags
On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> ==
> ThreadSanitizer: data-race in munlock_vma_pages_range
>
> Write of size 8 by thread T378 (K2633, CPU3):
>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

...

> Previous read of size 8 by thread T398 (K2623, CPU2):
>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [< inlined>] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Okay, the detected race is mlock/munlock vs. rmap.

On the rmap side we check vma->vm_flags in a few places without taking
vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
us.

In this particular case, the speculative check at the beginning of
try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
the function.

A false negative is fine here too, since we will mlock the page in
__mm_populate() on the mlock side after mlock_fixup().

BUT.

We *must* have all speculative vm_flags accesses wrapped in READ_ONCE() to
avoid compiler trickery, like duplicating the vm_flags access with
inconsistent results.

I looked only at the VM_LOCKED checks, but there are a few other flags
checked in rmap. All of them must be handled carefully. At least
READ_ONCE() is required.

Another solution would be to introduce a per-vma spinlock to protect
vma->vm_flags and probably other vma fields, and offload this duty from
mmap_sem. But that's a much bigger project.

--
Kirill A. Shutemov