Re: Multiple potential races on vma->vm_flags
On Wed, Oct 14, 2015 at 12:33 AM, Hugh Dickins wrote:
> I think I've found the answer to that at last: we were indeed
> all looking in the wrong direction.  Your ktsan tree shows
>
> static __always_inline int atomic_add_negative(int i, atomic_t *v)
> {
> #ifndef CONFIG_KTSAN
> 	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
> #else
> 	return (ktsan_atomic32_fetch_add((void *)v, i,
> 		ktsan_memory_order_acq_rel) + i) < 0;
> #endif
> }
>
> but ktsan_atomic32_fetch_add() returns u32: so it looks like
> your implementation of atomic_add_negative() always returns 0,
> and page_remove_file_rmap() never calls clear_page_mlock(), as
> it ought when an Mlocked page has been truncated or punched out.
>
> /proc/meminfo gives you crazy AnonPages and Mapped too, yes?

Yes, you're correct, there was a bug in the KTSAN annotations.
Thank you for finding this. Fixing that bug fixes the bad page
reports. Sorry for troubling you with this.

The race reports are still there though.

>> It seems that your patch doesn't fix the race from the report below,
>> since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
>> (mlock.c:425) is being executed. (Line numbers are from kernel with
>> your patch applied.)
>
> I was not trying to "fix" that with my patch, because I couldn't find
> any problem with the way it reads vm_flags there; I can't even see any
> need for READ_ONCE or more barriers, we have sufficient locking already.
>
> Sure, try_to_unmap_one() may read vm_flags an instant before or after
> a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED;
> but the syscalls (or exit) then work their way up the address space to
> establish the final state, no problem.
> But I am glad you drew attention to the inadequacy of the
> down_read_trylock(mmap_sem) in try_to_unmap_one(), and since posting
> that patch (doing the mlock_vma_page under pt lock instead), I have
> identified one case that it would fix - though it clearly wasn't
> involved in your stacktrace (it's a race with truncating COWed pages,
> but your trace was holepunching, which leaves the COWs alone).
>
> I'll go forward with that patch, but it rather falls into a series
> I was preparing, must finish up all their comments before posting.
>
> Hugh
>
>> ===
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2:
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>>  [] mmput+0x65/0x190 kernel/fork.c:708
>>  [< inline >] exit_mm kernel/exit.c:437
>>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>>
>> Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1:
>>  [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
>>  [< inline >] rmap_walk_file mm/rmap.c:1522
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1541
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>>  [] mmput+0x65/0x190 kernel/fork.c:708
>>  [< inline >] exit_mm kernel/exit.c:437
>>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>> ===
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description.  I'd love to lose that down_read_trylock.
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> >
> > It seems that your patch doesn't fix the race from the report below,
> > since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
> > (mlock.c:425) is being executed. (Line numbers are from kernel with
> > your patch applied.)
>
> I've fired up my HZ_1 patch, and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
>
> Here's a fresh backtrace from my VMs:
>
> [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200
> [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x10(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  16445448 e5dca494 8803f7657708 a70402da
> [1935109.893504]  ea000f288000 8803f7657738 a56e522b 022f80144008
> [1935109.894947]  ea000f288020 ea000f288000 8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102] [] dump_stack+0x4e/0x84
> [1935109.899821] [] bad_page+0x17b/0x210
> [1935109.900469] [] free_pages_prepare+0xb48/0x1110
> [1935109.902127] [] __free_pages_ok+0x21/0x260
> [1935109.904435] [] free_compound_page+0x63/0x80
> [1935109.905614] [] free_transhuge_page+0x6e/0x80

free_transhuge_page belongs to Kirill's THP refcounting patchset,
it's not in 4.3-rc or 4.3.0-rc2-next-20150923 or mmotm.

Well worth testing, thank you, but please make it clear what you are
testing: I'll not spend longer on this one, not at this time.

Hugh

> [1935109.906752] [] __put_compound_page+0x76/0xa0
> [1935109.907884] [] release_pages+0x4d5/0x9f0
> [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957] [] unmap_page_range+0xe73/0x1460
> [1935109.915737] [] unmap_single_vma+0x126/0x2f0
> [1935109.916646] [] unmap_vmas+0xdd/0x190
> [1935109.917454] [] exit_mmap+0x221/0x430
> [1935109.921176] [] mmput+0xb1/0x240
> [1935109.921919] [] do_exit+0x732/0x27c0
> [1935109.928561] [] do_group_exit+0xf9/0x300
> [1935109.929786] [] SyS_exit_group+0x1d/0x20
> [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a
>
> Thanks,
> Sasha
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Andrey Konovalov wrote:
> On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> > This is totally untested, and one of you may quickly prove me wrong;
> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> > then couldn't see why it would need mmap_sem at all, given how mlock
> > and munlock first assert intention by setting or clearing VM_LOCKED
> > in vm_flags, then work their way up the vma, taking pte locks.
> >
> > Calling mlock_vma_page() under pte lock may look suspicious
> > at first: but what it does is similar to clear_page_mlock(),
> > which we regularly call under pte lock from page_remove_rmap().
> >
> > I'd rather wait to hear whether this appears to work in practice,
> > and whether you agree that it should work in theory, before writing
> > the proper description.  I'd love to lose that down_read_trylock.
>
> No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".

I think I've found the answer to that at last: we were indeed
all looking in the wrong direction.  Your ktsan tree shows

static __always_inline int atomic_add_negative(int i, atomic_t *v)
{
#ifndef CONFIG_KTSAN
	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
#else
	return (ktsan_atomic32_fetch_add((void *)v, i,
		ktsan_memory_order_acq_rel) + i) < 0;
#endif
}

but ktsan_atomic32_fetch_add() returns u32: so it looks like
your implementation of atomic_add_negative() always returns 0,
and page_remove_file_rmap() never calls clear_page_mlock(), as
it ought when an Mlocked page has been truncated or punched out.

/proc/meminfo gives you crazy AnonPages and Mapped too, yes?

> It seems that your patch doesn't fix the race from the report below,
> since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
> (mlock.c:425) is being executed. (Line numbers are from kernel with
> your patch applied.)
I was not trying to "fix" that with my patch, because I couldn't find
any problem with the way it reads vm_flags there; I can't even see any
need for READ_ONCE or more barriers, we have sufficient locking already.

Sure, try_to_unmap_one() may read vm_flags an instant before or after
a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED;
but the syscalls (or exit) then work their way up the address space to
establish the final state, no problem.

But I am glad you drew attention to the inadequacy of the
down_read_trylock(mmap_sem) in try_to_unmap_one(), and since posting
that patch (doing the mlock_vma_page under pt lock instead), I have
identified one case that it would fix - though it clearly wasn't
involved in your stacktrace (it's a race with truncating COWed pages,
but your trace was holepunching, which leaves the COWs alone).

I'll go forward with that patch, but it rather falls into a series
I was preparing, must finish up all their comments before posting.

Hugh

> ===
> ThreadSanitizer: data-race in munlock_vma_pages_range
>
> Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2:
>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [] mmput+0x65/0x190 kernel/fork.c:708
>  [< inline >] exit_mm kernel/exit.c:437
>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
>
> Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1:
>  [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208
>  [< inline >] rmap_walk_file mm/rmap.c:1522
>  [] rmap_walk+0x147/0x450 mm/rmap.c:1541
>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405
>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [< inline >] munlock_vma_pages_all mm/internal.h:252
>  [] exit_mmap+0x163/0x190 mm/mmap.c:2824
>  [] mmput+0x65/0x190 kernel/fork.c:708
>  [< inline >] exit_mm kernel/exit.c:437
>  [] do_exit+0x457/0x1400 kernel/exit.c:733
>  [] do_group_exit+0x7f/0x140 kernel/exit.c:874
>  [] get_signal+0x375/0xa70 kernel/signal.c:2353
>  [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704
>  [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749
>  [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329
> ===
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 08:42:26PM -0400, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description.  I'd love to lose that down_read_trylock.
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> >
> > It seems that your patch doesn't fix the race from the report below,
> > since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;'
> > (mlock.c:425) is being executed. (Line numbers are from kernel with
> > your patch applied.)
>
> I've fired up my HZ_1 patch,

Can we get the HZ_1 thing into upstream? Under KERNEL_DEBUG, or something?

> and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
>
> Here's a fresh backtrace from my VMs:
>
> [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200
> [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x10(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  16445448 e5dca494 8803f7657708 a70402da
> [1935109.893504]  ea000f288000 8803f7657738 a56e522b 022f80144008
> [1935109.894947]  ea000f288020 ea000f288000 8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102] [] dump_stack+0x4e/0x84
> [1935109.899821] [] bad_page+0x17b/0x210
> [1935109.900469] [] free_pages_prepare+0xb48/0x1110
> [1935109.902127] [] __free_pages_ok+0x21/0x260
> [1935109.904435] [] free_compound_page+0x63/0x80
> [1935109.905614] [] free_transhuge_page+0x6e/0x80
> [1935109.906752] [] __put_compound_page+0x76/0xa0
> [1935109.907884] [] release_pages+0x4d5/0x9f0
> [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957] [] unmap_page_range+0xe73/0x1460
> [1935109.915737] [] unmap_single_vma+0x126/0x2f0
> [1935109.916646] [] unmap_vmas+0xdd/0x190
> [1935109.917454] [] exit_mmap+0x221/0x430
> [1935109.921176] [] mmput+0xb1/0x240
> [1935109.921919] [] do_exit+0x732/0x27c0
> [1935109.928561] [] do_group_exit+0xf9/0x300
> [1935109.929786] [] SyS_exit_group+0x1d/0x20
> [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a

Would it make any difference if you add mmap_sem protection in exit_mmap?

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/23, Sasha Levin wrote:
>
> Another similar trace where we see a problem during process exit:
>
> [1922964.887922] kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [1922964.890234] Modules linked in:
> [1922964.890844] CPU: 1 PID: 21477 Comm: trinity-c161 Tainted: G W 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1922964.892584] task: 880251858000 ti: 88009f258000 task.ti: 88009f258000
> [1922964.893723] RIP: acct_collect (kernel/acct.c:542)

 530	void acct_collect(long exitcode, int group_dead)
 531	{
 532		struct pacct_struct *pacct = &current->signal->pacct;
 533		cputime_t utime, stime;
 534		unsigned long vsize = 0;
 535
 536		if (group_dead && current->mm) {
 537			struct vm_area_struct *vma;
 538
 539			down_read(&current->mm->mmap_sem);
 540			vma = current->mm->mmap;
 541			while (vma) {
 542				vsize += vma->vm_end - vma->vm_start;	// <-- RIP
 543				vma = vma->vm_next;
 544			}
 545			up_read(&current->mm->mmap_sem);
 546		}

> [1922964.895105] RSP: :88009f25f908 EFLAGS: 00010207
> [1922964.895935] RAX: dc00 RBX: 2ce0 RCX:
> [1922964.897008] RDX: 2152b153 RSI: 059c2000 RDI: 2ce10007
> [1922964.898091] RBP: 88009f25f9e8 R08: 0001 R09: 03ef
> [1922964.899178] R10: ed014d7a3a01 R11: 0001 R12: 880082b485c0
> [1922964.901643] R13: 2152b153 R14: 110013e4bf24 R15: 88009f25f9c0

...

>    0:	02 00                	add    (%rax),%al
>    2:	0f 85 9d 05 00 00    	jne    0x5a5
>    8:	48 8b 1b             	mov    (%rbx),%rbx
>    b:	48 85 db             	test   %rbx,%rbx
>    e:	0f 84 7b 05 00 00    	je     0x58f

Probably "mov (%rbx),%rbx" is "vma = mm->mmap",

>   14:	48 b8 00 00 00 00 00 	movabs $0xdc00,%rax
>   1b:	fc ff df
>   1e:	31 d2                	xor    %edx,%edx
>   20:	48 8d 7b 08          	lea    0x8(%rbx),%rdi

and this loads the addr of vma->vm_end for kasan,

>   24:	48 89 fe             	mov    %rdi,%rsi
>   27:	48 c1 ee 03          	shr    $0x3,%rsi
>   2b:*	80 3c 06 00          	cmpb   $0x0,(%rsi,%rax,1)	<-- trapping instruction

which reports the error. But in this case this is not a NULL deref;
note that $rbx = 2ce0 and this is below __PAGE_OFFSET (but above
TASK_SIZE_MAX). It seems it is not even canonical. In any case this
odd value can't be valid. Again, looks like the mm->mmap pointer was
corrupted.

Perhaps you can re-test with the stupid patch below. But unlikely it
will help. If mm was freed we would probably see something else.

Oleg.
---

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -672,6 +672,7 @@ struct mm_struct *mm_alloc(void)
 void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
+	BUG_ON(atomic_read(&mm->mm_users));
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
Re: Multiple potential races on vma->vm_flags
On 09/24, Andrey Ryabinin wrote:
>
> 2015-09-24 20:26 GMT+03:00 Oleg Nesterov:
> > On 09/24, Sasha Levin wrote:
> >>
> >> void unmap_vmas(struct mmu_gather *tlb,
> >> 		struct vm_area_struct *vma, unsigned long start_addr,
> >> 		unsigned long end_addr)
> >> {
> >> 	struct mm_struct *mm = vma->vm_mm;
> >>
> >> 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> >> 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> >> 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
> >> 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> >> }
> >
> > And I do not see any dereference at this line,
>
> I noticed that addr2line sometimes doesn't work reliably on
> compiler-instrumented code. I've seen it a couple of times point
> to the next line of code.

Yes, I know that we can't trust it. That is why I think (at least in this
particular case) function+offset would be more helpful. And we probably need
more asm.

> >> >>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
> >> >>  6:	85 22                   test   %esp,(%rdx)
> >> >>  8:	01 00                   add    %eax,(%rax)
> >> >>  a:	00 48 8b                add    %cl,-0x75(%rax)
> >> >>  d:	43                      rex.XB
> >> >>  e:	40                      rex
> >> >>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> >> >> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
> >> >> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
> >> >> 21:	fc ff df
> >> >> 24:	48 89 fa                mov    %rdi,%rdx
> >> >> 27:	48 c1 ea 03             shr    $0x3,%rdx
> >> >> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
> >> >> 2f:	0f 85 ee 00 00 00       jne    0x123
> >> >> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
> >> >> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> >> >> 40:	00
> >> >
> >> > And I do not see anything similar in "objdump -d". So could you at least
> >> > show mm/memory.c:1337 in your tree?
> >> >
> >> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
> >> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> >> > (I can't check, my gcc is very old). Or what?
> >>
> >> This is indeed kasan code. 0xdc00 is the shadow base, and you see
> >> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.
> >
> > and thus this asm can't help, right?
>
> I think it can.
>
> > So how can we figure out where exactly the kernel hits NULL? And what
> > exactly does it try to dereference?
>
> So we tried to dereference 0x4c8. That 0x4c8 is probably an offset in some
> struct. The only big struct here is mm_struct. So I think that we tried to
> dereference a NULL mm, and this asm:
>
> 	cmpq $0x0,0x4c8(%rax)
>
> is likely from the inlined mm_has_notifiers():
>
> 	static inline int mm_has_notifiers(struct mm_struct *mm)
> 	{
> 		return unlikely(mm->mmu_notifier_mm);
> 	}

Looks reasonable... Thanks.

I was going to say that this is impossible because the caller should have
crashed if ->mm == NULL. But unmap_vmas() uses mm = vma->vm_mm, so it looks
like this vma or mm->mmap was corrupted...

Oleg.
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 08:42:26PM -0400, Sasha Levin wrote:
> On 09/23/2015 09:08 AM, Andrey Konovalov wrote:
> > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote:
> >> > This is totally untested, and one of you may quickly prove me wrong;
> >> > but I went in to fix your "Bad page state (mlocked)" by holding pte
> >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(),
> >> > then couldn't see why it would need mmap_sem at all, given how mlock
> >> > and munlock first assert intention by setting or clearing VM_LOCKED
> >> > in vm_flags, then work their way up the vma, taking pte locks.
> >> >
> >> > Calling mlock_vma_page() under pte lock may look suspicious
> >> > at first: but what it does is similar to clear_page_mlock(),
> >> > which we regularly call under pte lock from page_remove_rmap().
> >> >
> >> > I'd rather wait to hear whether this appears to work in practice,
> >> > and whether you agree that it should work in theory, before writing
> >> > the proper description. I'd love to lose that down_read_trylock.
> >
> > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)".
> >
> > It seems that your patch doesn't fix the race from the report below, since
> > pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425)
> > is being executed. (Line numbers are from a kernel with your patch applied.)
>
> I've fired up my HZ_1 patch,

Can we get the HZ_1 thing into upstream? Under KERNEL_DEBUG, or something?

> and this seems to be a real race that is
> somewhat easy to reproduce under those conditions.
> Here's a fresh backtrace from my VMs:
>
> [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200
> [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0
> [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked)
> [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [1935109.888197] bad because of flags:
> [1935109.888759] flags: 0x10(mlocked)
> [1935109.889525] Modules linked in:
> [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569
> [1935109.891876]  16445448 e5dca494 8803f7657708 a70402da
> [1935109.893504]  ea000f288000 8803f7657738 a56e522b 022f80144008
> [1935109.894947]  ea000f288020 ea000f288000  8803f76577a8
> [1935109.896413] Call Trace:
> [1935109.899102]  [] dump_stack+0x4e/0x84
> [1935109.899821]  [] bad_page+0x17b/0x210
> [1935109.900469]  [] free_pages_prepare+0xb48/0x1110
> [1935109.902127]  [] __free_pages_ok+0x21/0x260
> [1935109.904435]  [] free_compound_page+0x63/0x80
> [1935109.905614]  [] free_transhuge_page+0x6e/0x80
> [1935109.906752]  [] __put_compound_page+0x76/0xa0
> [1935109.907884]  [] release_pages+0x4d5/0x9f0
> [1935109.913027]  [] tlb_flush_mmu_free+0x8a/0x120
> [1935109.913957]  [] unmap_page_range+0xe73/0x1460
> [1935109.915737]  [] unmap_single_vma+0x126/0x2f0
> [1935109.916646]  [] unmap_vmas+0xdd/0x190
> [1935109.917454]  [] exit_mmap+0x221/0x430
> [1935109.921176]  [] mmput+0xb1/0x240
> [1935109.921919]  [] do_exit+0x732/0x27c0
> [1935109.928561]  [] do_group_exit+0xf9/0x300
> [1935109.929786]  [] SyS_exit_group+0x1d/0x20
> [1935109.930617]  [] entry_SYSCALL_64_fastpath+0x16/0x7a

Would it make any difference if you add mmap_sem protection in exit_mmap()?

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/24/2015 02:52 PM, Andrey Ryabinin wrote:
> Sasha, could you confirm that in your kernel the mmu_notifier_mm field has
> offset 0x4c8?
> I would use gdb for that:
> 	gdb vmlinux
> 	(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)

(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
$1 = 0x4c8

Thanks,
Sasha
Re: Multiple potential races on vma->vm_flags
2015-09-24 20:26 GMT+03:00 Oleg Nesterov:
> On 09/24, Sasha Levin wrote:
>>
>> On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
>> >
>> > Well, I know absolutely nothing about kasan, to the point I can't even
>> > understand where this message comes from. grep didn't help. But this
>> > doesn't matter...
>>
>> The reason behind this message is that NULL ptr derefs when using kasan
>> are manifested as GPFs. This is because in order to validate an access
>> to a given memory address kasan would check (shadow_base + (mem_offset >> 3)),
>> so in the case of a NULL it would try to access shadow_base + 0, which
>> would GPF.
>
> OK, so this just means the kernel derefs the NULL pointer,
>
>> I'm running -next + Kirill's THP patchset.
>>
>> void unmap_vmas(struct mmu_gather *tlb,
>> 		struct vm_area_struct *vma, unsigned long start_addr,
>> 		unsigned long end_addr)
>> {
>> 	struct mm_struct *mm = vma->vm_mm;
>>
>> 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
>> 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>> 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
>> 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
>> }
>
> And I do not see any dereference at this line,

I noticed that addr2line sometimes doesn't work reliably on
compiler-instrumented code. I've seen it a couple of times point to the
next line of code.

>> >>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>> >>  6:	85 22                   test   %esp,(%rdx)
>> >>  8:	01 00                   add    %eax,(%rax)
>> >>  a:	00 48 8b                add    %cl,-0x75(%rax)
>> >>  d:	43                      rex.XB
>> >>  e:	40                      rex
>> >>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>> >> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
>> >> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
>> >> 21:	fc ff df
>> >> 24:	48 89 fa                mov    %rdi,%rdx
>> >> 27:	48 c1 ea 03             shr    $0x3,%rdx
>> >> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
>> >> 2f:	0f 85 ee 00 00 00       jne    0x123
>> >> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
>> >> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>> >> 40:	00
>> >
>> > And I do not see anything similar in "objdump -d". So could you at least
>> > show mm/memory.c:1337 in your tree?
>> >
>> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
>> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
>> > (I can't check, my gcc is very old). Or what?
>>
>> This is indeed kasan code. 0xdc00 is the shadow base, and you see
>> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.
>
> and thus this asm can't help, right?

I think it can.

> So how can we figure out where exactly the kernel hits NULL? And what
> exactly does it try to dereference?

So we tried to dereference 0x4c8. That 0x4c8 is probably an offset in some
struct. The only big struct here is mm_struct. So I think that we tried to
dereference a NULL mm, and this asm:

	cmpq $0x0,0x4c8(%rax)

is likely from the inlined mm_has_notifiers():

	static inline int mm_has_notifiers(struct mm_struct *mm)
	{
		return unlikely(mm->mmu_notifier_mm);
	}

Sasha, could you confirm that in your kernel the mmu_notifier_mm field has
offset 0x4c8? I would use gdb for that:

	gdb vmlinux
	(gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
Re: Multiple potential races on vma->vm_flags
On 09/24, Sasha Levin wrote:
>
> On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
> >
> > Well, I know absolutely nothing about kasan, to the point I can't even
> > understand where this message comes from. grep didn't help. But this
> > doesn't matter...
>
> The reason behind this message is that NULL ptr derefs when using kasan are
> manifested as GPFs. This is because in order to validate an access to a given
> memory address kasan would check (shadow_base + (mem_offset >> 3)), so in the
> case of a NULL it would try to access shadow_base + 0, which would GPF.

OK, so this just means the kernel derefs the NULL pointer,

> I'm running -next + Kirill's THP patchset.
>
> void unmap_vmas(struct mmu_gather *tlb,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		unsigned long end_addr)
> {
> 	struct mm_struct *mm = vma->vm_mm;
>
> 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
> 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> }

And I do not see any dereference at this line,

> >>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
> >>  6:	85 22                   test   %esp,(%rdx)
> >>  8:	01 00                   add    %eax,(%rax)
> >>  a:	00 48 8b                add    %cl,-0x75(%rax)
> >>  d:	43                      rex.XB
> >>  e:	40                      rex
> >>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> >> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
> >> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
> >> 21:	fc ff df
> >> 24:	48 89 fa                mov    %rdi,%rdx
> >> 27:	48 c1 ea 03             shr    $0x3,%rdx
> >> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
> >> 2f:	0f 85 ee 00 00 00       jne    0x123
> >> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
> >> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> >> 40:	00
> >
> > And I do not see anything similar in "objdump -d". So could you at least
> > show mm/memory.c:1337 in your tree?
> >
> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> > (I can't check, my gcc is very old). Or what?
>
> This is indeed kasan code. 0xdc00 is the shadow base, and you see
> kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.

and thus this asm can't help, right?

So how can we figure out where exactly the kernel hits NULL? And what
exactly does it try to dereference?

> I hope the information above helped, please let me know if it didn't and you
> need anything else.

Thanks a lot, it does help, but I am still confused. It looks like
"function+offset" is more useful than the line numbers; at least we could
look at mm/memory.s.

Oleg.
Re: Multiple potential races on vma->vm_flags
On 09/24/2015 09:11 AM, Oleg Nesterov wrote:
> On 09/15, Sasha Levin wrote:
>>
>> I've modified my tests to stress the exit path of processes with many vmas,
>> and hit the following NULL ptr deref (not sure if it's related to the
>> original issue):
>
> I am shy to ask. Looks like I am the only stupid one who needs
> more info...
>
>> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>
> Well, I know absolutely nothing about kasan, to the point I can't even
> understand where this message comes from. grep didn't help. But this
> doesn't matter...

The reason behind this message is that NULL ptr derefs when using kasan are
manifested as GPFs. This is because in order to validate an access to a given
memory address kasan would check (shadow_base + (mem_offset >> 3)), so in the
case of a NULL it would try to access shadow_base + 0, which would GPF.

>> [1181047.937223] Modules linked in:
>> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
>> [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0
>> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)
>
> I do not know which tree/branch you use. In Linus's tree mm/memory.c:1337 is
>
> 	struct mm_struct *mm = vma->vm_mm;

I'm running -next + Kirill's THP patchset.

void unmap_vmas(struct mmu_gather *tlb,
		struct vm_area_struct *vma, unsigned long start_addr,
		unsigned long end_addr)
{
	struct mm_struct *mm = vma->vm_mm;

	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);   <--- this
	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}

> but this doesn't match the asm below,
>
>>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>>  6:	85 22                   test   %esp,(%rdx)
>>  8:	01 00                   add    %eax,(%rax)
>>  a:	00 48 8b                add    %cl,-0x75(%rax)
>>  d:	43                      rex.XB
>>  e:	40                      rex
>>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
>> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
>> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
>> 21:	fc ff df
>> 24:	48 89 fa                mov    %rdi,%rdx
>> 27:	48 c1 ea 03             shr    $0x3,%rdx
>> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
>> 2f:	0f 85 ee 00 00 00       jne    0x123
>> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
>> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
>> 40:	00
>
> And I do not see anything similar in "objdump -d". So could you at least
> show mm/memory.c:1337 in your tree?
>
> Hmm. movabs $0xdc00,%rax above looks suspicious, this looks
> like kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
> (I can't check, my gcc is very old). Or what?

This is indeed kasan code. 0xdc00 is the shadow base, and you see
kasan trying to access shadow base + (ptr >> 3), which is why we get a GPF.

Looking at the assembly, the address we were trying to access was:

	RDI: 04c8

> Any chance you can tell us where exactly we hit NULL-deref in unmap_vmas?

I hope the information above helped, please let me know if it didn't and you
need anything else.

Thanks,
Sasha
Re: Multiple potential races on vma->vm_flags
On 09/15, Sasha Levin wrote:
>
> I've modified my tests to stress the exit path of processes with many vmas,
> and hit the following NULL ptr deref (not sure if it's related to the
> original issue):

I am shy to ask. Looks like I am the only stupid one who needs
more info...

> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN

Well, I know absolutely nothing about kasan, to the point I can't even
understand where this message comes from. grep didn't help. But this
doesn't matter...

> [1181047.937223] Modules linked in:
> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554
> [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0
> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337)

I do not know which tree/branch you use. In Linus's tree mm/memory.c:1337 is

	struct mm_struct *mm = vma->vm_mm;

but this doesn't match the asm below,

>  0:	08 80 3c 02 00 0f       or     %al,0xf00023c(%rax)
>  6:	85 22                   test   %esp,(%rdx)
>  8:	01 00                   add    %eax,(%rax)
>  a:	00 48 8b                add    %cl,-0x75(%rax)
>  d:	43                      rex.XB
>  e:	40                      rex
>  f:	48 8d b8 c8 04 00 00    lea    0x4c8(%rax),%rdi
> 16:	48 89 45 d0             mov    %rax,-0x30(%rbp)
> 1a:	48 b8 00 00 00 00 00    movabs $0xdc00,%rax
> 21:	fc ff df
> 24:	48 89 fa                mov    %rdi,%rdx
> 27:	48 c1 ea 03             shr    $0x3,%rdx
> 2b:*	80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)	<-- trapping instruction
> 2f:	0f 85 ee 00 00 00       jne    0x123
> 35:	48 8b 45 d0             mov    -0x30(%rbp),%rax
> 39:	48 83 b8 c8 04 00 00    cmpq   $0x0,0x4c8(%rax)
> 40:	00

And I do not see anything similar in "objdump -d". So could you at least
show mm/memory.c:1337 in your tree?

Hmm. movabs $0xdc00,%rax above looks suspicious; this looks like
kasan_mem_to_shadow(). So perhaps this code was generated by kasan?
(I can't check, my gcc is very old.) Or what?

Any chance you can tell us where exactly we hit the NULL deref in unmap_vmas?

Oleg.
Re: Multiple potential races on vma->vm_flags
On 09/24, Sasha Levin wrote: > > On 09/24/2015 09:11 AM, Oleg Nesterov wrote: > > > > Well, I know absolutely nothing about kasan, to the point I can't even > > unserstand where does this message come from. grep didn't help. But this > > doesn't matter... > > The reason behind this message is that NULL ptr derefs when using kasan are > manifested as GFPs. This is because in order to validate an access to a given > memory address kasan would check (shadow_base + (mem_offset >> 3)), so in the > case of > a NULL it would try to access shadow_base + 0, which would GFP. OK, so this just means the kernele derefs the NULL pointer, > I'm running -next + Kirill's THP patchset. > > > struct mm_struct *mm = vma->vm_mm; > > void unmap_vmas(struct mmu_gather *tlb, > struct vm_area_struct *vma, unsigned long start_addr, > unsigned long end_addr) > { > struct mm_struct *mm = vma->vm_mm; > > mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) > unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- > this > mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); > } And I do not see any dereference at this line, > >>0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) > >>6: 85 22 test %esp,(%rdx) > >>8: 01 00 add%eax,(%rax) > >>a: 00 48 8badd%cl,-0x75(%rax) > >>d: 43 rex.XB > >>e: 40 rex > >>f: 48 8d b8 c8 04 00 00lea0x4c8(%rax),%rdi > >> 16: 48 89 45 d0 mov%rax,-0x30(%rbp) > >> 1a: 48 b8 00 00 00 00 00movabs $0xdc00,%rax > >> 21: fc ff df > >> 24: 48 89 famov%rdi,%rdx > >> 27: 48 c1 ea 03 shr$0x3,%rdx > >> 2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) > >> <-- trapping instruction > >> 2f: 0f 85 ee 00 00 00 jne0x123 > >> 35: 48 8b 45 d0 mov-0x30(%rbp),%rax > >> 39: 48 83 b8 c8 04 00 00cmpq $0x0,0x4c8(%rax) > >> 40: 00 > > > > And I do not see anything similar in "objdump -d". So could you at least > > show mm/memory.c:1337 in your tree? > > > > Hmm. 
movabs $0xdc00,%rax above looks suspicious, this looks > > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan? > > (I can't check, my gcc is very old). Or what? > > This is indeed kasan code. 0xdc00 is the shadow base, and you see > kasan trying to access shadow base + (ptr >> 3), which is why we get GPF. and thus this asm can't help, right? So how can we figure out where exactly the kernel hits NULL? And what exactly does it try to dereference? > I hope the information above helped, please let me know if it didn't and you > need anything else. Thanks a lot, it does help, but I am still confused. It looks like "function+offset" is more useful than the line numbers; at least we could look at mm/memory.s. Oleg.
Re: Multiple potential races on vma->vm_flags
2015-09-24 20:26 GMT+03:00 Oleg Nesterov: > On 09/24, Sasha Levin wrote: >> >> On 09/24/2015 09:11 AM, Oleg Nesterov wrote: >> > >> > Well, I know absolutely nothing about kasan, to the point I can't even >> > understand where does this message come from. grep didn't help. But this >> > doesn't matter... >> >> The reason behind this message is that NULL ptr derefs when using kasan are >> manifested as GPFs. This is because in order to validate an access to a given >> memory address kasan would check (shadow_base + (mem_offset >> 3)), so in >> the case of >> a NULL it would try to access shadow_base + 0, which would GPF. > > OK, so this just means the kernel derefs the NULL pointer, > >> I'm running -next + Kirill's THP patchset. >> >> > struct mm_struct *mm = vma->vm_mm; >> >> void unmap_vmas(struct mmu_gather *tlb, >> struct vm_area_struct *vma, unsigned long start_addr, >> unsigned long end_addr) >> { >> struct mm_struct *mm = vma->vm_mm; >> >> mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); >> for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) >> unmap_single_vma(tlb, vma, start_addr, end_addr, NULL); <--- >> this >> mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); >> } > > And I do not see any dereference at this line, > I've noticed that addr2line sometimes doesn't work reliably on compiler-instrumented code. I've seen a couple of times that it points to the next line of code. 
>> >>0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) >> >>6: 85 22 test %esp,(%rdx) >> >>8: 01 00 add %eax,(%rax) >> >>a: 00 48 8b add %cl,-0x75(%rax) >> >>d: 43 rex.XB >> >>e: 40 rex >> >>f: 48 8d b8 c8 04 00 00 lea 0x4c8(%rax),%rdi >> >> 16: 48 89 45 d0 mov %rax,-0x30(%rbp) >> >> 1a: 48 b8 00 00 00 00 00 movabs $0xdc00,%rax >> >> 21: fc ff df >> >> 24: 48 89 fa mov %rdi,%rdx >> >> 27: 48 c1 ea 03 shr $0x3,%rdx >> >> 2b:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) >> >> <-- trapping instruction >> >> 2f: 0f 85 ee 00 00 00 jne 0x123 >> >> 35: 48 8b 45 d0 mov -0x30(%rbp),%rax >> >> 39: 48 83 b8 c8 04 00 00 cmpq $0x0,0x4c8(%rax) >> >> 40: 00 >> > >> > And I do not see anything similar in "objdump -d". So could you at least >> > show mm/memory.c:1337 in your tree? >> > >> > Hmm. movabs $0xdc00,%rax above looks suspicious, this looks >> > like kasan_mem_to_shadow(). So perhaps this code was generated by kasan? >> > (I can't check, my gcc is very old). Or what? >> >> This is indeed kasan code. 0xdc00 is the shadow base, and you see >> kasan trying to access shadow base + (ptr >> 3), which is why we get GPF. > > and thus this asm can't help, right? > I think it can. > So how can we figure out where exactly the kernel hits NULL? And what > exactly does it try to dereference? So we tried to dereference 0x4c8. That 0x4c8 is probably an offset into some struct. The only big struct here is mm_struct. So I think that we tried to dereference a NULL mm, and this asm: > cmpq $0x0,0x4c8(%rax) is likely from inlined mm_has_notifiers(): static inline int mm_has_notifiers(struct mm_struct *mm) { return unlikely(mm->mmu_notifier_mm); } Sasha, could you confirm that in your kernel mmu_notifier_mm field has 0x4c8 offset? 
I would use gdb for that: gdb vmlinux (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm)
Re: Multiple potential races on vma->vm_flags
On 09/24/2015 02:52 PM, Andrey Ryabinin wrote: > Sasha, could you confirm that in your kernel mmu_notifier_mm field has > 0x4c8 offset? > I would use gdb for that: > gdb vmlinux > (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm) (gdb) p/x &(((struct mm_struct*)0)->mmu_notifier_mm) $1 = 0x4c8 Thanks, Sasha
Re: Multiple potential races on vma->vm_flags
On 09/23/2015 09:08 AM, Andrey Konovalov wrote: > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote: >> > This is totally untested, and one of you may quickly prove me wrong; >> > but I went in to fix your "Bad page state (mlocked)" by holding pte >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), >> > then couldn't see why it would need mmap_sem at all, given how mlock >> > and munlock first assert intention by setting or clearing VM_LOCKED >> > in vm_flags, then work their way up the vma, taking pte locks. >> > >> > Calling mlock_vma_page() under pte lock may look suspicious >> > at first: but what it does is similar to clear_page_mlock(), >> > which we regularly call under pte lock from page_remove_rmap(). >> > >> > I'd rather wait to hear whether this appears to work in practice, >> > and whether you agree that it should work in theory, before writing >> > the proper description. I'd love to lose that down_read_trylock. > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". > > It seems that your patch doesn't fix the race from the report below, since pte > lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) > is being executed. (Line numbers are from kernel with your patch applied.) I've fired up my HZ_1 patch, and this seems to be a real race that is somewhat easy to reproduce under those conditions. 
Here's a fresh backtrace from my VMs: [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200 [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0 [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked) [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [1935109.888197] bad because of flags: [1935109.888759] flags: 0x10(mlocked) [1935109.889525] Modules linked in: [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569 [1935109.891876] 16445448 e5dca494 8803f7657708 a70402da [1935109.893504] ea000f288000 8803f7657738 a56e522b 022f80144008 [1935109.894947] ea000f288020 ea000f288000 8803f76577a8 [1935109.896413] Call Trace: [1935109.899102] [] dump_stack+0x4e/0x84 [1935109.899821] [] bad_page+0x17b/0x210 [1935109.900469] [] free_pages_prepare+0xb48/0x1110 [1935109.902127] [] __free_pages_ok+0x21/0x260 [1935109.904435] [] free_compound_page+0x63/0x80 [1935109.905614] [] free_transhuge_page+0x6e/0x80 [1935109.906752] [] __put_compound_page+0x76/0xa0 [1935109.907884] [] release_pages+0x4d5/0x9f0 [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120 [1935109.913957] [] unmap_page_range+0xe73/0x1460 [1935109.915737] [] unmap_single_vma+0x126/0x2f0 [1935109.916646] [] unmap_vmas+0xdd/0x190 [1935109.917454] [] exit_mmap+0x221/0x430 [1935109.921176] [] mmput+0xb1/0x240 [1935109.921919] [] do_exit+0x732/0x27c0 [1935109.928561] [] do_group_exit+0xf9/0x300 [1935109.929786] [] SyS_exit_group+0x1d/0x20 [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a Thanks, Sasha
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Kirill A. Shutemov wrote: On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: [...] I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? Yeah, I don't think this was ever resolved, but ultimately the patch got reverted[1] because it exposed issues in the form of bad pages (shmem, vmsplice) and corrupted vm_flags while in untrack_pfn(), causing, for example, vm_file to disappear. I hoped Davidlohr would come back with something after the revert, but it never happened. I think the reverted patch was responsible for most of the scalability boost from rwsem for i_mmap_lock... Actually no, the change that got reverted was something we got in at the very last minute, just because it made sense and had the blessing of some key people. The main winner of the series was migration (rmap), which later Hugh addressed more specifically for unmapped pages: https://lkml.org/lkml/2014/11/30/349 So I really didn't care about the reverted patch, and therefore it was never on my radar. [1] https://lkml.org/lkml/2014/12/22/375 Thanks, Davidlohr
Re: Multiple potential races on vma->vm_flags
On 09/15/2015 01:36 PM, Sasha Levin wrote: > On 09/11/2015 09:27 PM, Hugh Dickins wrote: >> > I'm inclined to echo Vlastimil's comment from earlier in the thread: >> > sounds like an overkill, unless we find something more serious than this. > I've modified my tests to stress the exit path of processes with many vmas, > and hit the following NULL ptr deref (not sure if it's related to the > original issue): > > [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory > access general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN > [1181047.937223] Modules linked in: > [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted > 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 > [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: > 880433f0 > [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) > [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 > [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: > > [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: > 04c8 > [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: > 15c59e01 > [1181047.944604] R10: R11: 0001 R12: > > [1181047.944833] pps pps0: PPS event at 21837.866101174 > [1181047.944838] pps pps0: capture assert seq #7188 > [1181047.946261] R13: R14: 880433f07910 R15: > 2e0d > [1181047.947005] FS: () GS:88025200() > knlGS: > [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b > [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: > 06a0 > [1181047.949085] Stack: > [1181047.949350] 880433f07910 1100867e0f1e > dc00 > [1181047.950164] 8801d825d000 2e0d 880433f079d0 > 9276c4ab > [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 > 9276c2a0 > [1181047.951906] Call Trace: > [1181047.952201] exit_mmap (mm/mmap.c:2856) > [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) > [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.954281] ? 
rcu_read_lock_sched_held (kernel/rcu/update.c:109) > [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 > mm/slub.c:2746) > [1181047.955535] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 > kernel/fork.c:702) > [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 > include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) > [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) > [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) > [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) > [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) > [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 > kernel/exit.c:859) > [1181047.965569] get_signal (kernel/signal.c:2353) > [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) > [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 > fs/read_write.c:821) > [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) > [1181047.969126] ? vfs_write (fs/read_write.c:777) > [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) > [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 > include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) > [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 > kernel/time/itimer.c:227) > [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) > [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) > [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) > [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) > [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 > c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 > <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 > All code > >0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) >6: 85 22 test %esp,(%rdx) >8: 01 00
Re: Multiple potential races on vma->vm_flags
On 09/11, Kirill A. Shutemov wrote: > > This one is tricky. I *assume* the mm cannot be generally accessible after > mm_users drops to zero, but I'm not entirely sure about it. > procfs? ptrace? Well, all I can say is that proc/ptrace look fine afaics... This is off-topic, but how about the patch below? Different threads can expand different vma's at the same time under read_lock(mmap_sem), so vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Oleg. --- x/mm/mmap.c +++ x/mm/mmap.c @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a if (security_vm_enough_memory_mm(mm, grow)) return -ENOMEM; - /* Ok, everything looks good - let it rip */ - if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); return 0; } @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; anon_vma_interval_tree_post_update_vma(vma); @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On 09/23, Oleg Nesterov wrote: > > On 09/11, Kirill A. Shutemov wrote: > > > > This one is tricky. I *assume* the mm cannot be generally accessible after > > mm_users drops to zero, but I'm not entirely sure about it. > > procfs? ptrace? > > Well, all I can say is that proc/ptrace look fine afaics... > > This is off-topic, but how about the patch below? Different threads can > expand different vma's at the same time under read_lock(mmap_sem), so > vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Perhaps vm_stat_account() should be moved too, but total_vm/etc is less important. Or did I miss something? Oleg. > --- x/mm/mmap.c > +++ x/mm/mmap.c > @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a > if (security_vm_enough_memory_mm(mm, grow)) > return -ENOMEM; > > - /* Ok, everything looks good - let it rip */ > - if (vma->vm_flags & VM_LOCKED) > - mm->locked_vm += grow; > vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); > return 0; > } > @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_end = address; > anon_vma_interval_tree_post_update_vma(vma); > @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_start = address; > vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote: > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". It seems that your patch doesn't fix the race from the report below, since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) is being executed. (Line numbers are from kernel with your patch applied.) 
=== ThreadSanitizer: data-race in munlock_vma_pages_range Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2: [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1: [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208 [< inline >] rmap_walk_file mm/rmap.c:1522 [] rmap_walk+0x147/0x450 mm/rmap.c:1541 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 ===
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins wrote: > > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > >> If anybody comes up with a patch to fix the original issue I easily > > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > > >> fuzzing with KTSAN and Trinity. > > > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > > your report of it: please repost - thanks - though on seeing it, I may > > > well end up with no ideas. > > > > The report is below. > > Thanks. > > > > > I get it after a few seconds of running Trinity on a kernel with KTSAN > > and targeting mlock, munlock and madvise syscalls. > > Sasha also observed a very similar crash a while ago > > (https://lkml.org/lkml/2014/11/6/1055). > > I didn't manage to reproduce this in a kernel build without KTSAN though. > > The idea was that data races KTSAN reports might be the explanation of > > these crashes. 
> > > > BUG: Bad page state in process trinity-c15 pfn:281999 > > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > bad because of flags: > > flags: 0x20(mlocked) > > Modules linked in: > > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > > 821c3b70 00014741 8800b857f948 > > 81e9926c 0003 ea000a066640 8800b857f978 > > 811ce045 821c3b70 ea000a066640 0001 > > Call Trace: > > [< inline >] __dump_stack lib/dump_stack.c:15 > > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > > [< inline >] free_pages_check mm/page_alloc.c:731 > > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > > [] free_hot_cold_page_list+0x5f/0x100 > > mm/page_alloc.c:1956 (discriminator 3) > > [] release_pages+0x151/0x300 mm/swap.c:967 > > [] __pagevec_release+0x43/0x60 mm/swap.c:984 > > [< inline >] pagevec_release include/linux/pagevec.h:69 > > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > > [< inline >] madvise_remove mm/madvise.c:326 > > [< inline >] madvise_vma mm/madvise.c:378 > > [< inline >] SYSC_madvise mm/madvise.c:528 > > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > > [] ? 
kt_atomic64_store+0x76/0x130 > > mm/ktsan/sync_atomic.c:161 > > [] entry_SYSCALL_64_fastpath+0x31/0x95 > > arch/x86/entry/entry_64.S:188 > > Disabling lock debugging due to kernel taint > > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). Indeed. Looks fishy. But probably will work. It was not obvious to me what makes clearing VM_LOCKED visible in try_to_unmap_one() without mmap_sem. After looking some more, it seems we would rely on page_check_address() in try_to_unmap_one() having acquire semantics and follow_page_mask() in munlock_vma_pages_range() having release semantics. But I would prefer to have something more explicit. That's a mess. > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. > > You mention how Sasha hit the "Bad page state (mlocked)" back in > November: that was one of the reasons we reverted Davidlohr's > i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), > without understanding why it was needed. Yes, it would lock out > a concurrent try_to_unmap(), whose setting of PageMlocked was not > sufficiently serialized by the down_read_trylock of mmap_sem. > > But I don't remember the other reasons for that revert (and > haven't looked very hard as yet): anyone else remember? 
I hoped Davidlohr would come back with something after the revert, but it never happened. I think the reverted patch was responsible for most of the scalability boost from rwsem for i_mmap_lock... -- Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickinswrote: > > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > > >> If anybody comes up with a patch to fix the original issue I easily > > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > > >> fuzzing with KTSAN and Trinity. > > > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > > your report of it: please repost - thanks - though on seeing it, I may > > > well end up with no ideas. > > > > The report is below. > > Thanks. > > > > > I get it after a few seconds of running Trinity on a kernel with KTSAN > > and targeting mlock, munlock and madvise syscalls. > > Sasha also observed a very similar crash a while ago > > (https://lkml.org/lkml/2014/11/6/1055). > > I didn't manage to reproduce this in a kernel build without KTSAN though. > > The idea was that data races KTSAN reports might be the explanation of > > these crashes. 
> > > > BUG: Bad page state in process trinity-c15 pfn:281999 > > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > > bad because of flags: > > flags: 0x20(mlocked) > > Modules linked in: > > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > > 821c3b70 00014741 8800b857f948 > > 81e9926c 0003 ea000a066640 8800b857f978 > > 811ce045 821c3b70 ea000a066640 0001 > > Call Trace: > > [< inline >] __dump_stack lib/dump_stack.c:15 > > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > > [< inline >] free_pages_check mm/page_alloc.c:731 > > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > > [] free_hot_cold_page_list+0x5f/0x100 > > mm/page_alloc.c:1956 (discriminator 3) > > [] release_pages+0x151/0x300 mm/swap.c:967 > > [] __pagevec_release+0x43/0x60 mm/swap.c:984 > > [< inline >] pagevec_release include/linux/pagevec.h:69 > > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > > [< inline >] madvise_remove mm/madvise.c:326 > > [< inline >] madvise_vma mm/madvise.c:378 > > [< inline >] SYSC_madvise mm/madvise.c:528 > > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > > [] ? 
kt_atomic64_store+0x76/0x130 > > mm/ktsan/sync_atomic.c:161 > > [] entry_SYSCALL_64_fastpath+0x31/0x95 > > arch/x86/entry/entry_64.S:188 > > Disabling lock debugging due to kernel taint > > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). Indeed. Looks fishy. But probably will work. That was not obvious for me what makes clearing VM_LOCKED visible in try_to_unmap_one() without mmap_sem. After looking some more it seems we would rely on page_check_address() in try_to_unmap_one() having acquire semantics and follow_page_mask() in munlock_vma_pages_range() having release semantics. But I would prefer to have something more explicit. That's mess. > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. > > You mention how Sasha hit the "Bad page state (mlocked)" back in > November: that was one of the reasons we reverted Davidlohr's > i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), > without understanding why it was needed. Yes, it would lock out > a concurrent try_to_unmap(), whose setting of PageMlocked was not > sufficiently serialized by the down_read_trylock of mmap_sem. > > But I don't remember the other reasons for that revert (and > haven't looked very hard as yet): anyone else remember? 
I hoped Davidlohr will come back with something after the revert, but it never happend. I think the reverted patch was responsible for most of scalability boost from rwsem for i_mmap_lock... -- Kirill A.
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickinswrote: > This is totally untested, and one of you may quickly prove me wrong; > but I went in to fix your "Bad page state (mlocked)" by holding pte > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), > then couldn't see why it would need mmap_sem at all, given how mlock > and munlock first assert intention by setting or clearing VM_LOCKED > in vm_flags, then work their way up the vma, taking pte locks. > > Calling mlock_vma_page() under pte lock may look suspicious > at first: but what it does is similar to clear_page_mlock(), > which we regularly call under pte lock from page_remove_rmap(). > > I'd rather wait to hear whether this appears to work in practice, > and whether you agree that it should work in theory, before writing > the proper description. I'd love to lose that down_read_trylock. No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". It seems that your patch doesn't fix the race from the report below, since pte lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) is being executed. (Line numbers are from kernel with your patch applied.) 
=== ThreadSanitizer: data-race in munlock_vma_pages_range Write at 0x880282a93290 of size 8 by thread 2546 on CPU 2: [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 Previous read at 0x880282a93290 of size 8 by thread 2545 on CPU 1: [] try_to_unmap_one+0x6a/0x450 mm/rmap.c:1208 [< inline >] rmap_walk_file mm/rmap.c:1522 [] rmap_walk+0x147/0x450 mm/rmap.c:1541 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1405 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476 [< inline >] munlock_vma_pages_all mm/internal.h:252 [] exit_mmap+0x163/0x190 mm/mmap.c:2824 [] mmput+0x65/0x190 kernel/fork.c:708 [< inline >] exit_mm kernel/exit.c:437 [] do_exit+0x457/0x1400 kernel/exit.c:733 [] do_group_exit+0x7f/0x140 kernel/exit.c:874 [] get_signal+0x375/0xa70 kernel/signal.c:2353 [] do_signal+0x2c/0xad0 arch/x86/kernel/signal.c:704 [] do_notify_resume+0x7d/0x80 arch/x86/kernel/signal.c:749 [] int_signal+0x12/0x17 arch/x86/entry/entry_64.S:329 === -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On 09/23, Oleg Nesterov wrote: > > On 09/11, Kirill A. Shutemov wrote: > > > > This one is tricky. I *assume* the mm cannot be generally accessible after > > mm_users drops to zero, but I'm not entirely sure about it. > > procfs? ptrace? > > Well, all I can say is that proc/ptrace look fine afaics... > > This is off-topic, but how about the patch below? Different threads can > expand different vma's at the same time under read_lock(mmap_sem), so > vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Perhaps vm_stat_account() should be moved too, but total_vm/etc is less important. Or did I miss something? Oleg. > --- x/mm/mmap.c > +++ x/mm/mmap.c > @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a > if (security_vm_enough_memory_mm(mm, grow)) > return -ENOMEM; > > - /* Ok, everything looks good - let it rip */ > - if (vma->vm_flags & VM_LOCKED) > - mm->locked_vm += grow; > vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); > return 0; > } > @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_end = address; > anon_vma_interval_tree_post_update_vma(vma); > @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru > * against concurrent vma expansions. > */ > spin_lock(&vma->vm_mm->page_table_lock); > + if (vma->vm_flags & VM_LOCKED) > + mm->locked_vm += grow; > anon_vma_interval_tree_pre_update_vma(vma); > vma->vm_start = address; > vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On 09/11, Kirill A. Shutemov wrote: > > This one is tricky. I *assume* the mm cannot be generally accessible after > mm_users drops to zero, but I'm not entirely sure about it. > procfs? ptrace? Well, all I can say is that proc/ptrace look fine afaics... This is off-topic, but how about the patch below? Different threads can expand different vma's at the same time under read_lock(mmap_sem), so vma_lock_anon_vma() can't help to serialize "locked_vm += grow". Oleg. --- x/mm/mmap.c +++ x/mm/mmap.c @@ -2146,9 +2146,6 @@ static int acct_stack_growth(struct vm_a if (security_vm_enough_memory_mm(mm, grow)) return -ENOMEM; - /* Ok, everything looks good - let it rip */ - if (vma->vm_flags & VM_LOCKED) - mm->locked_vm += grow; vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); return 0; } @@ -2210,6 +2207,8 @@ int expand_upwards(struct vm_area_struct * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_end = address; anon_vma_interval_tree_post_update_vma(vma); @@ -2281,6 +2280,8 @@ int expand_downwards(struct vm_area_stru * against concurrent vma expansions. */ spin_lock(&vma->vm_mm->page_table_lock); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm += grow; anon_vma_interval_tree_pre_update_vma(vma); vma->vm_start = address; vma->vm_pgoff -= grow;
Re: Multiple potential races on vma->vm_flags
On 09/15/2015 01:36 PM, Sasha Levin wrote: > On 09/11/2015 09:27 PM, Hugh Dickins wrote: >> > I'm inclined to echo Vlastimil's comment from earlier in the thread: >> > sounds like an overkill, unless we find something more serious than this. > I've modified my tests to stress the exit path of processes with many vmas, > and hit the following NULL ptr deref (not sure if it's related to the > original issue): > > [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory > accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN > [1181047.937223] Modules linked in: > [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted > 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 > [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: > 880433f0 > [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) > [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 > [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: > > [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: > 04c8 > [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: > 15c59e01 > [1181047.944604] R10: R11: 0001 R12: > > [1181047.944833] pps pps0: PPS event at 21837.866101174 > [1181047.944838] pps pps0: capture assert seq #7188 > [1181047.946261] R13: R14: 880433f07910 R15: > 2e0d > [1181047.947005] FS: () GS:88025200() > knlGS: > [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b > [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: > 06a0 > [1181047.949085] Stack: > [1181047.949350] 880433f07910 1100867e0f1e > dc00 > [1181047.950164] 8801d825d000 2e0d 880433f079d0 > 9276c4ab > [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 > 9276c2a0 > [1181047.951906] Call Trace: > [1181047.952201] exit_mmap (mm/mmap.c:2856) > [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) > [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.954281] ? 
rcu_read_lock_sched_held (kernel/rcu/update.c:109) > [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 > mm/slub.c:2746) > [1181047.955535] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 > include/linux/sched.h:2557 mm/huge_memory.c:2169) > [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 > kernel/fork.c:702) > [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 > include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) > [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) > [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) > [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) > [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) > [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) > [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 > kernel/exit.c:859) > [1181047.965569] get_signal (kernel/signal.c:2353) > [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) > [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 > fs/read_write.c:821) > [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) > [1181047.969126] ? vfs_write (fs/read_write.c:777) > [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) > [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) > [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 > include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) > [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 > kernel/time/itimer.c:227) > [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) > [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) > [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) > [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) > [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) > [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 > c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 > <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 > All code > >0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) >6: 85 22 test %esp,(%rdx) >8: 01 00
Re: Multiple potential races on vma->vm_flags
On Wed, 23 Sep 2015, Kirill A. Shutemov wrote: On Tue, Sep 22, 2015 at 06:39:52PM -0700, Hugh Dickins wrote: [...] I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? Yeah, I don't think this was ever resolved, but ultimately the patch got reverted[1] because it exposed issues in the form of bad pages (shmem, vmsplice) and corrupted vm_flags while in untrack_pfn() causing, for example, vm_file to disappear. I hoped Davidlohr would come back with something after the revert, but it never happened. I think the reverted patch was responsible for most of the scalability boost from rwsem for i_mmap_lock... Actually no, the change that got reverted was something we got in at the very last minute, just because it made sense and had the blessing of some key people. The main winner of the series was migration (rmap), which later Hugh addressed more specifically for unmapped pages: https://lkml.org/lkml/2014/11/30/349 So I really didn't care about the reverted patch, and therefore it was never on my radar. [1] https://lkml.org/lkml/2014/12/22/375 Thanks, Davidlohr
Re: Multiple potential races on vma->vm_flags
On 09/23/2015 09:08 AM, Andrey Konovalov wrote: > On Wed, Sep 23, 2015 at 3:39 AM, Hugh Dickins wrote: >> > This is totally untested, and one of you may quickly prove me wrong; >> > but I went in to fix your "Bad page state (mlocked)" by holding pte >> > lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), >> > then couldn't see why it would need mmap_sem at all, given how mlock >> > and munlock first assert intention by setting or clearing VM_LOCKED >> > in vm_flags, then work their way up the vma, taking pte locks. >> > >> > Calling mlock_vma_page() under pte lock may look suspicious >> > at first: but what it does is similar to clear_page_mlock(), >> > which we regularly call under pte lock from page_remove_rmap(). >> > >> > I'd rather wait to hear whether this appears to work in practice, >> > and whether you agree that it should work in theory, before writing >> > the proper description. I'd love to lose that down_read_trylock. > No, unfortunately it doesn't work, I still see "Bad page state (mlocked)". > > It seems that your patch doesn't fix the race from the report below, since pte > lock is not taken when 'vma->vm_flags &= ~VM_LOCKED;' (mlock.c:425) > is being executed. (Line numbers are from a kernel with your patch applied.) I've fired up my HZ_1 patch, and this seems to be a real race that is somewhat easy to reproduce under those conditions. 
Here's a fresh backtrace from my VMs: [1935109.882343] BUG: Bad page state in process trinity-subchil pfn:3ca200 [1935109.884000] page:ea000f288000 count:0 mapcount:0 mapping: (null) index:0x1e00 compound_mapcount: 0 [1935109.885772] flags: 0x22f80144008(uptodate|head|swapbacked|mlocked) [1935109.887174] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [1935109.888197] bad because of flags: [1935109.888759] flags: 0x10(mlocked) [1935109.889525] Modules linked in: [1935109.890165] CPU: 8 PID: 2615 Comm: trinity-subchil Not tainted 4.3.0-rc2-next-20150923-sasha-00079-gec04207-dirty #2569 [1935109.891876] 16445448 e5dca494 8803f7657708 a70402da [1935109.893504] ea000f288000 8803f7657738 a56e522b 022f80144008 [1935109.894947] ea000f288020 ea000f288000 8803f76577a8 [1935109.896413] Call Trace: [1935109.899102] [] dump_stack+0x4e/0x84 [1935109.899821] [] bad_page+0x17b/0x210 [1935109.900469] [] free_pages_prepare+0xb48/0x1110 [1935109.902127] [] __free_pages_ok+0x21/0x260 [1935109.904435] [] free_compound_page+0x63/0x80 [1935109.905614] [] free_transhuge_page+0x6e/0x80 [1935109.906752] [] __put_compound_page+0x76/0xa0 [1935109.907884] [] release_pages+0x4d5/0x9f0 [1935109.913027] [] tlb_flush_mmu_free+0x8a/0x120 [1935109.913957] [] unmap_page_range+0xe73/0x1460 [1935109.915737] [] unmap_single_vma+0x126/0x2f0 [1935109.916646] [] unmap_vmas+0xdd/0x190 [1935109.917454] [] exit_mmap+0x221/0x430 [1935109.921176] [] mmput+0xb1/0x240 [1935109.921919] [] do_exit+0x732/0x27c0 [1935109.928561] [] do_group_exit+0xf9/0x300 [1935109.929786] [] SyS_exit_group+0x1d/0x20 [1935109.930617] [] entry_SYSCALL_64_fastpath+0x16/0x7a Thanks, Sasha
Re: Multiple potential races on vma->vm_flags
On Tue, 22 Sep 2015, Andrey Konovalov wrote: > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins wrote: > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > >> If anybody comes up with a patch to fix the original issue I easily > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > >> fuzzing with KTSAN and Trinity. > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > your report of it: please repost - thanks - though on seeing it, I may > > well end up with no ideas. > > The report is below. Thanks. > > I get it after a few seconds of running Trinity on a kernel with KTSAN > and targeting mlock, munlock and madvise syscalls. > Sasha also observed a very similar crash a while ago > (https://lkml.org/lkml/2014/11/6/1055). > I didn't manage to reproduce this in a kernel build without KTSAN though. > The idea was that data races KTSAN reports might be the explanation of > these crashes. > > BUG: Bad page state in process trinity-c15 pfn:281999 > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > bad because of flags: > flags: 0x20(mlocked) > Modules linked in: > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 821c3b70 00014741 8800b857f948 > 81e9926c 0003 ea000a066640 8800b857f978 > 811ce045 821c3b70 ea000a066640 0001 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > [< inline >] free_pages_check mm/page_alloc.c:731 > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > [] free_hot_cold_page_list+0x5f/0x100 > mm/page_alloc.c:1956 (discriminator 3) > [] release_pages+0x151/0x300 mm/swap.c:967 > [] __pagevec_release+0x43/0x60 
mm/swap.c:984 > [< inline >] pagevec_release include/linux/pagevec.h:69 > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > [< inline >] madvise_remove mm/madvise.c:326 > [< inline >] madvise_vma mm/madvise.c:378 > [< inline >] SYSC_madvise mm/madvise.c:528 > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > [] ? kt_atomic64_store+0x76/0x130 > mm/ktsan/sync_atomic.c:161 > [] entry_SYSCALL_64_fastpath+0x31/0x95 > arch/x86/entry/entry_64.S:188 > Disabling lock debugging due to kernel taint This is totally untested, and one of you may quickly prove me wrong; but I went in to fix your "Bad page state (mlocked)" by holding pte lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), then couldn't see why it would need mmap_sem at all, given how mlock and munlock first assert intention by setting or clearing VM_LOCKED in vm_flags, then work their way up the vma, taking pte locks. Calling mlock_vma_page() under pte lock may look suspicious at first: but what it does is similar to clear_page_mlock(), which we regularly call under pte lock from page_remove_rmap(). I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? 
Not-yet-Signed-off-by: Hugh Dickins --- mm/rmap.c | 32 +++- 1 file changed, 7 insertions(+), 25 deletions(-) --- 4.3-rc2/mm/rmap.c 2015-09-12 18:30:20.857039763 -0700 +++ linux/mm/rmap.c 2015-09-22 17:47:43.489096676 -0700 @@ -1314,9 +1314,12 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!(flags & TTU_IGNORE_MLOCK)) { - if (vma->vm_flags & VM_LOCKED) - goto out_mlock; - + if (vma->vm_flags & VM_LOCKED) { + /* Holding pte lock, we do *not* need mmap_sem here */ + mlock_vma_page(page); + ret = SWAP_MLOCK; + goto out_unmap; + } if (flags & TTU_MUNLOCK) goto
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickins wrote: > On Tue, 22 Sep 2015, Andrey Konovalov wrote: >> If anybody comes up with a patch to fix the original issue I easily >> can test it, since I'm hitting "BUG: Bad page state" in a second when >> fuzzing with KTSAN and Trinity. > > This "BUG: Bad page state" sounds more serious, but I cannot track down > your report of it: please repost - thanks - though on seeing it, I may > well end up with no ideas. The report is below. I get it after a few seconds of running Trinity on a kernel with KTSAN and targeting mlock, munlock and madvise syscalls. Sasha also observed a very similar crash a while ago (https://lkml.org/lkml/2014/11/6/1055). I didn't manage to reproduce this in a kernel build without KTSAN though. The idea was that data races KTSAN reports might be the explanation of these crashes. BUG: Bad page state in process trinity-c15 pfn:281999 page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x20(mlocked) Modules linked in: CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 821c3b70 00014741 8800b857f948 81e9926c 0003 ea000a066640 8800b857f978 811ce045 821c3b70 ea000a066640 0001 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0x63/0x81 lib/dump_stack.c:50 [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 [< inline >] free_pages_check mm/page_alloc.c:731 [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 [] free_hot_cold_page_list+0x5f/0x100 mm/page_alloc.c:1956 (discriminator 3) [] release_pages+0x151/0x300 mm/swap.c:967 [] __pagevec_release+0x43/0x60 mm/swap.c:984 [< inline >] pagevec_release include/linux/pagevec.h:69 [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 [] 
shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 [< inline >] madvise_remove mm/madvise.c:326 [< inline >] madvise_vma mm/madvise.c:378 [< inline >] SYSC_madvise mm/madvise.c:528 [] SyS_madvise+0x378/0x760 mm/madvise.c:459 [] ? kt_atomic64_store+0x76/0x130 mm/ktsan/sync_atomic.c:161 [] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188 Disabling lock debugging due to kernel taint > > Hugh
Re: Multiple potential races on vma->vm_flags
On Tue, 22 Sep 2015, Andrey Konovalov wrote: > If anybody comes up with a patch to fix the original issue I easily > can test it, since I'm hitting "BUG: Bad page state" in a second when > fuzzing with KTSAN and Trinity. This "BUG: Bad page state" sounds more serious, but I cannot track down your report of it: please repost - thanks - though on seeing it, I may well end up with no ideas. Hugh
Re: Multiple potential races on vma->vm_flags
If anybody comes up with a patch to fix the original issue I easily can test it, since I'm hitting "BUG: Bad page state" in a second when fuzzing with KTSAN and Trinity. On Tue, Sep 15, 2015 at 9:01 PM, Kirill A. Shutemov wrote: > On Tue, Sep 15, 2015 at 01:36:45PM -0400, Sasha Levin wrote: >> On 09/11/2015 09:27 PM, Hugh Dickins wrote: >> > I'm inclined to echo Vlastimil's comment from earlier in the thread: >> > sounds like an overkill, unless we find something more serious than this. >> >> I've modified my tests to stress the exit path of processes with many vmas, > > Could you share the test? > >> and hit the following NULL ptr deref (not sure if it's related to the >> original issue): >> >> [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory >> access general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN >> [1181047.937223] Modules linked in: >> [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted >> 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 >> [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: >> 880433f0 >> [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) > > Is it "struct mm_struct *mm = vma->vm_mm;"? > > -- > Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On Tue, 22 Sep 2015, Andrey Konovalov wrote: > On Tue, Sep 22, 2015 at 8:54 PM, Hugh Dickinswrote: > > On Tue, 22 Sep 2015, Andrey Konovalov wrote: > >> If anybody comes up with a patch to fix the original issue I easily > >> can test it, since I'm hitting "BUG: Bad page state" in a second when > >> fuzzing with KTSAN and Trinity. > > > > This "BUG: Bad page state" sounds more serious, but I cannot track down > > your report of it: please repost - thanks - though on seeing it, I may > > well end up with no ideas. > > The report is below. Thanks. > > I get it after a few seconds of running Trinity on a kernel with KTSAN > and targeting mlock, munlock and madvise syscalls. > Sasha also observed a very similar crash a while ago > (https://lkml.org/lkml/2014/11/6/1055). > I didn't manage to reproduce this in a kernel build without KTSAN though. > The idea was that data races KTSAN reports might be the explanation of > these crashes. > > BUG: Bad page state in process trinity-c15 pfn:281999 > page:ea000a066640 count:0 mapcount:0 mapping: (null) index:0xd > flags: 0x228000c(referenced|uptodate|swapbacked|mlocked) > page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > bad because of flags: > flags: 0x20(mlocked) > Modules linked in: > CPU: 3 PID: 11190 Comm: trinity-c15 Not tainted 4.2.0-tsan #1295 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 821c3b70 00014741 8800b857f948 > 81e9926c 0003 ea000a066640 8800b857f978 > 811ce045 821c3b70 ea000a066640 0001 > Call Trace: > [< inline >] __dump_stack lib/dump_stack.c:15 > [] dump_stack+0x63/0x81 lib/dump_stack.c:50 > [] bad_page+0x115/0x1a0 mm/page_alloc.c:409 > [< inline >] free_pages_check mm/page_alloc.c:731 > [] free_pages_prepare+0x2f8/0x330 mm/page_alloc.c:922 > [] free_hot_cold_page+0x51/0x2b0 mm/page_alloc.c:1908 > [] free_hot_cold_page_list+0x5f/0x100 > mm/page_alloc.c:1956 (discriminator 3) > [] release_pages+0x151/0x300 mm/swap.c:967 > [] __pagevec_release+0x43/0x60 
mm/swap.c:984 > [< inline >] pagevec_release include/linux/pagevec.h:69 > [] shmem_undo_range+0x4fa/0x9d0 mm/shmem.c:446 > [] shmem_truncate_range+0x2f/0x60 mm/shmem.c:540 > [] shmem_fallocate+0x555/0x6e0 mm/shmem.c:2086 > [] vfs_fallocate+0x1e0/0x310 fs/open.c:303 > [< inline >] madvise_remove mm/madvise.c:326 > [< inline >] madvise_vma mm/madvise.c:378 > [< inline >] SYSC_madvise mm/madvise.c:528 > [] SyS_madvise+0x378/0x760 mm/madvise.c:459 > [] ? kt_atomic64_store+0x76/0x130 > mm/ktsan/sync_atomic.c:161 > [] entry_SYSCALL_64_fastpath+0x31/0x95 > arch/x86/entry/entry_64.S:188 > Disabling lock debugging due to kernel taint This is totally untested, and one of you may quickly prove me wrong; but I went in to fix your "Bad page state (mlocked)" by holding pte lock across the down_read_trylock of mmap_sem in try_to_unmap_one(), then couldn't see why it would need mmap_sem at all, given how mlock and munlock first assert intention by setting or clearing VM_LOCKED in vm_flags, then work their way up the vma, taking pte locks. Calling mlock_vma_page() under pte lock may look suspicious at first: but what it does is similar to clear_page_mlock(), which we regularly call under pte lock from page_remove_rmap(). I'd rather wait to hear whether this appears to work in practice, and whether you agree that it should work in theory, before writing the proper description. I'd love to lose that down_read_trylock. You mention how Sasha hit the "Bad page state (mlocked)" back in November: that was one of the reasons we reverted Davidlohr's i_mmap_lock_read to i_mmap_lock_write in unmap_mapping_range(), without understanding why it was needed. Yes, it would lock out a concurrent try_to_unmap(), whose setting of PageMlocked was not sufficiently serialized by the down_read_trylock of mmap_sem. But I don't remember the other reasons for that revert (and haven't looked very hard as yet): anyone else remember? 
Not-yet-Signed-off-by: Hugh Dickins
---
 mm/rmap.c | 32 +++-
 1 file changed, 7 insertions(+), 25 deletions(-)

--- 4.3-rc2/mm/rmap.c	2015-09-12 18:30:20.857039763 -0700
+++ linux/mm/rmap.c	2015-09-22 17:47:43.489096676 -0700
@@ -1314,9 +1314,12 @@ static int try_to_unmap_one(struct page
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!(flags & TTU_IGNORE_MLOCK)) {
-		if (vma->vm_flags & VM_LOCKED)
-			goto out_mlock;
-
+		if (vma->vm_flags & VM_LOCKED) {
+			/* Holding pte lock, we do *not* need mmap_sem here */
+			mlock_vma_page(page);
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
 		if (flags &
Re: Multiple potential races on vma->vm_flags
On Tue, Sep 15, 2015 at 01:36:45PM -0400, Sasha Levin wrote: > On 09/11/2015 09:27 PM, Hugh Dickins wrote: > > I'm inclined to echo Vlastimil's comment from earlier in the thread: > > sounds like an overkill, unless we find something more serious than this. > > I've modified my tests to stress the exit path of processes with many vmas, Could you share the test? > and hit the following NULL ptr deref (not sure if it's related to the > original issue): > > [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory > accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN > [1181047.937223] Modules linked in: > [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted > 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 > [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: > 880433f0 > [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) Is it "struct mm_struct *mm = vma->vm_mm;"? -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 09:27 PM, Hugh Dickins wrote: > I'm inclined to echo Vlastimil's comment from earlier in the thread: > sounds like an overkill, unless we find something more serious than this. I've modified my tests to stress the exit path of processes with many vmas, and hit the following NULL ptr deref (not sure if it's related to the original issue): [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [1181047.937223] Modules linked in: [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0 [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: 04c8 [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: 15c59e01 [1181047.944604] R10: R11: 0001 R12: [1181047.944833] pps pps0: PPS event at 21837.866101174 [1181047.944838] pps pps0: capture assert seq #7188 [1181047.946261] R13: R14: 880433f07910 R15: 2e0d [1181047.947005] FS: () GS:88025200() knlGS: [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: 06a0 [1181047.949085] Stack: [1181047.949350] 880433f07910 1100867e0f1e dc00 [1181047.950164] 8801d825d000 2e0d 880433f079d0 9276c4ab [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 9276c2a0 [1181047.951906] Call Trace: [1181047.952201] exit_mmap (mm/mmap.c:2856) [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.954281] ? rcu_read_lock_sched_held (kernel/rcu/update.c:109) [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 mm/slub.c:2746) [1181047.955535] ? 
__khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 kernel/fork.c:702) [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859) [1181047.965569] get_signal (kernel/signal.c:2353) [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821) [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) [1181047.969126] ? vfs_write (fs/read_write.c:777) [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227) [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 All code 0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) 6: 85 22 test %esp,(%rdx) 8: 01 00 add%eax,(%rax) a: 00 48 8badd%cl,-0x75(%rax) d: 43 rex.XB e: 40 rex f: 48 8d b8 c8 04 00 00lea0x4c8(%rax),%rdi
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 09:27 PM, Hugh Dickins wrote: > I'm inclined to echo Vlastimil's comment from earlier in the thread: > sounds like an overkill, unless we find something more serious than this. I've modified my tests to stress the exit path of processes with many vmas, and hit the following NULL ptr deref (not sure if it's related to the original issue): [1181047.935563] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [1181047.937223] Modules linked in: [1181047.937772] CPU: 4 PID: 21912 Comm: trinity-c341 Not tainted 4.3.0-rc1-next-20150914-sasha-00043-geddd763-dirty #2554 [1181047.939387] task: 8804195c8000 ti: 880433f0 task.ti: 880433f0 [1181047.940533] RIP: unmap_vmas (mm/memory.c:1337) [1181047.941842] RSP: :880433f078a8 EFLAGS: 00010206 [1181047.942383] RAX: dc00 RBX: 88041acd000a RCX: [1181047.943091] RDX: 0099 RSI: 88041acd000a RDI: 04c8 [1181047.943889] RBP: 880433f078d8 R08: 880415c59c58 R09: 15c59e01 [1181047.944604] R10: R11: 0001 R12: [1181047.944833] pps pps0: PPS event at 21837.866101174 [1181047.944838] pps pps0: capture assert seq #7188 [1181047.946261] R13: R14: 880433f07910 R15: 2e0d [1181047.947005] FS: () GS:88025200() knlGS: [1181047.947779] CS: 0010 DS: ES: CR0: 8005003b [1181047.948361] CR2: 0097df90 CR3: 00044e08c000 CR4: 06a0 [1181047.949085] Stack: [1181047.949350] 880433f07910 1100867e0f1e dc00 [1181047.950164] 8801d825d000 2e0d 880433f079d0 9276c4ab [1181047.951070] 88041acd000a 41b58ab3 9ecd1a43 9276c2a0 [1181047.951906] Call Trace: [1181047.952201] exit_mmap (mm/mmap.c:2856) [1181047.952751] ? SyS_remap_file_pages (mm/mmap.c:2826) [1181047.953633] ? __khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.954281] ? rcu_read_lock_sched_held (kernel/rcu/update.c:109) [1181047.954936] ? kmem_cache_free (include/trace/events/kmem.h:143 mm/slub.c:2746) [1181047.955535] ? 
__khugepaged_exit (./arch/x86/include/asm/atomic.h:118 include/linux/sched.h:2557 mm/huge_memory.c:2169) [1181047.956204] mmput (include/linux/compiler.h:207 kernel/fork.c:735 kernel/fork.c:702) [1181047.956691] do_exit (./arch/x86/include/asm/bitops.h:311 include/linux/thread_info.h:91 kernel/exit.c:438 kernel/exit.c:733) [1181047.957241] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.958005] ? mm_update_next_owner (kernel/exit.c:654) [1181047.959007] ? debug_smp_processor_id (lib/smp_processor_id.c:57) [1181047.959995] ? get_lock_stats (kernel/locking/lockdep.c:249) [1181047.960885] ? lockdep_init (kernel/locking/lockdep.c:3298) [1181047.961438] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.962573] ? lock_release (kernel/locking/lockdep.c:3641) [1181047.963488] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.964704] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:859) [1181047.965569] get_signal (kernel/signal.c:2353) [1181047.966430] do_signal (arch/x86/kernel/signal.c:709) [1181047.967241] ? do_readv_writev (include/linux/fsnotify.h:223 fs/read_write.c:821) [1181047.968169] ? v9fs_file_lock_dotl (fs/9p/vfs_file.c:407) [1181047.969126] ? vfs_write (fs/read_write.c:777) [1181047.969955] ? setup_sigcontext (arch/x86/kernel/signal.c:706) [1181047.970916] ? __raw_callee_save___pv_queued_spin_unlock (??:?) [1181047.972139] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.973489] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:171 kernel/locking/spinlock.c:199) [1181047.974160] ? do_setitimer (include/linux/spinlock.h:357 kernel/time/itimer.c:227) [1181047.974818] ? check_preemption_disabled (lib/smp_processor_id.c:18) [1181047.975480] ? 
__this_cpu_preempt_check (lib/smp_processor_id.c:63) [1181047.976142] prepare_exit_to_usermode (arch/x86/entry/common.c:251) [1181047.976784] syscall_return_slowpath (arch/x86/entry/common.c:318) [1181047.977473] int_ret_from_sys_call (arch/x86/entry/entry_64.S:282) [1181047.978116] Code: 08 80 3c 02 00 0f 85 22 01 00 00 48 8b 43 40 48 8d b8 c8 04 00 00 48 89 45 d0 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 ee 00 00 00 48 8b 45 d0 48 83 b8 c8 04 00 00 All code 0: 08 80 3c 02 00 0f or %al,0xf00023c(%rax) 6: 85 22 test %esp,(%rdx) 8: 01 00 add%eax,(%rax) a: 00 48 8badd%cl,-0x75(%rax) d: 43 rex.XB e: 40 rex f: 48 8d b8 c8 04 00 00lea0x4c8(%rax),%rdi
Re: Multiple potential races on vma->vm_flags
On Fri, Sep 11, 2015 at 06:27:14PM -0700, Hugh Dickins wrote:
> On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> > On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > > Can a vma be shared among a few mm's?
> >
> > Define "shared".
> >
> > vma can belong only to one process (mm_struct), but it can be accessed
> > from other process like in rmap case below.
> >
> > rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> > sure that the vma will not disappear under it.
> >
> > > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
> >
> > Depends on what protection you are talking about.
> >
> > > In the first report below both T378 and T398 take
> > > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > > different locks (the addresses are different).
> >
> > See i_mmap_lock_read() in T398. It will guarantee that vma is there.
> >
> > > In the second report T309 doesn't take any locks at all, since it
> > > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > > has no other users, but then it does a write to vma.
> >
> > This one is tricky. I *assume* the mm cannot be generally accessible after
> > mm_users drops to zero, but I'm not entirely sure about it.
> > procfs? ptrace?
>
> Most of the things (including procfs and ptrace) that need to work on
> a foreign mm do take a hold on mm_users with get_task_mm(). swapoff
> uses atomic_inc_not_zero(&mm->mm_users). In KSM I managed to get away
> with just a hold on the structure itself, atomic_inc(&mm->mm_count),
> and a check for mm_users 0 wherever it down_reads mmap_sem (but Andrey
> might like to turn KSM on: it wouldn't be entirely shocking if he were
> to discover an anomaly from that).
>
> > The VMA is still accessible via rmap at this point. And I think it can be
> > a problem:
> >
> > 	CPU0				CPU1
> > 	exit_mmap()
> > 	  // mmap_sem is *not* taken
> > 	  munlock_vma_pages_all()
> > 	    munlock_vma_pages_range()
> > 					try_to_unmap_one()
> > 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> > 					  !!(vma->vm_flags & VM_LOCKED) == true
> > 	      vma->vm_flags &= ~VM_LOCKED;
> > 					  mlock_vma_page(page);
> > 					  // mlocked pages is leaked.
> >
> > The obvious solution is to take mmap_sem in exit path, but it would cause
> > performance regression.
> >
> > Any comments?
>
> I'm inclined to echo Vlastimil's comment from earlier in the thread:
> sounds like an overkill, unless we find something more serious than this.
>
> I'm not sure whether we'd actually see a regression from taking mmap_sem
> in exit path; but given that it's mmap_sem, yes, history tells us please
> not to take it any more than we have to.
>
> I do remember wishing, when working out KSM's mm handling, that exit took
> mmap_sem: it would have made it simpler, but that wasn't a change I dared
> to make.
>
> Maybe an mm_users 0 check after down_read_trylock in try_to_unmap_one()
> could fix it?

I don't see how. It would shift a picture, but doesn't fix it: exit_mmap()
can happen after down_read_trylock() and mm_users check. We would only hide
the problem.

> But if we were to make a bigger change for this VM_LOCKED issue, and
> something more serious makes it worth all the effort, I'd say that
> what needs to be done is to give mlock/munlock proper locking (haha).
>
> I have not yet looked at your mlocked THP patch (sorry), but when I
> was doing the same thing for huge tmpfs, what made it so surprisingly
> difficult was all the spongy trylocking, which concealed the rules.
>
> Maybe I'm completely wrong, but I thought a lot of awkwardness might
> disappear if they were relying on anon_vma->rwsem and i_mmap_rwsem
> throughout instead of mmap_sem.

This can be helpful. But the risk is getting scalability regression on
other front: long anon_vma chain or highly shared files.
--
Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?

Most of the things (including procfs and ptrace) that need to work on
a foreign mm do take a hold on mm_users with get_task_mm(). swapoff
uses atomic_inc_not_zero(&mm->mm_users). In KSM I managed to get away
with just a hold on the structure itself, atomic_inc(&mm->mm_count),
and a check for mm_users 0 wherever it down_reads mmap_sem (but Andrey
might like to turn KSM on: it wouldn't be entirely shocking if he were
to discover an anomaly from that).

> The VMA is still accessible via rmap at this point.
> And I think it can be a problem:
>
> 	CPU0				CPU1
> 	exit_mmap()
> 	  // mmap_sem is *not* taken
> 	  munlock_vma_pages_all()
> 	    munlock_vma_pages_range()
> 					try_to_unmap_one()
> 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> 					  !!(vma->vm_flags & VM_LOCKED) == true
> 	      vma->vm_flags &= ~VM_LOCKED;
> 					  mlock_vma_page(page);
> 					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

I'm inclined to echo Vlastimil's comment from earlier in the thread:
sounds like an overkill, unless we find something more serious than this.

I'm not sure whether we'd actually see a regression from taking mmap_sem
in exit path; but given that it's mmap_sem, yes, history tells us please
not to take it any more than we have to.

I do remember wishing, when working out KSM's mm handling, that exit took
mmap_sem: it would have made it simpler, but that wasn't a change I dared
to make.

Maybe an mm_users 0 check after down_read_trylock in try_to_unmap_one()
could fix it?

But if we were to make a bigger change for this VM_LOCKED issue, and
something more serious makes it worth all the effort, I'd say that
what needs to be done is to give mlock/munlock proper locking (haha).

I have not yet looked at your mlocked THP patch (sorry), but when I
was doing the same thing for huge tmpfs, what made it so surprisingly
difficult was all the spongy trylocking, which concealed the rules.

Maybe I'm completely wrong, but I thought a lot of awkwardness might
disappear if they were relying on anon_vma->rwsem and i_mmap_rwsem
throughout instead of mmap_sem.

Hugh
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 05:29 PM, Vlastimil Babka wrote:
> On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> > On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > > Can a vma be shared among a few mm's?
> >
> > Define "shared".
> >
> > vma can belong only to one process (mm_struct), but it can be accessed
> > from other process like in rmap case below.
> >
> > rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> > sure that the vma will not disappear under it.
> >
> > > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
> >
> > Depends on what protection you are talking about.
> >
> > > In the first report below both T378 and T398 take
> > > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > > different locks (the addresses are different).
> >
> > See i_mmap_lock_read() in T398. It will guarantee that vma is there.
> >
> > > In the second report T309 doesn't take any locks at all, since it
> > > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > > has no other users, but then it does a write to vma.
> >
> > This one is tricky. I *assume* the mm cannot be generally accessible after
> > mm_users drops to zero, but I'm not entirely sure about it.
> > procfs? ptrace?
> >
> > The VMA is still accessible via rmap at this point. And I think it can be
> > a problem:
> >
> > 	CPU0				CPU1
> > 	exit_mmap()
> > 	  // mmap_sem is *not* taken
> > 	  munlock_vma_pages_all()
> > 	    munlock_vma_pages_range()
> > 					try_to_unmap_one()
> > 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> > 					  !!(vma->vm_flags & VM_LOCKED) == true
> > 	      vma->vm_flags &= ~VM_LOCKED;
> > 					  mlock_vma_page(page);
> > 					  // mlocked pages is leaked.
> >
> > The obvious solution is to take mmap_sem in exit path, but it would cause
> > performance regression.
> >
> > Any comments?
>
> Just so others don't repeat the paths that I already looked at:
>
> - First I thought that try_to_unmap_one() has the page locked and
> munlock_vma_pages_range() will also lock it... but it doesn't.

More precisely, it does (in __munlock_pagevec()), but
TestClearPageMlocked(page) doesn't happen under that lock.

> - Then I thought that exit_mmap() will revisit the page anyway doing actual
> unmap. It would, if it's the one who has the page mapped, it will clear the
> mlock (see page_remove_rmap()). If it's not the last one, page will be left
> locked. So it won't be completely leaked, but still, it will be mlocked when
> it shouldn't.
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?
>
> The VMA is still accessible via rmap at this point. And I think it can be
> a problem:
>
> 	CPU0				CPU1
> 	exit_mmap()
> 	  // mmap_sem is *not* taken
> 	  munlock_vma_pages_all()
> 	    munlock_vma_pages_range()
> 					try_to_unmap_one()
> 					  down_read_trylock(&vma->vm_mm->mmap_sem)
> 					  !!(vma->vm_flags & VM_LOCKED) == true
> 	      vma->vm_flags &= ~VM_LOCKED;
> 					  mlock_vma_page(page);
> 					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

Just so others don't repeat the paths that I already looked at:

- First I thought that try_to_unmap_one() has the page locked and
munlock_vma_pages_range() will also lock it... but it doesn't.

- Then I thought that exit_mmap() will revisit the page anyway doing actual
unmap. It would, if it's the one who has the page mapped, it will clear the
mlock (see page_remove_rmap()). If it's not the last one, page will be left
locked. So it won't be completely leaked, but still, it will be mlocked when
it shouldn't.
Re: Multiple potential races on vma->vm_flags
On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> Can a vma be shared among a few mm's?

Define "shared".

vma can belong only to one process (mm_struct), but it can be accessed
from other process like in rmap case below.

rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
sure that the vma will not disappear under it.

> If yes, then taking current->mm->mmap_sem to protect vma is not enough.

Depends on what protection you are talking about.

> In the first report below both T378 and T398 take
> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> different locks (the addresses are different).

See i_mmap_lock_read() in T398. It will guarantee that vma is there.

> In the second report T309 doesn't take any locks at all, since it
> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> has no other users, but then it does a write to vma.

This one is tricky. I *assume* the mm cannot be generally accessible after
mm_users drops to zero, but I'm not entirely sure about it.
procfs? ptrace?

The VMA is still accessible via rmap at this point. And I think it can be
a problem:

	CPU0				CPU1
	exit_mmap()
	  // mmap_sem is *not* taken
	  munlock_vma_pages_all()
	    munlock_vma_pages_range()
					try_to_unmap_one()
					  down_read_trylock(&vma->vm_mm->mmap_sem)
					  !!(vma->vm_flags & VM_LOCKED) == true
	      vma->vm_flags &= ~VM_LOCKED;
					  mlock_vma_page(page);
					  // mlocked pages is leaked.

The obvious solution is to take mmap_sem in exit path, but it would cause
performance regression.

Any comments?
> ==
> ThreadSanitizer: data-race in munlock_vma_pages_range
>
> Write of size 8 by thread T378 (K2633, CPU3):
>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Locks held by T378:
>  #0 Lock 25710428 taken here:
>   [< inlined>] SYSC_munlock mm/mlock.c:650
>   [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Previous read of size 8 by thread T398 (K2623, CPU2):
>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [< inlined>] rmap_walk_file mm/rmap.c:1540
>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Locks held by T398:
>  #0 Lock 21b00c68 taken here:
>   [< inlined>] SYSC_munlock mm/mlock.c:650
>   [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>  #1 Lock bac2d750 taken here:
>   [< inlined>] i_mmap_lock_read include/linux/fs.h:509
>   [< inlined>] rmap_walk_file mm/rmap.c:1533
>   [] rmap_walk+0x78/0x450 mm/rmap.c:1559
>   [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>   [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>   [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>   [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>   [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>   [] do_mlock+0x14c/0x180 mm/mlock.c:589
>   [< inlined>] SYSC_munlock mm/mlock.c:651
>   [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>  #2 Lock 0895f570 taken here:
>   [< inlined>] spin_lock include/linux/spinlock.h:312
>   [] __page_check_address+0xd9/0x210 mm/rmap.c:681
>   [< inlined>] page_check_address include/linux/rmap.h:204
>   [] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
>   [< inlined>] rmap_walk_file mm/rmap.c:1540
>   [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>   [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>   [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>   [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>   [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>   [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>   [] do_mlock+0x14c/0x180 mm/mlock.c:589
>   [< inlined>] SYSC_munlock mm/mlock.c:651
>   [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>   [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> DBG: addr: 880222610e10
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 05:29 PM, Vlastimil Babka wrote:
> On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
>> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
>>> Can a vma be shared among a few mm's?
>>
>> Define "shared".
>>
>> vma can belong only to one process (mm_struct), but it can be accessed
>> from other process like in rmap case below.
>>
>> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to
>> make sure that the vma will not disappear under it.
>>
>>> If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>>
>> Depends on what protection you are talking about.
>>
>>> In the first report below both T378 and T398 take
>>> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
>>> different locks (the addresses are different).
>>
>> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>>
>>> In the second report T309 doesn't take any locks at all, since it
>>> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
>>> has no other users, but then it does a write to vma.
>>
>> This one is tricky. I *assume* the mm cannot be generally accessible
>> after mm_users drops to zero, but I'm not entirely sure about it.
>> procfs? ptrace?
>>
>> The VMA is still accessible via rmap at this point. And I think it can
>> be a problem:
>>
>>	CPU0					CPU1
>> exit_mmap()
>>   // mmap_sem is *not* taken
>>   munlock_vma_pages_all()
>>     munlock_vma_pages_range()
>>					try_to_unmap_one()
>>					  down_read_trylock(&vma->vm_mm->mmap_sem)
>>					  !!(vma->vm_flags & VM_LOCKED) == true
>>       vma->vm_flags &= ~VM_LOCKED;
>>					  mlock_vma_page(page);
>>					  // mlocked pages is leaked.
>>
>> The obvious solution is to take mmap_sem in exit path, but it would
>> cause performance regression.
>>
>> Any comments?
>
> Just so others don't repeat the paths that I already looked at:
>
> - First I thought that try_to_unmap_one() has the page locked and
>   munlock_vma_pages_range() will also lock it... but it doesn't.

More precisely, it does (in __munlock_pagevec()), but
TestClearPageMlocked(page) doesn't happen under that lock.

> - Then I thought that exit_mmap() will revisit the page anyway doing
>   actual unmap. It would; if it's the one who has the page mapped, it
>   will clear the mlock (see page_remove_rmap()). If it's not the last
>   one, the page will be left locked.
>
> So it won't be completely leaked, but still, it will be mlocked when it
> shouldn't be.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Multiple potential races on vma->vm_flags
On Fri, 11 Sep 2015, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
> > Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to make
> sure that the vma will not disappear under it.
>
> > If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
> > In the first report below both T378 and T398 take
> > current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
> > different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
> > In the second report T309 doesn't take any locks at all, since it
> > assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
> > has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible after
> mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?

Most of the things (including procfs and ptrace) that need to work on a
foreign mm do take a hold on mm_users with get_task_mm(). swapoff uses
atomic_inc_not_zero(&mm->mm_users). In KSM I managed to get away with just
a hold on the structure itself, atomic_inc(&mm->mm_count), and a check for
mm_users > 0 wherever it down_reads mmap_sem (but Andrey might like to
turn KSM on: it wouldn't be entirely shocking if he were to discover an
anomaly from that).

> The VMA is still accessible via rmap at this point. And I think it can be
> a problem:
>
>	CPU0					CPU1
> exit_mmap()
>   // mmap_sem is *not* taken
>   munlock_vma_pages_all()
>     munlock_vma_pages_range()
>					try_to_unmap_one()
>					  down_read_trylock(&vma->vm_mm->mmap_sem)
>					  !!(vma->vm_flags & VM_LOCKED) == true
>       vma->vm_flags &= ~VM_LOCKED;
>					  mlock_vma_page(page);
>					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would cause
> performance regression.
>
> Any comments?

I'm inclined to echo Vlastimil's comment from earlier in the thread:
sounds like an overkill, unless we find something more serious than this.

I'm not sure whether we'd actually see a regression from taking mmap_sem
in the exit path; but given that it's mmap_sem, yes, history tells us
please not to take it any more than we have to. I do remember wishing,
when working out KSM's mm handling, that exit took mmap_sem: it would have
made it simpler, but that wasn't a change I dared to make.

Maybe an mm_users > 0 check after down_read_trylock in try_to_unmap_one()
could fix it?

But if we were to make a bigger change for this VM_LOCKED issue, and
something more serious makes it worth all the effort, I'd say that what
needs to be done is to give mlock/munlock proper locking (haha). I have
not yet looked at your mlocked THP patch (sorry), but when I was doing the
same thing for huge tmpfs, what made it so surprisingly difficult was all
the spongy trylocking, which concealed the rules. Maybe I'm completely
wrong, but I thought a lot of awkwardness might disappear if they were
relying on anon_vma->rwsem and i_mmap_rwsem throughout instead of
mmap_sem.

Hugh
Re: Multiple potential races on vma->vm_flags
On 09/11/2015 12:39 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2015 at 03:27:59PM +0200, Andrey Konovalov wrote:
>> Can a vma be shared among a few mm's?
>
> Define "shared".
>
> vma can belong only to one process (mm_struct), but it can be accessed
> from other process like in rmap case below.
>
> rmap uses anon_vma_lock for anon vma and i_mmap_rwsem for file vma to
> make sure that the vma will not disappear under it.
>
>> If yes, then taking current->mm->mmap_sem to protect vma is not enough.
>
> Depends on what protection you are talking about.
>
>> In the first report below both T378 and T398 take
>> current->mm->mmap_sem at mm/mlock.c:650, but they turn out to be
>> different locks (the addresses are different).
>
> See i_mmap_lock_read() in T398. It will guarantee that vma is there.
>
>> In the second report T309 doesn't take any locks at all, since it
>> assumes that after checking atomic_dec_and_test(&mm->mm_users) the mm
>> has no other users, but then it does a write to vma.
>
> This one is tricky. I *assume* the mm cannot be generally accessible
> after mm_users drops to zero, but I'm not entirely sure about it.
> procfs? ptrace?
>
> The VMA is still accessible via rmap at this point. And I think it can
> be a problem:
>
>	CPU0					CPU1
> exit_mmap()
>   // mmap_sem is *not* taken
>   munlock_vma_pages_all()
>     munlock_vma_pages_range()
>					try_to_unmap_one()
>					  down_read_trylock(&vma->vm_mm->mmap_sem)
>					  !!(vma->vm_flags & VM_LOCKED) == true
>       vma->vm_flags &= ~VM_LOCKED;
>					  mlock_vma_page(page);
>					  // mlocked pages is leaked.
>
> The obvious solution is to take mmap_sem in exit path, but it would
> cause performance regression.
>
> Any comments?

Just so others don't repeat the paths that I already looked at:

- First I thought that try_to_unmap_one() has the page locked and
  munlock_vma_pages_range() will also lock it... but it doesn't.

- Then I thought that exit_mmap() will revisit the page anyway doing
  actual unmap. It would; if it's the one who has the page mapped, it
  will clear the mlock (see page_remove_rmap()). If it's not the last
  one, the page will be left locked.

So it won't be completely leaked, but still, it will be mlocked when it
shouldn't be.
Re: Multiple potential races on vma->vm_flags
Can a vma be shared among a few mm's?

If yes, then taking current->mm->mmap_sem to protect vma is not enough.

In the first report below both T378 and T398 take current->mm->mmap_sem at
mm/mlock.c:650, but they turn out to be different locks (the addresses are
different).

In the second report T309 doesn't take any locks at all, since it assumes
that after checking atomic_dec_and_test(&mm->mm_users) the mm has no other
users, but then it does a write to vma.

==
ThreadSanitizer: data-race in munlock_vma_pages_range

Write of size 8 by thread T378 (K2633, CPU3):
 [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [] do_mlock+0x14c/0x180 mm/mlock.c:589
 [< inlined>] SYSC_munlock mm/mlock.c:651
 [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Locks held by T378:
 #0 Lock 25710428 taken here:
  [< inlined>] SYSC_munlock mm/mlock.c:650
  [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Previous read of size 8 by thread T398 (K2623, CPU2):
 [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
 [< inlined>] rmap_walk_file mm/rmap.c:1540
 [] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
 [] do_mlock+0x14c/0x180 mm/mlock.c:589
 [< inlined>] SYSC_munlock mm/mlock.c:651
 [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
 [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Locks held by T398:
 #0 Lock 21b00c68 taken here:
  [< inlined>] SYSC_munlock mm/mlock.c:650
  [] SyS_munlock+0x4c/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
 #1 Lock bac2d750 taken here:
  [< inlined>] i_mmap_lock_read include/linux/fs.h:509
  [< inlined>] rmap_walk_file mm/rmap.c:1533
  [] rmap_walk+0x78/0x450 mm/rmap.c:1559
  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
  [] do_mlock+0x14c/0x180 mm/mlock.c:589
  [< inlined>] SYSC_munlock mm/mlock.c:651
  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
 #2 Lock 0895f570 taken here:
  [< inlined>] spin_lock include/linux/spinlock.h:312
  [] __page_check_address+0xd9/0x210 mm/rmap.c:681
  [< inlined>] page_check_address include/linux/rmap.h:204
  [] try_to_unmap_one+0x53/0x4f0 mm/rmap.c:1198
  [< inlined>] rmap_walk_file mm/rmap.c:1540
  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
  [] do_mlock+0x14c/0x180 mm/mlock.c:589
  [< inlined>] SYSC_munlock mm/mlock.c:651
  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

DBG: addr: 880222610e10
DBG: first offset: 0, second offset: 0
DBG: T378 clock: {T378: 4486533, T398: 2405850}
DBG: T398 clock: {T398: 2406009}
==

==
ThreadSanitizer: data-race in munlock_vma_pages_range

Write of size 8 by thread T309 (K2577, CPU0):
 [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
 [< inlined>] munlock_vma_pages_all mm/internal.h:252
 [] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [] mmput+0x65/0x190 kernel/fork.c:708
 [< inlined>] exit_mm kernel/exit.c:437
 [] do_exit+0x457/0x1420 kernel/exit.c:733
 [] do_group_exit+0x7f/0x140 kernel/exit.c:874
 [< inlined>] SYSC_exit_group kernel/exit.c:885
 [] __wake_up_parent+0x0/0x50 kernel/exit.c:883
 [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Locks held by T309:

Previous read of size 8 by thread T293 (K2573, CPU3):
 [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
 [< inlined>] rmap_walk_file mm/rmap.c:1540
 [] rmap_walk+0x147/0x450 mm/rmap.c:1559
 [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
 [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
 [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
 [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
 [< inlined>] munlock_vma_pages_all mm/internal.h:252
 [] exit_mmap+0x163/0x190 mm/mmap.c:2824
 [] mmput+0x65/0x190 kernel/fork.c:708
 [< inlined>] exit_mm kernel/exit.c:437
 [] do_exit+0x457/0x1420 kernel/exit.c:733
 [] do_group_exit+0x7f/0x140 kernel/exit.c:874
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 09, 2015 at 08:58:26PM -0400, Sasha Levin wrote:
> On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> >> ==
> >> ThreadSanitizer: data-race in munlock_vma_pages_range
> >>
> >> Write of size 8 by thread T378 (K2633, CPU3):
> >>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> >>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> >>  [< inlined>] SYSC_munlock mm/mlock.c:651
> >>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > ...
> >
> >> Previous read of size 8 by thread T398 (K2623, CPU2):
> >>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> >>  [< inlined>] rmap_walk_file mm/rmap.c:1540
> >>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
> >>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> >>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> >>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> >>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> >>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> >>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> >>  [< inlined>] SYSC_munlock mm/mlock.c:651
> >>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> >>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > Okay, the detected race is mlock/munlock vs. rmap.
> >
> > On rmap side we check vma->vm_flags in few places without taking
> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold
> > i_mmap_rwsem or anon_vma_lock, but nothing prevent vma->vm_flags from
> > changing under us.
> >
> > In this particular case, speculative check in beginning of
> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later
> > in the function.
>
> So you're suggesting that this isn't the cause of the bad page flags
> error observed by Andrey and myself?
I don't see it, but who knows.

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>  [< inlined>] rmap_walk_file mm/rmap.c:1540
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On rmap side we check vma->vm_flags in few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> us.
>
> In this particular case, speculative check in beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.

So you're suggesting that this isn't the cause of the bad page flags
error observed by Andrey and myself?
Thanks,
Sasha
Re: Multiple potential races on vma->vm_flags
On Wed, Sep 09, 2015 at 05:27:16PM +0200, Vlastimil Babka wrote:
> On 09/07/2015 01:40 PM, Kirill A. Shutemov wrote:
> > On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> > > ==
> > > ThreadSanitizer: data-race in munlock_vma_pages_range
> > >
> > > Write of size 8 by thread T378 (K2633, CPU3):
> > >  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
> > >  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> > >  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> > >  [< inlined>] SYSC_munlock mm/mlock.c:651
> > >  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> > >  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > ...
> >
> > > Previous read of size 8 by thread T398 (K2623, CPU2):
> > >  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
> > >  [< inlined>] rmap_walk_file mm/rmap.c:1540
> > >  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
> > >  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
> > >  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
> > >  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
> > >  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
> > >  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
> > >  [] do_mlock+0x14c/0x180 mm/mlock.c:589
> > >  [< inlined>] SYSC_munlock mm/mlock.c:651
> > >  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
> > >  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
> >
> > Okay, the detected race is mlock/munlock vs. rmap.
> >
> > On rmap side we check vma->vm_flags in few places without taking
> > vma->vm_mm->mmap_sem. The vma cannot be freed since we hold
> > i_mmap_rwsem or anon_vma_lock, but nothing prevent vma->vm_flags from
> > changing under us.
> >
> > In this particular case, speculative check in beginning of
> > try_to_unmap_one() is fine, since we re-check it under mmap_sem later
> > in the function.
> >
> > False-negative is fine too here, since we will mlock the page in
> > __mm_populate() on mlock side after mlock_fixup().
> >
> > BUT.
> > We *must* have all speculative vm_flags accesses wrapped READ_ONCE() to
> > avoid all compiler trickery, like duplication of a vm_flags access with
> > inconsistent results.
>
> Doesn't taking a semaphore, as in try_to_unmap_one(), already imply a
> compiler barrier forcing vm_flags to be re-read?

Yes, but it doesn't prevent the compiler from generating multiple reads
from vma->vm_flags, and it may blow up if the two values don't match.

> > I looked only on VM_LOCKED checks, but there are few other flags
> > checked in rmap. All of them must be handled carefully. At least
> > READ_ONCE() is required.
> >
> > Other solution would be to introduce per-vma spinlock to protect
> > vma->vm_flags and probably other vma fields and offload this duty
> > from mmap_sem.
> > But that's much bigger project.
>
> Sounds like an overkill, unless we find something more serious than this.

May be...

--
 Kirill A. Shutemov
Re: Multiple potential races on vma->vm_flags
On 09/07/2015 01:40 PM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>  [< inlined>] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On rmap side we check vma->vm_flags in few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> us.
>
> In this particular case, speculative check in beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.
>
> False-negative is fine too here, since we will mlock the page in
> __mm_populate() on mlock side after mlock_fixup().
>
> BUT.
>
> We *must* have all speculative vm_flags accesses wrapped READ_ONCE() to
> avoid all compiler trickery, like duplication vm_flags access with
> inconsistent results.

Doesn't taking a semaphore, as in try_to_unmap_one(), already imply a
compiler barrier forcing vm_flags to be re-read?

> I looked only on VM_LOCKED checks, but there are few other flags checked
> in rmap. All of them must be handled carefully. At least READ_ONCE() is
> required.
>
> Other solution would be to introduce per-vma spinlock to protect
> vma->vm_flags and probably other vma fields and offload this duty
> from mmap_sem.
> But that's much bigger project.

Sounds like an overkill, unless we find something more serious than this.
Re: Multiple potential races on vma->vm_flags
On 09/07/2015 07:40 AM, Kirill A. Shutemov wrote:
> On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
>> ==
>> ThreadSanitizer: data-race in munlock_vma_pages_range
>>
>> Write of size 8 by thread T378 (K2633, CPU3):
>>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> ...
>
>> Previous read of size 8 by thread T398 (K2623, CPU2):
>>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>>  [< inlined>] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
>
> Okay, the detected race is mlock/munlock vs. rmap.
>
> On rmap side we check vma->vm_flags in few places without taking
> vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
> or anon_vma_lock, but nothing prevent vma->vm_flags from changing under
> us.
>
> In this particular case, speculative check in beginning of
> try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
> the function.

So you're suggesting that this isn't the cause of the bad page flags error
observed by Andrey and myself?
Thanks,
Sasha
Multiple potential races on vma->vm_flags
On Sun, Sep 06, 2015 at 03:21:05PM -0400, Sasha Levin wrote:
> ==
> ThreadSanitizer: data-race in munlock_vma_pages_range
>
> Write of size 8 by thread T378 (K2633, CPU3):
>  [] munlock_vma_pages_range+0x59/0x3e0 mm/mlock.c:425
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

...

> Previous read of size 8 by thread T398 (K2623, CPU2):
>  [] try_to_unmap_one+0x78/0x4f0 mm/rmap.c:1208
>  [< inlined>] rmap_walk+0x147/0x450 rmap_walk_file mm/rmap.c:1540
>  [] rmap_walk+0x147/0x450 mm/rmap.c:1559
>  [] try_to_munlock+0xa2/0xc0 mm/rmap.c:1423
>  [] __munlock_isolated_page+0x30/0x60 mm/mlock.c:129
>  [] __munlock_pagevec+0x236/0x3f0 mm/mlock.c:331
>  [] munlock_vma_pages_range+0x380/0x3e0 mm/mlock.c:476
>  [] mlock_fixup+0x1c9/0x280 mm/mlock.c:549
>  [] do_mlock+0x14c/0x180 mm/mlock.c:589
>  [< inlined>] SyS_munlock+0x74/0xb0 SYSC_munlock mm/mlock.c:651
>  [] SyS_munlock+0x74/0xb0 mm/mlock.c:643
>  [] entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186

Okay, the detected race is mlock/munlock vs. rmap.

On the rmap side we check vma->vm_flags in a few places without taking
vma->vm_mm->mmap_sem. The vma cannot be freed since we hold i_mmap_rwsem
or anon_vma_lock, but nothing prevents vma->vm_flags from changing under
us.

In this particular case, the speculative check at the beginning of
try_to_unmap_one() is fine, since we re-check it under mmap_sem later in
the function.

A false negative is fine here too, since we will mlock the page in
__mm_populate() on the mlock side after mlock_fixup().

BUT.

We *must* have all speculative vm_flags accesses wrapped in READ_ONCE() to
avoid compiler trickery, like duplicating the vm_flags access with
inconsistent results.

I looked only at the VM_LOCKED checks, but there are a few other flags
checked in rmap. All of them must be handled carefully. At least
READ_ONCE() is required.

Another solution would be to introduce a per-vma spinlock to protect
vma->vm_flags and probably other vma fields, and offload this duty from
mmap_sem. But that's a much bigger project.

--
Kirill A. Shutemov