Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> On Thu, 27 Nov 2014, David Hildenbrand wrote: > > > OTOH, there is no reason why we need to disable preemption over that > > > page_fault_disabled() region. There are code pathes which really do > > > not require to disable preemption for that. > > > > > > We have that seperated in preempt-rt for obvious reasons and IIRC > > > Peter Zijlstra tried to distangle it in mainline some time ago. I > > > forgot why that never got merged. > > > > > > > Of course, we can completely separate that in our page fault code by doing > > pagefault_disabled() checks instead of in_atomic() checks (even in add on > > patches later). > > > > > We tie way too much stuff on the preemption count already, which is a > > > mightmare because we have no clear distinction of protection > > > scopes. > > > > Although it might not be optimal, but keeping a separate counter for > > pagefault_disable() as part of the preemption counter seems to be the only > > doable thing right now. > > It needs to be seperate, if it should be useful. Otherwise we just > have a extra accounting in preempt_count() which does exactly the same > thing as we have now: disabling preemption. > > Now you might say, that we could mask out that part when checking > preempt_count, but that wont work on x86 as x86 has the preempt > counter as a per cpu variable and not as a per thread one. Ah right, it's per cpu on x86. So it really belongs to a thread if we want to demangle preemption and pagefault_disable. Would work for now, but for x86 not on the long run. > > But if you want to distangle pagefault disable from preempt disable > then you must move it to the thread, because it is a property of the > thread. preempt count is very much a per cpu counter as you can only > go through schedule when it becomes 0. Thinking about it, this makes perfect sense! > > Btw, I find the x86 representation way more clear, because it > documents that preempt count is a per cpu BKL and not a magic thread > property. And sadly that is how preempt count is used ... > > > I am not sure if a completely separated counter is even possible, > > increasing the size of thread_info. > > And adding a ulong to thread_info is going to create exactly which > problem? If we're allowed to increase the size of thread_info - absolutely fine with me! (I am not sure if some archs have special constraints on the size) Will see what I can come up with. Thanks! > > Thanks, > > tglx > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Thu, 27 Nov 2014, David Hildenbrand wrote: > > OTOH, there is no reason why we need to disable preemption over that > > page_fault_disabled() region. There are code pathes which really do > > not require to disable preemption for that. > > > > We have that seperated in preempt-rt for obvious reasons and IIRC > > Peter Zijlstra tried to distangle it in mainline some time ago. I > > forgot why that never got merged. > > > > Of course, we can completely separate that in our page fault code by doing > pagefault_disabled() checks instead of in_atomic() checks (even in add on > patches later). > > > We tie way too much stuff on the preemption count already, which is a > > mightmare because we have no clear distinction of protection > > scopes. > > Although it might not be optimal, but keeping a separate counter for > pagefault_disable() as part of the preemption counter seems to be the only > doable thing right now. It needs to be seperate, if it should be useful. Otherwise we just have a extra accounting in preempt_count() which does exactly the same thing as we have now: disabling preemption. Now you might say, that we could mask out that part when checking preempt_count, but that wont work on x86 as x86 has the preempt counter as a per cpu variable and not as a per thread one. But if you want to distangle pagefault disable from preempt disable then you must move it to the thread, because it is a property of the thread. preempt count is very much a per cpu counter as you can only go through schedule when it becomes 0. Btw, I find the x86 representation way more clear, because it documents that preempt count is a per cpu BKL and not a magic thread property. And sadly that is how preempt count is used ... > I am not sure if a completely separated counter is even possible, > increasing the size of thread_info. And adding a ulong to thread_info is going to create exactly which problem? Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> From: David Hildenbrand [mailto:d...@linux.vnet.ibm.com] > > > From: David Hildenbrand > > > ... > > > > Although it might not be optimal, but keeping a separate counter for > > > > pagefault_disable() as part of the preemption counter seems to be the > > > > only > > > > doable thing right now. I am not sure if a completely separated counter > > > > is even > > > > possible, increasing the size of thread_info. > > > > > > What about adding (say) 0x1 for the more restrictive test? > > > > > > David > > > > > > > You mean as part of the preempt counter? > > > > The current layout (on my branch) is > > > > * PREEMPT_MASK:0x00ff > > * SOFTIRQ_MASK:0xff00 > > * HARDIRQ_MASK:0x000f > > * NMI_MASK:0x0010 > > * PREEMPT_ACTIVE: 0x0020 > > > > I would have added > > * PAGEFAULT_MASK: 0x03C0 > > I'm not sure where you'd need to add the bits. > > I think the above works because disabling 'HARDIRQ' implicitly > disables 'SOFTIRQ' and 'PREEMPT' (etc), so if 256+ threads > disable PREEMPT everything still works. AFAIK 256+ levels of preempt will break the system :) Therefore with CONFIG_DEBUG_PREEMPT we verify that we don't have any over/underflows. But such bugs can only be found with CONFIG_DEBUG_PREEMPT enabled. > > So if disabling pagefaults implies that pre-emption is disabled > (but SOFTIRQ is still allowed) then you need to insert your bit(s) > between 0xff00 and 0x00ff. > OTOH if disabling pre-emption implies that pagefaults are disabled > then you'd need to use the lsb and change all the above values. > > Which makes me think that 'PREEMPT_ACTIVE' isn't right at all. > Two threads disabling NMIs (or 32 disabling HARDIRQ) won't DTRT. With threads you mean levels? This is a per thread information. > > OTOH I'm only guessing at how this is used. > > David > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
From: David Hildenbrand [mailto:d...@linux.vnet.ibm.com] > > From: David Hildenbrand > > ... > > > Although it might not be optimal, but keeping a separate counter for > > > pagefault_disable() as part of the preemption counter seems to be the only > > > doable thing right now. I am not sure if a completely separated counter > > > is even > > > possible, increasing the size of thread_info. > > > > What about adding (say) 0x1 for the more restrictive test? > > > > David > > > > You mean as part of the preempt counter? > > The current layout (on my branch) is > > * PREEMPT_MASK:0x00ff > * SOFTIRQ_MASK:0xff00 > * HARDIRQ_MASK:0x000f > * NMI_MASK:0x0010 > * PREEMPT_ACTIVE: 0x0020 > > I would have added > * PAGEFAULT_MASK: 0x03C0 I'm not sure where you'd need to add the bits. I think the above works because disabling 'HARDIRQ' implicitly disables 'SOFTIRQ' and 'PREEMPT' (etc), so if 256+ threads disable PREEMPT everything still works. So if disabling pagefaults implies that pre-emption is disabled (but SOFTIRQ is still allowed) then you need to insert your bit(s) between 0xff00 and 0x00ff. OTOH if disabling pre-emption implies that pagefaults are disabled then you'd need to use the lsb and change all the above values. Which makes me think that 'PREEMPT_ACTIVE' isn't right at all. Two threads disabling NMIs (or 32 disabling HARDIRQ) won't DTRT. OTOH I'm only guessing at how this is used. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> From: David Hildenbrand > ... > > Although it might not be optimal, but keeping a separate counter for > > pagefault_disable() as part of the preemption counter seems to be the only > > doable thing right now. I am not sure if a completely separated counter is > > even > > possible, increasing the size of thread_info. > > What about adding (say) 0x1 for the more restrictive test? > > David > You mean as part of the preempt counter? The current layout (on my branch) is * PREEMPT_MASK:0x00ff * SOFTIRQ_MASK:0xff00 * HARDIRQ_MASK:0x000f * NMI_MASK:0x0010 * PREEMPT_ACTIVE: 0x0020 I would have added * PAGEFAULT_MASK: 0x03C0 So 4 bit == 16 levels (tbd) By implementing scope checks in the debug case like done for the regular preempt_count_inc() preempt_count_dec(), we could catch over/underflows. Thanks, David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
From: David Hildenbrand ... > Although it might not be optimal, but keeping a separate counter for > pagefault_disable() as part of the preemption counter seems to be the only > doable thing right now. I am not sure if a completely separated counter is > even > possible, increasing the size of thread_info. What about adding (say) 0x1 for the more restrictive test? David N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a��� 0��h���i
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> OTOH, there is no reason why we need to disable preemption over that > page_fault_disabled() region. There are code pathes which really do > not require to disable preemption for that. > > We have that seperated in preempt-rt for obvious reasons and IIRC > Peter Zijlstra tried to distangle it in mainline some time ago. I > forgot why that never got merged. > Of course, we can completely separate that in our page fault code by doing pagefault_disabled() checks instead of in_atomic() checks (even in add on patches later). > We tie way too much stuff on the preemption count already, which is a > mightmare because we have no clear distinction of protection > scopes. Although it might not be optimal, but keeping a separate counter for pagefault_disable() as part of the preemption counter seems to be the only doable thing right now. I am not sure if a completely separated counter is even possible, increasing the size of thread_info. I am working on a prototype right now. Thanks! > > Thanks, > > tglx > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Thu, 27 Nov 2014, Heiko Carstens wrote: > On Thu, Nov 27, 2014 at 09:03:01AM +0100, David Hildenbrand wrote: > > > Code like > > > spin_lock(&lock); > > > if (copy_to_user(...)) > > > rc = ... > > > spin_unlock(&lock); > > > really *should* generate warnings like it did before. > > > > > > And *only* code like > > > spin_lock(&lock); > > > > Is only code like this valid or also with the spin_lock() dropped? > > (e.g. the access in patch1 if I remember correctly) > > > > So should page_fault_disable() increment the pagefault counter and the > > preempt > > counter or only the first one? > > Given that a sequence like > > page_fault_disable(); > if (copy_to_user(...)) > rc = ... > page_fault_enable(); > > is correct code right now I think page_fault_disable() should increase both. > No need for surprising semantic changes. OTOH, there is no reason why we need to disable preemption over that page_fault_disabled() region. There are code pathes which really do not require to disable preemption for that. We have that seperated in preempt-rt for obvious reasons and IIRC Peter Zijlstra tried to distangle it in mainline some time ago. I forgot why that never got merged. We tie way too much stuff on the preemption count already, which is a mightmare because we have no clear distinction of protection scopes. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> On Thu, Nov 27, 2014 at 09:03:01AM +0100, David Hildenbrand wrote: > > > Code like > > > spin_lock(&lock); > > > if (copy_to_user(...)) > > > rc = ... > > > spin_unlock(&lock); > > > really *should* generate warnings like it did before. > > > > > > And *only* code like > > > spin_lock(&lock); > > > > Is only code like this valid or also with the spin_lock() dropped? > > (e.g. the access in patch1 if I remember correctly) > > > > So should page_fault_disable() increment the pagefault counter and the > > preempt > > counter or only the first one? > > Given that a sequence like > > page_fault_disable(); > if (copy_to_user(...)) > rc = ... > page_fault_enable(); > > is correct code right now I think page_fault_disable() should increase both. > No need for surprising semantic changes. > > > So we would have pagefault code rely on: > > > > in_disabled_pagefault() ( pagefault_disabled() ... whatever ) instead of > > in_atomic(). > > No, let's be more defensive: the page fault handler should do nothing if > in_atomic() just like now. But it could have a quick check and emit a one > time warning if page faults aren't disabled in addition. > That might help debugging but keeps the system more likely alive. Sounds sane if we increase both counters! > > might_fault() however should call might_sleep() if page faults aren't > disabled, but that's what you proposed anyway I think. Jap, sounds good to me. Will see if I can come up with something. Thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Thu, Nov 27, 2014 at 09:03:01AM +0100, David Hildenbrand wrote: > > Code like > > spin_lock(&lock); > > if (copy_to_user(...)) > > rc = ... > > spin_unlock(&lock); > > really *should* generate warnings like it did before. > > > > And *only* code like > > spin_lock(&lock); > > Is only code like this valid or also with the spin_lock() dropped? > (e.g. the access in patch1 if I remember correctly) > > So should page_fault_disable() increment the pagefault counter and the preempt > counter or only the first one? Given that a sequence like page_fault_disable(); if (copy_to_user(...)) rc = ... page_fault_enable(); is correct code right now I think page_fault_disable() should increase both. No need for surprising semantic changes. > So we would have pagefault code rely on: > > in_disabled_pagefault() ( pagefault_disabled() ... whatever ) instead of > in_atomic(). No, let's be more defensive: the page fault handler should do nothing if in_atomic() just like now. But it could have a quick check and emit a one time warning if page faults aren't disabled in addition. That might help debugging but keeps the system more likely alive. might_fault() however should call might_sleep() if page faults aren't disabled, but that's what you proposed anyway I think. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> Code like > spin_lock(&lock); > if (copy_to_user(...)) > rc = ... > spin_unlock(&lock); > really *should* generate warnings like it did before. > > And *only* code like > spin_lock(&lock); Is only code like this valid or also with the spin_lock() dropped? (e.g. the access in patch1 if I remember correctly) So should page_fault_disable() increment the pagefault counter and the preempt counter or only the first one? > page_fault_disable(); > if (copy_to_user(...)) > rc = ... > page_fault_enable(); > spin_unlock(&lock); > should not generate warnings, since the author hopefully knew what he did. > > We could achieve that by e.g. adding a couple of pagefault disabled bits > within current_thread_info()->preempt_count, which would allow > pagefault_disable() and pagefault_enable() to modify a different part of > preempt_count than it does now, so there is a way to tell if pagefaults have > been explicitly disabled or are just a side effect of preemption being > disabled. > This would allow might_fault() to restore its old sane behaviour for the > !page_fault_disabled() case. So we would have pagefault code rely on: in_disabled_pagefault() ( pagefault_disabled() ... whatever ) instead of in_atomic(). I agree with this approach, as this is basically what I suggested in one of my previous mails. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Thu, Nov 27, 2014 at 08:09:19AM +0100, Heiko Carstens wrote: > On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote: > > On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote: > > > > But this one was > giving users in field false positives. > > > > > > So lets try to fix those, ok? If we cant, then tough luck. > > > > Sure. > > I think the simplest way might be to make spinlock disable > > premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. > > > > As a result, userspace access will fail and caller will > > get a nice error. > > Yes, _userspace_ now sees unpredictable behaviour, instead of that the > kernel emits a big loud warning to the console. So I don't object to adding more debugging at all. Sure, would be nice. But the fix is not an unconditional might_sleep within might_fault, this would trigger false positives. Rather, detect that you took a spinlock without disabling preemption. > Please consider this simple example: > > int bar(char __user *ptr) > { > ... > if (copy_to_user(ptr, ...) > return -EFAULT; > ... > } > > SYSCALL_DEFINE1(foo, char __user *, ptr) > { > int rc; > > ... > rc = bar(ptr); > if (rc) > goto out; > ... > out: > return rc; > } > > The above simple system call just works fine, with and without your change, > however if somebody (incorrectly) changes sys_foo() to the code below: > > spin_lock(&lock); > rc = bar(ptr); > if (rc) > goto out; > out: > spin_unlock(&lock); > return rc; > > Broken code like above used to generate warnings. With your change we won't > see any warnings anymore. Instead we get random and bad behaviour: > > For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see > a fault, potentially schedule and potentially deadlock on &lock. > Without _any_ warning anymore. > > For CONFIG_PREEMPT if the page at ptr is mapped, everthing works. However if > the page is not mapped, userspace now all of the sudden will see an invalid(!) > -EFAULT return code, instead of that the kernel resolved the page fault. > Yes, the kernel can't resolve the fault since we hold a spinlock. But the > above bogus code did give warnings to give you an idea that something probably > is not correct. > > Who on earth is supposed to debug crap like this??? > > What we really want is: > > Code like > spin_lock(&lock); > if (copy_to_user(...)) > rc = ... > spin_unlock(&lock); > really *should* generate warnings like it did before. > > And *only* code like > spin_lock(&lock); > page_fault_disable(); > if (copy_to_user(...)) > rc = ... > page_fault_enable(); > spin_unlock(&lock); > should not generate warnings, since the author hopefully knew what he did. > > We could achieve that by e.g. adding a couple of pagefault disabled bits > within current_thread_info()->preempt_count, which would allow > pagefault_disable() and pagefault_enable() to modify a different part of > preempt_count than it does now, so there is a way to tell if pagefaults have > been explicitly disabled or are just a side effect of preemption being > disabled. > This would allow might_fault() to restore its old sane behaviour for the > !page_fault_disabled() case. Exactly. I agree, that would be a useful debugging tool. In fact this comment in mm/memory.c hints at this: * it would be nicer only to annotate paths which are not under * pagefault_disable, it further says * however that requires a larger audit and * providing helpers like get_user_atomic. but I think that what you outline is a better way to do this. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote: > On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote: > > > But this one was > giving users in field false positives. > > > > So lets try to fix those, ok? If we cant, then tough luck. > > Sure. > I think the simplest way might be to make spinlock disable > premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. > > As a result, userspace access will fail and caller will > get a nice error. Yes, _userspace_ now sees unpredictable behaviour, instead of that the kernel emits a big loud warning to the console. Please consider this simple example: int bar(char __user *ptr) { ... if (copy_to_user(ptr, ...) return -EFAULT; ... } SYSCALL_DEFINE1(foo, char __user *, ptr) { int rc; ... rc = bar(ptr); if (rc) goto out; ... out: return rc; } The above simple system call just works fine, with and without your change, however if somebody (incorrectly) changes sys_foo() to the code below: spin_lock(&lock); rc = bar(ptr); if (rc) goto out; out: spin_unlock(&lock); return rc; Broken code like above used to generate warnings. With your change we won't see any warnings anymore. Instead we get random and bad behaviour: For !CONFIG_PREEMPT if the page at ptr is not mapped, the kernel will see a fault, potentially schedule and potentially deadlock on &lock. Without _any_ warning anymore. For CONFIG_PREEMPT if the page at ptr is mapped, everthing works. However if the page is not mapped, userspace now all of the sudden will see an invalid(!) -EFAULT return code, instead of that the kernel resolved the page fault. Yes, the kernel can't resolve the fault since we hold a spinlock. But the above bogus code did give warnings to give you an idea that something probably is not correct. Who on earth is supposed to debug crap like this??? What we really want is: Code like spin_lock(&lock); if (copy_to_user(...)) rc = ... spin_unlock(&lock); really *should* generate warnings like it did before. And *only* code like spin_lock(&lock); page_fault_disable(); if (copy_to_user(...)) rc = ... page_fault_enable(); spin_unlock(&lock); should not generate warnings, since the author hopefully knew what he did. We could achieve that by e.g. adding a couple of pagefault disabled bits within current_thread_info()->preempt_count, which would allow pagefault_disable() and pagefault_enable() to modify a different part of preempt_count than it does now, so there is a way to tell if pagefaults have been explicitly disabled or are just a side effect of preemption being disabled. This would allow might_fault() to restore its old sane behaviour for the !page_fault_disabled() case. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 07:04:47PM +0200, Michael S. Tsirkin wrote: > On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote: > > > But this one was > giving users in field false positives. > > > > So lets try to fix those, ok? If we cant, then tough luck. > > Sure. > I think the simplest way might be to make spinlock disable > premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. Specifically maybe DEBUG_ATOMIC_SLEEP should select PREEMPT_COUNT? > As a result, userspace access will fail and caller will > get a nice error. > > > > > But coming up with wrong statements is not helpful. > > True. Sorry that I did that. > > > > > > > The point is that *_user is safe with preempt off. > > > It returns an error gracefully. > > > It does not sleep. > > > It does not trigger the scheduler in that context. > > > > There are special cases where your statement is true. But its not in > > general. > > copy_to_user might fault and that fault might sleep and reschedule. > > Yes. But not if called inatomic. > > > > > For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and > > all these functions could do an GFP_KERNEL allocation. Which might sleep. > > Which will schedule. > > > > > > > > > > > > > David's patch makes it say it does, so it's wrong. > > > > > > > > > > > Absolutely. > I think you can already debug your case easily, by enabling CONFIG_PREEMPT. > This seems counter-intuitive, and distro debug kernels don't seem to do this. > > -- > MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:51:08PM +0100, Christian Borntraeger wrote: > > But this one was > giving users in field false positives. > > So lets try to fix those, ok? If we cant, then tough luck. Sure. I think the simplest way might be to make spinlock disable premption when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. As a result, userspace access will fail and caller will get a nice error. > But coming up with wrong statements is not helpful. True. Sorry that I did that. > > > > The point is that *_user is safe with preempt off. > > It returns an error gracefully. > > It does not sleep. > > It does not trigger the scheduler in that context. > > There are special cases where your statement is true. But its not in general. > copy_to_user might fault and that fault might sleep and reschedule. Yes. But not if called inatomic. > For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and all > these functions could do an GFP_KERNEL allocation. Which might sleep. Which > will schedule. > > > > > > > > David's patch makes it say it does, so it's wrong. > > > > > > Absolutely. I think you can already debug your case easily, by enabling CONFIG_PREEMPT. This seems counter-intuitive, and distro debug kernels don't seem to do this. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Am 26.11.2014 um 17:32 schrieb Michael S. Tsirkin: [...] This is what happened on our side (very recent kernel): spin_lock(&lock) copy_to_user(...) spin_unlock(&lock) >>> >>> That's a deadlock even without copy_to_user - it's >>> enough for the thread to be preempted and another one >>> to try taking the lock. >> >> Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt >> = server anyway). > > Are you sure? Can you point me where it does this please? spin_lock --> raw_spin_lock --> _raw_spin_lock --> __raw_spin_lock static inline void __raw_spin_lock(raw_spinlock_t *lock) { > preempt_disable(); <- spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock); } Michael, please be serious. The whole kernel would be broken if spin_lock would not disable preemption. > >> But please: One step back. The problem is not the good path. The problem is >> that we lost a debugging aid for a known to be broken case. In other words: >> Our code had a bug. Older kernels detected that kind of bug. With your >> change we no longer saw the sleeping while atomic. Thats it. See my other >> mail. >> >> Christian > > You want to add more debugging tools, fine. We dont want to add, we want to fix something that used to work > But this one was > giving users in field false positives. So lets try to fix those, ok? If we cant, then tough luck. But coming up with wrong statements is not helpful. > > The point is that *_user is safe with preempt off. > It returns an error gracefully. > It does not sleep. > It does not trigger the scheduler in that context. There are special cases where your statement is true. But its not in general. copy_to_user might fault and that fault might sleep and reschedule. For example handle_mm_fault might go down to pud_alloc, pmd_alloc etc and all these functions could do an GFP_KERNEL allocation. Which might sleep. Which will schedule. > > > David's patch makes it say it does, so it's wrong. > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:30:35PM +0100, Christian Borntraeger wrote: > Am 26.11.2014 um 17:19 schrieb Michael S. Tsirkin: > > On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote: > This is what happened on our side (very recent kernel): > > spin_lock(&lock) > copy_to_user(...) > spin_unlock(&lock) > >>> > >>> That's a deadlock even without copy_to_user - it's > >>> enough for the thread to be preempted and another one > >>> to try taking the lock. > >>> > >>> > 1. s390 locks/unlocks a spin lock with a compare and swap, using the > _cpu id_ > as "old value" > 2. we slept during copy_to_user() > 3. the thread got scheduled onto another cpu > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that > locked > the spinlock tried to unlocked it). > 5. lock remained locked -> deadlock > > Christian came up with the following explanation: > Without preemption, spin_lock() will not touch the preempt counter. > disable_pfault() will always touch it. > > Therefore, with preemption disabled, copy_to_user() has no idea that it > is > running in atomic context - and will therefore try to sleep. > > So copy_to_user() will on s390: > 1. run "as atomic" while spin_lock() with preemption enabled. > 2. run "as not atomic" while spin_lock() with preemption disabled. > 3. run "as atomic" while pagefault_disabled() with preemption enabled or > disabled. > 4. run "as not atomic" when really not atomic. > >> > >> should have been more clear at that point: > >> preemption enabled == kernel compiled with preemption support > >> preemption disabled == kernel compiled without preemption support > >> > > And exactly nr 2. is the thing that produced the deadlock in our > scenario and > the reason why I want a might_sleep() :) > >>> > >>> IMHO it's not copy to user that causes the problem. > >>> It's the misuse of spinlocks with preemption on. > >> > >> As I said, preemption was off. > > > > off -> disabled at compile time? > > > > But the code is broken for people that do enable it. > [...] > > You should normally disable preemption if you take > > spinlocks. > > Your are telling that any sequence of > spin_lock > ... > spin_unlock > > is broken with CONFIG_PREEMPT? > Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just > fine. > > Only sequences like > spin_lock > ... > schedule > ... > spin_unlock > are broken. > > But as I said. That is not the problem that we are discussing here. > > Christian I'm saying spin_lock without _irqsave is often a bug. I am also saying this code in mm/fault.c: __do_page_fault ... /* * If we're in an interrupt, have no user context or are running * in an atomic region then we must not take the fault: */ if (unlikely(in_atomic() || !mm)) { bad_area_nosemaphore(regs, error_code, address); return; } means that a fault won't cause sleep if called in atomic context. And a bunch of code relies on this. This is why might_fault does: * it would be nicer only to annotate paths which are not under * pagefault_disable, however that requires a larger audit and * providing helpers like get_user_atomic. */ if (in_atomic()) return; __might_sleep(__FILE__, __LINE__, 0); If you see this violated, let's figure out why. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:07:13PM +0100, Christian Borntraeger wrote: > Am 26.11.2014 um 16:47 schrieb Michael S. Tsirkin: > > On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote: > >>> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > >> What's the path you are trying to debug? > > > > Well, we had a problem where we held a spin_lock and called > > copy_(from|to)_user(). We experienced very random deadlocks that took > > some guy > > almost a week to debug. The simple might_sleep() check would have > > showed this > > error immediately. > > This must have been a very old kernel. > A modern kernel will return an error from copy_to_user. > Which is really the point of the patch you are trying to revert. > >>> > >>> That's assuming you disabled preemption. If you didn't, and take > >>> a spinlock, you have deadlocks even without userspace access. > >>> > >> > >> (Thanks for your resent, my first email was sent directly to you ... grml) > >> > >> This is what happened on our side (very recent kernel): > >> > >> spin_lock(&lock) > >> copy_to_user(...) > >> spin_unlock(&lock) > > > > That's a deadlock even without copy_to_user - it's > > enough for the thread to be preempted and another one > > to try taking the lock. > > Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = > server anyway). Are you sure? Can you point me where it does this please? > But please: One step back. The problem is not the good path. The problem is > that we lost a debugging aid for a known to be broken case. In other words: > Our code had a bug. Older kernels detected that kind of bug. With your change > we no longer saw the sleeping while atomic. Thats it. See my other mail. > > Christian You want to add more debugging tools, fine. But this one was giving users in field false positives. The point is that *_user is safe with preempt off. It returns an error gracefully. It does not sleep. It does not trigger the scheduler in that context. David's patch makes it say it does, so it's wrong. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Am 26.11.2014 um 17:19 schrieb Michael S. Tsirkin: > On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote: This is what happened on our side (very recent kernel): spin_lock(&lock) copy_to_user(...) spin_unlock(&lock) >>> >>> That's a deadlock even without copy_to_user - it's >>> enough for the thread to be preempted and another one >>> to try taking the lock. >>> >>> 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_ as "old value" 2. we slept during copy_to_user() 3. the thread got scheduled onto another cpu 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked the spinlock tried to unlocked it). 5. lock remained locked -> deadlock Christian came up with the following explanation: Without preemption, spin_lock() will not touch the preempt counter. disable_pfault() will always touch it. Therefore, with preemption disabled, copy_to_user() has no idea that it is running in atomic context - and will therefore try to sleep. So copy_to_user() will on s390: 1. run "as atomic" while spin_lock() with preemption enabled. 2. run "as not atomic" while spin_lock() with preemption disabled. 3. run "as atomic" while pagefault_disabled() with preemption enabled or disabled. 4. run "as not atomic" when really not atomic. >> >> should have been more clear at that point: >> preemption enabled == kernel compiled with preemption support >> preemption disabled == kernel compiled without preemption support >> And exactly nr 2. is the thing that produced the deadlock in our scenario and the reason why I want a might_sleep() :) >>> >>> IMHO it's not copy to user that causes the problem. >>> It's the misuse of spinlocks with preemption on. >> >> As I said, preemption was off. > > off -> disabled at compile time? > > But the code is broken for people that do enable it. [...] > You should normally disable preemption if you take > spinlocks. Your are telling that any sequence of spin_lock ... spin_unlock is broken with CONFIG_PREEMPT? Michael, that is bullshit. spin_lock will take care of CONFIG_PREEMPT just fine. Only sequences like spin_lock ... schedule ... spin_unlock are broken. But as I said. That is not the problem that we are discussing here. Christian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:02:23PM +0100, David Hildenbrand wrote: > > > This is what happened on our side (very recent kernel): > > > > > > spin_lock(&lock) > > > copy_to_user(...) > > > spin_unlock(&lock) > > > > That's a deadlock even without copy_to_user - it's > > enough for the thread to be preempted and another one > > to try taking the lock. > > > > > > > 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu > > > id_ > > >as "old value" > > > 2. we slept during copy_to_user() > > > 3. the thread got scheduled onto another cpu > > > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that > > > locked > > >the spinlock tried to unlocked it). > > > 5. lock remained locked -> deadlock > > > > > > Christian came up with the following explanation: > > > Without preemption, spin_lock() will not touch the preempt counter. > > > disable_pfault() will always touch it. > > > > > > Therefore, with preemption disabled, copy_to_user() has no idea that it is > > > running in atomic context - and will therefore try to sleep. > > > > > > So copy_to_user() will on s390: > > > 1. run "as atomic" while spin_lock() with preemption enabled. > > > 2. run "as not atomic" while spin_lock() with preemption disabled. > > > 3. run "as atomic" while pagefault_disabled() with preemption enabled or > > > disabled. > > > 4. run "as not atomic" when really not atomic. > > should have been more clear at that point: > preemption enabled == kernel compiled with preemption support > preemption disabled == kernel compiled without preemption support > > > > > > > And exactly nr 2. is the thing that produced the deadlock in our scenario > > > and > > > the reason why I want a might_sleep() :) > > > > IMHO it's not copy to user that causes the problem. > > It's the misuse of spinlocks with preemption on. > > As I said, preemption was off. off -> disabled at compile time? But the code is broken for people that do enable it. > > > > So might_sleep would make you think copy_to_user is > > the problem, and e.g. let you paper over it by > > moving copy_to_user out. > > Actually implementing different way of locking easily fixed the problem for > us. > The old might_sleep() checks would have given us the problem within a few > seconds (I tested it). Or enable CONFIG_PREMPT, with same effect (copy_to_user will report an error). Do you check return code from copy to user? If not then you have another bug ... > > > > Enable lock prover and you will see what the real > > issue is, which is you didn't disable preempt. > > and if you did, copy_to_user would be okay. > > > > Our kernel is compiled without preemption and we turned on all lock/atomic > sleep debugging aid. No problem was detected. But your code is still buggy with preemption on, isn't it? > > But the question is if we shouldn't rather provide a: > > copy_to_user_nosleep() implementation that can be called from > pagefault_disable() because it won't sleep. > and a > copy_to_user_sleep() implementation that cannot be called from > pagefault_disable(). > > Another way to fix it would be a reworked pagefault_disable() function that > somehow sets "a flag", so copy_to_user() knows that it is in fact called from > a > valid context, not just from "some atomic" context. So we could trigger > might_sleep() when detecting a !pagefault_disable contex I think all this is just directing people to paper over the problem. You should normally disable preemption if you take spinlocks. Yes it might happen to work if preempt is compiled out and you don't trigger scheduler, but Linux might add scheduler calls at any point without notice, code must be preempt safe. Maybe add a debug option warning about spinlocks taken with preempt on. That would make sense I think. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Am 26.11.2014 um 16:47 schrieb Michael S. Tsirkin: > On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote: >>> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: >> What's the path you are trying to debug? > > Well, we had a problem where we held a spin_lock and called > copy_(from|to)_user(). We experienced very random deadlocks that took > some guy > almost a week to debug. The simple might_sleep() check would have showed > this > error immediately. This must have been a very old kernel. A modern kernel will return an error from copy_to_user. Which is really the point of the patch you are trying to revert. >>> >>> That's assuming you disabled preemption. If you didn't, and take >>> a spinlock, you have deadlocks even without userspace access. >>> >> >> (Thanks for your resent, my first email was sent directly to you ... grml) >> >> This is what happened on our side (very recent kernel): >> >> spin_lock(&lock) >> copy_to_user(...) >> spin_unlock(&lock) > > That's a deadlock even without copy_to_user - it's > enough for the thread to be preempted and another one > to try taking the lock. Huh? With CONFIG_PREEMPT spin_lock will disable preemption. (we had preempt = server anyway). But please: One step back. The problem is not the good path. The problem is that we lost a debugging aid for a known to be broken case. In other words: Our code had a bug. Older kernels detected that kind of bug. With your change we no longer saw the sleeping while atomic. Thats it. See my other mail. Christian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> > This is what happened on our side (very recent kernel): > > > > spin_lock(&lock) > > copy_to_user(...) > > spin_unlock(&lock) > > That's a deadlock even without copy_to_user - it's > enough for the thread to be preempted and another one > to try taking the lock. > > > > 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu > > id_ > >as "old value" > > 2. we slept during copy_to_user() > > 3. the thread got scheduled onto another cpu > > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked > >the spinlock tried to unlocked it). > > 5. lock remained locked -> deadlock > > > > Christian came up with the following explanation: > > Without preemption, spin_lock() will not touch the preempt counter. > > disable_pfault() will always touch it. > > > > Therefore, with preemption disabled, copy_to_user() has no idea that it is > > running in atomic context - and will therefore try to sleep. > > > > So copy_to_user() will on s390: > > 1. run "as atomic" while spin_lock() with preemption enabled. > > 2. run "as not atomic" while spin_lock() with preemption disabled. > > 3. run "as atomic" while pagefault_disabled() with preemption enabled or > > disabled. > > 4. run "as not atomic" when really not atomic. should have been more clear at that point: preemption enabled == kernel compiled with preemption support preemption disabled == kernel compiled without preemption support > > > > And exactly nr 2. is the thing that produced the deadlock in our scenario > > and > > the reason why I want a might_sleep() :) > > IMHO it's not copy to user that causes the problem. > It's the misuse of spinlocks with preemption on. As I said, preemption was off. > > So might_sleep would make you think copy_to_user is > the problem, and e.g. let you paper over it by > moving copy_to_user out. Actually implementing different way of locking easily fixed the problem for us. The old might_sleep() checks would have given us the problem within a few seconds (I tested it). > > Enable lock prover and you will see what the real > issue is, which is you didn't disable preempt. > and if you did, copy_to_user would be okay. > Our kernel is compiled without preemption and we turned on all lock/atomic sleep debugging aid. No problem was detected. But the question is if we shouldn't rather provide a: copy_to_user_nosleep() implementation that can be called from pagefault_disable() because it won't sleep. and a copy_to_user_sleep() implementation that cannot be called from pagefault_disable(). Another way to fix it would be a reworked pagefault_disable() function that somehow sets "a flag", so copy_to_user() knows that it is in fact called from a valid context, not just from "some atomic" context. So we could trigger might_sleep() when detecting a !pagefault_disable context. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Am 26.11.2014 um 16:37 schrieb Michael S. Tsirkin: > On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote: >> Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin: >>> On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > What's the path you are trying to debug? Well, we had a problem where we held a spin_lock and called copy_(from|to)_user(). We experienced very random deadlocks that took some guy almost a week to debug. The simple might_sleep() check would have showed this error immediately. >>> >> >>> This must have been a very old kernel. >>> A modern kernel will return an error from copy_to_user. >> >> I disagree. copy_to_user will not return while holding a spinlock, because >> it does not know! How should it? >> See: spin_lock will call preempt_disable, but thats a no-op for a >> non-preempt kernel. So the mere fact that we hold a spin_lock is not known >> by any user access function. (or others). No? >> >> Christian >> >> > > Well might_sleep() merely checks preempt count and irqs_disabled too. > If you want debugging things to trigger, you need to enable > a bunch of config options. That's not new. You miss the point of the whole thread: The problem is that even with debug options enabled, holding a spinlock would not trigger a bug on copy_to_user. So the problem is not the good path, the problem is that a debugging aid for detecting a broken case was lost. Even with all kernel debugging enabled. That is because CONFIG_DEBUG_ATOMIC_SLEEP selects PREEMPT_COUNT. That means: spin_lock will then be considered as in_atomic and no message comes. Without CONFIG_DEBUG_ATOMIC_SLEEP spin_lock will not touch the preempt_count but we also dont see a message because might_fault is now a nop I understand that you dont like Davids changes due to other side effects that you have mentioned. So lets focus on how we can fix the debug option. Ok? Christian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 04:32:07PM +0100, David Hildenbrand wrote: > > On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: > > > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > > > > > What's the path you are trying to debug? > > > > > > > > Well, we had a problem where we held a spin_lock and called > > > > copy_(from|to)_user(). We experienced very random deadlocks that took > > > > some guy > > > > almost a week to debug. The simple might_sleep() check would have > > > > showed this > > > > error immediately. > > > > > > This must have been a very old kernel. > > > A modern kernel will return an error from copy_to_user. > > > Which is really the point of the patch you are trying to revert. > > > > That's assuming you disabled preemption. If you didn't, and take > > a spinlock, you have deadlocks even without userspace access. > > > > (Thanks for your resent, my first email was sent directly to you ... grml) > > This is what happened on our side (very recent kernel): > > spin_lock(&lock) > copy_to_user(...) > spin_unlock(&lock) That's a deadlock even without copy_to_user - it's enough for the thread to be preempted and another one to try taking the lock. > 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_ >as "old value" > 2. we slept during copy_to_user() > 3. the thread got scheduled onto another cpu > 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked >the spinlock tried to unlocked it). > 5. lock remained locked -> deadlock > > Christian came up with the following explanation: > Without preemption, spin_lock() will not touch the preempt counter. > disable_pfault() will always touch it. > > Therefore, with preemption disabled, copy_to_user() has no idea that it is > running in atomic context - and will therefore try to sleep. > > So copy_to_user() will on s390: > 1. run "as atomic" while spin_lock() with preemption enabled. > 2. run "as not atomic" while spin_lock() with preemption disabled. > 3. run "as atomic" while pagefault_disabled() with preemption enabled or > disabled. > 4. run "as not atomic" when really not atomic. > > And exactly nr 2. is the thing that produced the deadlock in our scenario and > the reason why I want a might_sleep() :) IMHO it's not copy to user that causes the problem. It's the misuse of spinlocks with preemption on. So might_sleep would make you think copy_to_user is the problem, and e.g. let you paper over it by moving copy_to_user out. Enable lock prover and you will see what the real issue is, which is you didn't disable preempt. and if you did, copy_to_user would be okay. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 04:30:32PM +0100, Christian Borntraeger wrote: > Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin: > > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > >>> What's the path you are trying to debug? > >> > >> Well, we had a problem where we held a spin_lock and called > >> copy_(from|to)_user(). We experienced very random deadlocks that took some > >> guy > >> almost a week to debug. The simple might_sleep() check would have showed > >> this > >> error immediately. > > > > > This must have been a very old kernel. > > A modern kernel will return an error from copy_to_user. > > I disagree. copy_to_user will not return while holding a spinlock, because it > does not know! How should it? > See: spin_lock will call preempt_disable, but thats a no-op for a non-preempt > kernel. So the mere fact that we hold a spin_lock is not known by any user > access function. (or others). No? > > Christian > > Well might_sleep() merely checks preempt count and irqs_disabled too. If you want debugging things to trigger, you need to enable a bunch of config options. That's not new. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
> On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: > > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > > > > What's the path you are trying to debug? > > > > > > Well, we had a problem where we held a spin_lock and called > > > copy_(from|to)_user(). We experienced very random deadlocks that took > > > some guy > > > almost a week to debug. The simple might_sleep() check would have showed > > > this > > > error immediately. > > > > This must have been a very old kernel. > > A modern kernel will return an error from copy_to_user. > > Which is really the point of the patch you are trying to revert. > > That's assuming you disabled preemption. If you didn't, and take > a spinlock, you have deadlocks even without userspace access. > (Thanks for your resent, my first email was sent directly to you ... grml) This is what happened on our side (very recent kernel): spin_lock(&lock) copy_to_user(...) spin_unlock(&lock) 1. s390 locks/unlocks a spin lock with a compare and swap, using the _cpu id_ as "old value" 2. we slept during copy_to_user() 3. the thread got scheduled onto another cpu 4. spin_unlock failed as the _cpu id_ didn't match (another cpu that locked the spinlock tried to unlocked it). 5. lock remained locked -> deadlock Christian came up with the following explanation: Without preemption, spin_lock() will not touch the preempt counter. disable_pfault() will always touch it. Therefore, with preemption disabled, copy_to_user() has no idea that it is running in atomic context - and will therefore try to sleep. So copy_to_user() will on s390: 1. run "as atomic" while spin_lock() with preemption enabled. 2. run "as not atomic" while spin_lock() with preemption disabled. 3. run "as atomic" while pagefault_disabled() with preemption enabled or disabled. 4. run "as not atomic" when really not atomic. And exactly nr 2. is the thing that produced the deadlock in our scenario and the reason why I want a might_sleep() :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Am 26.11.2014 um 16:17 schrieb Michael S. Tsirkin: > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: >>> What's the path you are trying to debug? >> >> Well, we had a problem where we held a spin_lock and called >> copy_(from|to)_user(). We experienced very random deadlocks that took some >> guy >> almost a week to debug. The simple might_sleep() check would have showed this >> error immediately. > > This must have been a very old kernel. > A modern kernel will return an error from copy_to_user. I disagree. copy_to_user will not return while holding a spinlock, because it does not know! How should it? See: spin_lock will call preempt_disable, but thats a no-op for a non-preempt kernel. So the mere fact that we hold a spin_lock is not known by any user access function. (or others). No? Christian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 05:17:29PM +0200, Michael S. Tsirkin wrote: > On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > > > What's the path you are trying to debug? > > > > Well, we had a problem where we held a spin_lock and called > > copy_(from|to)_user(). We experienced very random deadlocks that took some > > guy > > almost a week to debug. The simple might_sleep() check would have showed > > this > > error immediately. > > This must have been a very old kernel. > A modern kernel will return an error from copy_to_user. > Which is really the point of the patch you are trying to revert. That's assuming you disabled preemption. If you didn't, and take a spinlock, you have deadlocks even without userspace access. -- MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Hmm you sent same mail to me off-list, and I replied there. Now there's a copy on list - I'm just going to assume it's exactly identical, pasting my response here as well. If there are more questions I missed, let me know. On Wed, Nov 26, 2014 at 09:23:31AM +0100, David Hildenbrand wrote: > Sorry I haven't put you on cc, must have lost you while packing my list :) > Thanks for your answer! > > This change was introduced in 2013, and I haven't seen an instance making use > of your described scenario, right? My work is still out of tree. RHEL6 shipped some patches that use this. I don't know whether any instances in-tree use this capability. But it just doesn't make sense for might_fault to call might_sleep because a fault does not imply sleep. > > On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote: > > > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466 > > > removed/skipped all might_sleep checks for might_fault() calls when in > > > atomic. > > > > Yes. You can add e.g. might_sleep in your code if, for some reason, it is > > illegal to call it in an atomic context. > > But faults are legal in atomic if you handle the possible > > errors, and faults do not necessary cause caller to sleep, so might_fault > > should not call might_sleep. > > I think we should use in_atomic variants for this purpose (as done in the code > for now) - especially as pagefault_disable has been relying on the preempt > count for a long time. But see my comment below about existing documentation. IIUC they are not interchangeable. *in_atomic seems to require that page is pinned. *user does not: it installs a fixup so you get an error code if you try to access an invalid address. Besides, this would just lead to a ton of if (atomic) return inatomic else return user in code, for no good purpose. > > > > > Reason was to allow calling copy_(to|from)_user while in > > > pagefault_disabled(), > > > because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives. > > > > That wasn't the only reason BTW. > > Andi Kleen also showed that compiler did a terrible job optimizing > > get/put user when might_sleep was called. > > might_fault() should never call might_sleep() when lock debugging is off, so > there should be no performance problem or am I missing something? CONFIG_DEBUG_ATOMIC_SLEEP too, no? > > See e.g. this thread: > > https://lkml.org/lkml/2013/8/14/652 > > There was even an lwn.net article about it, that I don't seem to be > > able to locate. > > Thanks, will see if I can find it. > > > So might_sleep is not appropriate for lightweight operations like > > __get_user, > > which people literally expect to be a single instruction. > > Yes, as discussed below, __.* variants should never call it. So that would be even more inconsistent. They fault exactly the same as the non __ variants. > > > > > > I also have a project going which handles very short packets by copying > > them into guest memory directly without waking up a thread. > > I do it by calling recvmsg on a socket, then switching to > > a thread if I get back EFAULT. > > Not yet clean enough to upstream but it does seem to cut > > the latency down quite a bit, so I'd like to have the option. > > > > > > Generally, a caller that does something like this under a spinlock: > > preempt_disable > > pagefault_disable > > error = copy_to_user > > pagefault_enable > > preempt_enable_no_resched > > > > is not doing anything wrong and should not get a warning, > > as long as error is handled correctly later. > > I think this would be a perfect fit for an inatomic variant, no? No because inatomic does not handle faults. > I mean even the copy_to_user documentation of e.g. s390, x86, mips > clearly says: > "Context: User context only.>-This function may sleep." So the comment is incomplete - I didn't get around fixing that. It may sleep but not in atomic context. > So calling it from your described scenario is wrong. And as the interface > says, > it might_sleep() and therefore also call the check in might_fault(). Exactly the reverse. There's no way for it to sleep except on fault and that only if preempttion is on. > > > > You can also find the discussion around the patches > > educational: > > http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928 > > > > > However > > > we have the inatomic variants of these function for this purpose. > > > > Does inatomic install fixups now? > > I think this varies between architectures but I am no expert. But as 99,9% of > all pagefault_disable code uses inatomic, I would have guessed that this is > rather the way to got than simply using the non atomic variant that clearly > states on the interface that it might sleep? Let's go ahead and make the comment more exact then. > > > > Last I checked, it didn't so it did not satisfy this purpose. > > See this comment from x86: > > > > * Copy data from k
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Wed, Nov 26, 2014 at 11:05:04AM +0100, David Hildenbrand wrote: > > What's the path you are trying to debug? > > Well, we had a problem where we held a spin_lock and called > copy_(from|to)_user(). We experienced very random deadlocks that took some guy > almost a week to debug. The simple might_sleep() check would have showed this > error immediately. This must have been a very old kernel. A modern kernel will return an error from copy_to_user. Which is really the point of the patch you are trying to revert. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
Hi Michael, thanks for your reply! some general thought: This change was introduced mid 2013 but we don't have a single user relying on this code change yet, right? Disabling might_sleep() for functions that clearly state that they may sleep to get some special case running feels wrong to me. > On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote: > > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466 > > removed/skipped all might_sleep checks for might_fault() calls when in > > atomic. > > Yes. You can add e.g. might_sleep in your code if, for some reason, it is > illegal to call it in an atomic context. > But faults are legal in atomic if you handle the possible > errors, and faults do not necessary cause caller to sleep, so might_fault > should not call might_sleep. My point is that and (almost at) everywhere where we use pagefault_disable, we are using the inatomic variants. Also the documentation of copy_to_user() clearly states at various points that this function "may sleep": -> git grep "This function may sleep" yields "Context: User context only. This function may sleep." e.g. s390, x86, mips Patching out the might_sleep() from these functions seems more to be a hack to solve another problem - not using the inatomic variants or finding them not to be sufficient for a task? So calling might_sleep() in these functions seems very right to me. > > > Reason was to allow calling copy_(to|from)_user while in > > pagefault_disabled(), > > because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives. > > That wasn't the only reason BTW. > Andi Kleen also showed that compiler did a terrible job optimizing > get/put user when might_sleep was called. > See e.g. this thread: > https://lkml.org/lkml/2013/8/14/652 > There was even an lwn.net article about it, that I don't seem to be > able to locate. Thanks, I'll try to look it up! but: might_sleep() will only be called when lock debugging / sleep in atomic is in, so this doesn't seem to be a problem for me in a production environment. or am I missing something? > So might_sleep is not appropriate for lightweight operations like __get_user, > which people literally expect to be a single instruction. I agree that __.* variants should not call might_fault() (like I mentioned after the table below). > > > I also have a project going which handles very short packets by copying > them into guest memory directly without waking up a thread. > I do it by calling recvmsg on a socket, then switching to > a thread if I get back EFAULT. > Not yet clean enough to upstream but it does seem to cut > the latency down quite a bit, so I'd like to have the option. > > > Generally, a caller that does something like this under a spinlock: > preempt_disable > pagefault_disable > error = copy_to_user So why can't we use the inatomic variant? This seems to be atomic environment to me. Calling a function that states that it may sleep doesn't feel right to me. > pagefault_enable > preempt_enable_no_resched > > is not doing anything wrong and should not get a warning, > as long as error is handled correctly later. > > You can also find the discussion around the patches > educational: > http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928 > > > However > > we have the inatomic variants of these function for this purpose. > > Does inatomic install fixups now? If not, I think this would rather be the way to go. > > Last I checked, it didn't so it did not satisfy this purpose. > See this comment from x86: > > * Copy data from kernel space to user space. Caller must check > * the specified block with access_ok() before calling this function. > * The caller should also make sure he pins the user space address > * so that we don't result in page fault and sleep. > > > Also - switching to inatomic would scatter if (atomic) all > over code. It's much better to simply call the same > function (e.g. recvmsg) and have it fail gracefully: > after all, we have code to handle get/put user errors > anyway. > > > The result of this change was that all guest access (that correctly uses > > might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP > > is > > enabled. We lost a mighty debugging feature for user access. > > What's the path you are trying to debug? Well, we had a problem where we held a spin_lock and called copy_(from|to)_user(). We experienced very random deadlocks that took some guy almost a week to debug. The simple might_sleep() check would have showed this error immediately. I would have said that in 99,9% of all copy_to_user() calls we want to check might_sleep(). That pagefault_disable() is a special case that should be handled differently - in my opinion. > > If your code can faults, then it's safe to call > from atomic context. > If it can't, it must pin the page. You can not do access_ok checks and > then assume acces
Re: [RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic
On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote: > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466 > removed/skipped all might_sleep checks for might_fault() calls when in atomic. Yes. You can add e.g. might_sleep in your code if, for some reason, it is illegal to call it in an atomic context. But faults are legal in atomic if you handle the possible errors, and faults do not necessary cause caller to sleep, so might_fault should not call might_sleep. > Reason was to allow calling copy_(to|from)_user while in pagefault_disabled(), > because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives. That wasn't the only reason BTW. Andi Kleen also showed that compiler did a terrible job optimizing get/put user when might_sleep was called. See e.g. this thread: https://lkml.org/lkml/2013/8/14/652 There was even an lwn.net article about it, that I don't seem to be able to locate. So might_sleep is not appropriate for lightweight operations like __get_user, which people literally expect to be a single instruction. I also have a project going which handles very short packets by copying them into guest memory directly without waking up a thread. I do it by calling recvmsg on a socket, then switching to a thread if I get back EFAULT. Not yet clean enough to upstream but it does seem to cut the latency down quite a bit, so I'd like to have the option. Generally, a caller that does something like this under a spinlock: preempt_disable pagefault_disable error = copy_to_user pagefault_enable preempt_enable_no_resched is not doing anything wrong and should not get a warning, as long as error is handled correctly later. You can also find the discussion around the patches educational: http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928 > However > we have the inatomic variants of these function for this purpose. Does inatomic install fixups now? Last I checked, it didn't so it did not satisfy this purpose. See this comment from x86: * Copy data from kernel space to user space. Caller must check * the specified block with access_ok() before calling this function. * The caller should also make sure he pins the user space address * so that we don't result in page fault and sleep. Also - switching to inatomic would scatter if (atomic) all over code. It's much better to simply call the same function (e.g. recvmsg) and have it fail gracefully: after all, we have code to handle get/put user errors anyway. > The result of this change was that all guest access (that correctly uses > might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP is > enabled. We lost a mighty debugging feature for user access. What's the path you are trying to debug? If your code can faults, then it's safe to call from atomic context. If it can't, it must pin the page. You can not do access_ok checks and then assume access won't fault. > As I wasn't able to come up with any other reason why this should be > necessary, I suggest turning the might_sleep() checks on again in the > might_fault() code. Faults triggered with pagefault_disabled do not cause caller to sleep, merely to fail and get an error, so might_sleep is simply wrong. > > pagefault_disable/pagefault_enable seems to be used mainly for futex, perf > event > and kmap. > > Going over all changes since that commit, it seems like most code already > uses the > inatomic variants of copy_(to|from)_user. Code relying on (get|put)_user > during > pagefault_disable() don't make use of any might_fault() in their > (get|put)_user > implementation. Examples: > - arch/m68k/include/asm/futex.h > - arch/parisc/include/asm/futex.h > - arch/sh/include/asm/futex-irq.h > - arch/tile/include/asm/futex.h > So changing might_fault() back to trigger might_sleep() won't change a thing > for > them. Hope I haven't missed anything. Did you check the generated code? On x86 it seems to me this patchset is definitely going to slow things down, in fact putting back in a performance regression that Andi found. > I only identified one might_fault() that would get triggered by get_user() on > powerpc and fixed it by using the inatomic variant (not tested). I am not sure > if we need some non-sleeping access_ok() prior to __get_user_inatomic(). > > By looking at the code I was wondering where the correct place for > might_fault() > calls is? Doesn't seem to be very consistent. Examples: > >| asm-generic | arm | arm64 | frv | m32r | x86 and s390 > --- > get_user() | Yes | Yes | Yes | No | Yes | Yes > __get_user() | No| Yes | No| No | Yes | No > put_user() | Yes | Yes | Yes | No | Yes | Yes > __put_user() | No| Yes | No| No | Yes | No > copy_to_user() | Yes | No | No