Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Fri, Oct 19, 2018 at 07:29:45AM -0700, Andy Lutomirski wrote: > > On Oct 19, 2018, at 1:33 AM, Peter Zijlstra wrote: > > > >> On Fri, Oct 19, 2018 at 01:08:23AM +, Nadav Amit wrote: > >> Consider for example do_int3(), and see my inlined comments: > >> > >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) > >> { > >> ... > >> ist_enter(regs); // => preempt_disable() > >> cond_local_irq_enable(regs); // => assume it enables IRQs > >> > >> ... > >> // resched irq can be delivered here. It will not cause rescheduling > >> // since preemption is disabled > >> > >> cond_local_irq_disable(regs); // => assume it disables IRQs > >> ist_exit(regs); // => preempt_enable_no_resched() > >> } > >> > >> At this point resched will not happen for unbounded length of time (unless > >> there is another point when exiting the trap handler that checks if > >> preemption should take place). > >> > >> Another example is __BPF_PROG_RUN_ARRAY(), which also uses > >> preempt_enable_no_resched(). > >> > >> Am I missing something? > > > > Would not the interrupt return then check for TIF_NEED_RESCHED and call > > schedule()? > > The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s > fundamentally atomic — it’s running on a percpu stack and it can’t > schedule. In theory we could do some evil stack switching, but we > don’t. > > How does NMI handle this? If an NMI that hit interruptible kernel > code overflows a perf counter, how does the wake up work? NMIs should never set NEED_RESCHED. What perf does is self-IPI (irq_work) and do the wakeup from there.
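For illustration, a minimal sketch of the deferred-wakeup pattern Peter refers to: the NMI handler only queues an irq_work (a self-IPI), and the wakeup happens later in ordinary IRQ context, where the resulting NEED_RESCHED is picked up by the normal interrupt-return path. The my_* names are made up for this sketch and are not the actual perf code.

#include <linux/irq_work.h>
#include <linux/sched.h>

static struct irq_work my_wakeup_work;
static struct task_struct *my_waiter;	/* task to wake; set up elsewhere */

/* Runs in ordinary IRQ context after the self-IPI: waking a task (and thus
 * setting NEED_RESCHED) is safe here, and the IRQ return path reschedules. */
static void my_wakeup_fn(struct irq_work *work)
{
	wake_up_process(my_waiter);
}

/* Called from NMI context, e.g. on a counter overflow: no wakeups here,
 * just queue the deferred work. */
static void my_nmi_overflow_handler(void)
{
	irq_work_queue(&my_wakeup_work);
}

/* During setup: init_irq_work(&my_wakeup_work, my_wakeup_fn); */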
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Fri, 19 Oct 2018 04:44:33 + Nadav Amit wrote: > at 9:29 PM, Andy Lutomirski wrote: > > >> On Oct 18, 2018, at 6:08 PM, Nadav Amit wrote: > >> > >> at 10:00 AM, Andy Lutomirski wrote: > >> > On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: > > at 8:51 PM, Andy Lutomirski wrote: > > >> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: > >> at 6:22 PM, Andy Lutomirski wrote: > >> > On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: > > It is sometimes beneficial to prevent preemption for very few > instructions, or prevent preemption for some instructions that > precede > a branch (this latter case will be introduced in the next patches). > > To provide such functionality on x86-64, we use an empty REX-prefix > (opcode 0x40) as an indication that preemption is disabled for the > following instruction. > >>> > >>> Nifty! > >>> > >>> That being said, I think you have a few bugs. First, you can’t just > >>> ignore > >>> a rescheduling interrupt, as you introduce unbounded latency when this > >>> happens ― you’re effectively emulating preempt_enable_no_resched(), > >>> which > >>> is not a drop-in replacement for preempt_enable(). To fix this, you > >>> may > >>> need to jump to a slow-path trampoline that calls schedule() at the > >>> end or > >>> consider rewinding one instruction instead. Or use TF, which is only a > >>> little bit terrifying… > >> > >> Yes, I didn’t pay enough attention here. For my use-case, I think that > >> the > >> easiest solution would be to make synchronize_sched() ignore > >> preemptions > >> that happen while the prefix is detected. It would slightly change the > >> meaning of the prefix. > > So thinking about it further, rewinding the instruction seems the easiest > and most robust solution. I’ll do it. > > >>> You also aren’t accounting for the case where you get an exception > >>> that > >>> is, in turn, preempted. > >> > >> Hmm.. Can you give me an example for such an exception in my use-case? > >> I > >> cannot think of an exception that might be preempted (assuming #BP, #MC > >> cannot be preempted). > > > > Look for cond_local_irq_enable(). > > I looked at it. Yet, I still don’t see how exceptions might happen in my > use-case, but having said that - this can be fixed too. > >>> > >>> I’m not totally certain there’s a case that matters. But it’s worth > >>> checking > >> > >> I am still checking. But, I wanted to ask you whether the existing code is > >> correct, since it seems to me that others do the same mistake I did, unless > >> I don’t understand the code. > >> > >> Consider for example do_int3(), and see my inlined comments: > >> > >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) > >> { > >> ... > >> ist_enter(regs);// => preempt_disable() > >> cond_local_irq_enable(regs);// => assume it enables IRQs > >> > >> ... > >> // resched irq can be delivered here. It will not caused rescheduling > >> // since preemption is disabled > >> > >> cond_local_irq_disable(regs);// => assume it disables IRQs > >> ist_exit(regs);// => preempt_enable_no_resched() > >> } > >> > >> At this point resched will not happen for unbounded length of time (unless > >> there is another point when exiting the trap handler that checks if > >> preemption should take place). > > > > I think it's only a bug in the cases where someone uses extable to fix > > up an int3 (which would be nuts) or that we oops. But I should still > > fix it. 
In the normal case where int3 was in user code, we'll miss > > the reschedule in do_trap(), but we'll reschedule in > > prepare_exit_to_usermode() -> exit_to_usermode_loop(). > > Thanks for your quick response, and sorry for bothering instead of dealing > with it. Note that do_debug() does something similar to do_int3(). > > And then there is optimized_callback() that also uses > preempt_enable_no_resched(). I think the original use was correct, but then > a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized > kprobes”) removed the IRQ disabling, while leaving > preempt_enable_no_resched() . No? Ah, good catch! Indeed, we don't need to stick on no_resched anymore. Thanks! -- Masami Hiramatsu
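For reference, a minimal sketch of the pattern discussed above and the fix Masami acks; do_callback_work() is a placeholder for this sketch, not the actual optimized_callback() code.

#include <linux/preempt.h>

static void do_callback_work(void);	/* placeholder for the real work */

/* Before commit a19b2e3d7839 this region also ran with IRQs disabled, so the
 * missing resched check was harmless; with IRQs enabled it silently drops a
 * preemption point. */
static void callback_old(void)
{
	preempt_disable();
	do_callback_work();
	preempt_enable_no_resched();	/* count -> 0, NEED_RESCHED not acted on */
}

/* The fix: a full preempt_enable(), which re-checks NEED_RESCHED and may
 * call into the scheduler. */
static void callback_fixed(void)
{
	preempt_disable();
	do_callback_work();
	preempt_enable();
}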
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Fri, Oct 19, 2018 at 1:22 AM Peter Zijlstra wrote: > > On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote: > > > > > > > > > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > > > > preempt_enable_no_resched(). > > > > > > Alexei, I think this code is just wrong. > > > > why 'just wrong' ? > > Because you lost a preemption point, this is a no-no. > > > > > > Do you know why it uses > > > preempt_enable_no_resched()? > > > > dont recall precisely. > > we could be preemptable at the point where macro is called. > > I think the goal of no_resched was to avoid adding scheduling points > > where they didn't exist before just because a prog ran for few nsec. > > May be Daniel or Roman remember. > > No, you did the exact opposite, where there previously was a preemption, > you just ate it. The band saw didn't get stopped in time, you loose your > hand etc.. Let me do few experiments then. We will fix it up.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
> On Oct 19, 2018, at 1:33 AM, Peter Zijlstra wrote: > >> On Fri, Oct 19, 2018 at 01:08:23AM +, Nadav Amit wrote: >> Consider for example do_int3(), and see my inlined comments: >> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) >> { >>... >>ist_enter(regs);// => preempt_disable() >>cond_local_irq_enable(regs);// => assume it enables IRQs >> >>... >>// resched irq can be delivered here. It will not caused rescheduling >>// since preemption is disabled >> >>cond_local_irq_disable(regs);// => assume it disables IRQs >>ist_exit(regs);// => preempt_enable_no_resched() >> } >> >> At this point resched will not happen for unbounded length of time (unless >> there is another point when exiting the trap handler that checks if >> preemption should take place). >> >> Another example is __BPF_PROG_RUN_ARRAY(), which also uses >> preempt_enable_no_resched(). >> >> Am I missing something? > > Would not the interrupt return then check for TIF_NEED_RESCHED and call > schedule() ? The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s fundamentally atomic — it’s running on a percpu stack and it can’t schedule. In theory we could do some evil stack switching, but we don’t. How does NMI handle this? If an NMI that hit interruptible kernel code overflows a perf counter, how does the wake up work? (do_int3() is special because it’s not actually IST. But it can hit in odd places due to kprobes, and I’m nervous about recursing incorrectly into RCU and context tracking code if we were to use exception_enter().) > > I think (and this certainly wants a comment) is that the ist_exit() > thing hard relies on the interrupt-return path doing the reschedule.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On 10/18, Andy Lutomirski wrote: > > Oleg, the code in kernel/signal.c: > > preempt_disable(); > read_unlock(&tasklist_lock); > preempt_enable_no_resched(); > freezable_schedule(); > > looks bogus. I don't get what it's trying to achieve with > preempt_disable(), and I also don't see why no_resched does anything. > Sure, it prevents a reschedule triggered during read_unlock() from > causing a reschedule, Yes. Let's suppose we remove preempt_disable/enable. The debugger was already woken up; if it runs on the same CPU, quite possibly it will preempt the tracee. After that the debugger will spin in wait_task_inactive(), until it is in turn preempted or calls schedule_timeout(1), so that the tracee (current) can finally call __schedule(preempt = F) and call deactivate_task() to become inactive. > but it doesn't prevent an interrupt immediately > after the preempt_enable_no_resched() call from scheduling. Yes, but this is less likely. Oleg.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Fri, Oct 19, 2018 at 01:08:23AM +, Nadav Amit wrote: > Consider for example do_int3(), and see my inlined comments: > > dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) > { > ... > ist_enter(regs); // => preempt_disable() > cond_local_irq_enable(regs); // => assume it enables IRQs > > ... > // resched irq can be delivered here. It will not cause rescheduling > // since preemption is disabled > > cond_local_irq_disable(regs); // => assume it disables IRQs > ist_exit(regs); // => preempt_enable_no_resched() > } > > At this point resched will not happen for unbounded length of time (unless > there is another point when exiting the trap handler that checks if > preemption should take place). > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > preempt_enable_no_resched(). > > Am I missing something? Would not the interrupt return then check for TIF_NEED_RESCHED and call schedule()? I think (and this certainly wants a comment) that the ist_exit() thing hard relies on the interrupt-return path doing the reschedule.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote: > > > > > > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > > > preempt_enable_no_resched(). > > > > Alexei, I think this code is just wrong. > > why 'just wrong' ? Because you lost a preemption point, this is a no-no. > > > Do you know why it uses > > preempt_enable_no_resched()? > > dont recall precisely. > we could be preemptable at the point where macro is called. > I think the goal of no_resched was to avoid adding scheduling points > where they didn't exist before just because a prog ran for few nsec. > May be Daniel or Roman remember. No, you did the exact opposite: where there previously was a preemption, you just ate it. The band saw didn't get stopped in time, you lose your hand, etc.
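For illustration, a simplified sketch of the shape being discussed; this is not the real __BPF_PROG_RUN_ARRAY() macro from include/linux/bpf.h, and run_progs() is a placeholder for the per-program loop.

#include <linux/preempt.h>
#include <linux/rcupdate.h>

static u32 run_progs(void);	/* placeholder for the prog-array loop */

static u32 run_prog_array_sketch(void)
{
	u32 ret;

	preempt_disable();
	rcu_read_lock();
	ret = run_progs();	/* a wakeup may set NEED_RESCHED meanwhile */
	rcu_read_unlock();
	/* Drops the count to 0 without rescheduling: the preemption point
	 * that existed before the progs were wrapped here is lost.  Ending
	 * with preempt_enable() instead restores it. */
	preempt_enable_no_resched();
	return ret;
}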
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Thu, Oct 18, 2018 at 09:29:39PM -0700, Andy Lutomirski wrote: > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > > preempt_enable_no_resched(). > > Alexei, I think this code is just wrong. Do you know why it uses > preempt_enable_no_resched()? Yes, that's a straight up bug. It looks like I need to go fix up abuse again :/ > Oleg, the code in kernel/signal.c: > > preempt_disable(); > read_unlock(&tasklist_lock); > preempt_enable_no_resched(); > freezable_schedule(); > The purpose here is to avoid back-to-back schedule() calls, and this pattern is one of the few correct uses of preempt_enable_no_resched(). Suppose we got a preemption while holding the read_lock(), then when we'd do read_unlock(), we'd drop preempt_count to 0 and reschedule, then when we get back we instantly call into schedule _again_. What this code does, is it increments preempt_count such that read_unlock() doesn't hit 0 and doesn't call schedule, then we lower it to 0 without a call to schedule() and then call schedule() explicitly.
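Annotated for clarity (the comments here are added for this summary, they are not the in-tree ones), the kernel/signal.c pattern Peter describes works out to:

	preempt_disable();		/* raise the count so the unlock below cannot schedule */
	read_unlock(&tasklist_lock);	/* a wakeup may set NEED_RESCHED meanwhile */
	preempt_enable_no_resched();	/* count back to 0, but no schedule() call here */
	freezable_schedule();		/* the single, explicit schedule() */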
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
> > > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > > preempt_enable_no_resched(). > > Alexei, I think this code is just wrong. Why 'just wrong'? > Do you know why it uses > preempt_enable_no_resched()? I don't recall precisely. We could be preemptible at the point where the macro is called. I think the goal of no_resched was to avoid adding scheduling points where they didn't exist before just because a prog ran for a few nsec. Maybe Daniel or Roman remember.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 9:29 PM, Andy Lutomirski wrote: >> On Oct 18, 2018, at 6:08 PM, Nadav Amit wrote: >> >> at 10:00 AM, Andy Lutomirski wrote: >> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: at 8:51 PM, Andy Lutomirski wrote: >> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: >> at 6:22 PM, Andy Lutomirski wrote: >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: It is sometimes beneficial to prevent preemption for very few instructions, or prevent preemption for some instructions that precede a branch (this latter case will be introduced in the next patches). To provide such functionality on x86-64, we use an empty REX-prefix (opcode 0x40) as an indication that preemption is disabled for the following instruction. >>> >>> Nifty! >>> >>> That being said, I think you have a few bugs. First, you can’t just >>> ignore >>> a rescheduling interrupt, as you introduce unbounded latency when this >>> happens — you’re effectively emulating preempt_enable_no_resched(), >>> which >>> is not a drop-in replacement for preempt_enable(). To fix this, you may >>> need to jump to a slow-path trampoline that calls schedule() at the end >>> or >>> consider rewinding one instruction instead. Or use TF, which is only a >>> little bit terrifying… >> >> Yes, I didn’t pay enough attention here. For my use-case, I think that >> the >> easiest solution would be to make synchronize_sched() ignore preemptions >> that happen while the prefix is detected. It would slightly change the >> meaning of the prefix. So thinking about it further, rewinding the instruction seems the easiest and most robust solution. I’ll do it. >>> You also aren’t accounting for the case where you get an exception that >>> is, in turn, preempted. >> >> Hmm.. Can you give me an example for such an exception in my use-case? I >> cannot think of an exception that might be preempted (assuming #BP, #MC >> cannot be preempted). > > Look for cond_local_irq_enable(). I looked at it. Yet, I still don’t see how exceptions might happen in my use-case, but having said that - this can be fixed too. >>> >>> I’m not totally certain there’s a case that matters. But it’s worth >>> checking >> >> I am still checking. But, I wanted to ask you whether the existing code is >> correct, since it seems to me that others do the same mistake I did, unless >> I don’t understand the code. >> >> Consider for example do_int3(), and see my inlined comments: >> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) >> { >> ... >> ist_enter(regs);// => preempt_disable() >> cond_local_irq_enable(regs);// => assume it enables IRQs >> >> ... >> // resched irq can be delivered here. It will not caused rescheduling >> // since preemption is disabled >> >> cond_local_irq_disable(regs);// => assume it disables IRQs >> ist_exit(regs);// => preempt_enable_no_resched() >> } >> >> At this point resched will not happen for unbounded length of time (unless >> there is another point when exiting the trap handler that checks if >> preemption should take place). > > I think it's only a bug in the cases where someone uses extable to fix > up an int3 (which would be nuts) or that we oops. But I should still > fix it. In the normal case where int3 was in user code, we'll miss > the reschedule in do_trap(), but we'll reschedule in > prepare_exit_to_usermode() -> exit_to_usermode_loop(). Thanks for your quick response, and sorry for bothering instead of dealing with it. Note that do_debug() does something similar to do_int3(). 
And then there is optimized_callback() that also uses preempt_enable_no_resched(). I think the original use was correct, but then a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized kprobes") removed the IRQ disabling, while leaving preempt_enable_no_resched(). No?
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
> On Oct 18, 2018, at 6:08 PM, Nadav Amit wrote: > > at 10:00 AM, Andy Lutomirski wrote: > >> >> >>> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: >>> >>> at 8:51 PM, Andy Lutomirski wrote: >>> > On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: > at 6:22 PM, Andy Lutomirski wrote: > >>> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >>> >>> It is sometimes beneficial to prevent preemption for very few >>> instructions, or prevent preemption for some instructions that precede >>> a branch (this latter case will be introduced in the next patches). >>> >>> To provide such functionality on x86-64, we use an empty REX-prefix >>> (opcode 0x40) as an indication that preemption is disabled for the >>> following instruction. >> >> Nifty! >> >> That being said, I think you have a few bugs. First, you can’t just >> ignore >> a rescheduling interrupt, as you introduce unbounded latency when this >> happens — you’re effectively emulating preempt_enable_no_resched(), which >> is not a drop-in replacement for preempt_enable(). To fix this, you may >> need to jump to a slow-path trampoline that calls schedule() at the end >> or >> consider rewinding one instruction instead. Or use TF, which is only a >> little bit terrifying… > > Yes, I didn’t pay enough attention here. For my use-case, I think that the > easiest solution would be to make synchronize_sched() ignore preemptions > that happen while the prefix is detected. It would slightly change the > meaning of the prefix. >>> >>> So thinking about it further, rewinding the instruction seems the easiest >>> and most robust solution. I’ll do it. >>> >> You also aren’t accounting for the case where you get an exception that >> is, in turn, preempted. > > Hmm.. Can you give me an example for such an exception in my use-case? I > cannot think of an exception that might be preempted (assuming #BP, #MC > cannot be preempted). Look for cond_local_irq_enable(). >>> >>> I looked at it. Yet, I still don’t see how exceptions might happen in my >>> use-case, but having said that - this can be fixed too. >> >> I’m not totally certain there’s a case that matters. But it’s worth checking > > I am still checking. But, I wanted to ask you whether the existing code is > correct, since it seems to me that others do the same mistake I did, unless > I don’t understand the code. > > Consider for example do_int3(), and see my inlined comments: > > dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) > { >... >ist_enter(regs);// => preempt_disable() >cond_local_irq_enable(regs);// => assume it enables IRQs > >... >// resched irq can be delivered here. It will not caused rescheduling >// since preemption is disabled > >cond_local_irq_disable(regs);// => assume it disables IRQs >ist_exit(regs);// => preempt_enable_no_resched() > } > > At this point resched will not happen for unbounded length of time (unless > there is another point when exiting the trap handler that checks if > preemption should take place). I think it's only a bug in the cases where someone uses extable to fix up an int3 (which would be nuts) or that we oops. But I should still fix it. In the normal case where int3 was in user code, we'll miss the reschedule in do_trap(), but we'll reschedule in prepare_exit_to_usermode() -> exit_to_usermode_loop(). > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > preempt_enable_no_resched(). Alexei, I think this code is just wrong. Do you know why it uses preempt_enable_no_resched()? 
Oleg, the code in kernel/signal.c: preempt_disable(); read_unlock(&tasklist_lock); preempt_enable_no_resched(); freezable_schedule(); looks bogus. I don't get what it's trying to achieve with preempt_disable(), and I also don't see why no_resched does anything. Sure, it prevents a reschedule triggered during read_unlock() from causing a reschedule, but it doesn't prevent an interrupt immediately after the preempt_enable_no_resched() call from scheduling. --Andy > > Am I missing something? > > Thanks, > Nadav
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 10:00 AM, Andy Lutomirski wrote: > > >> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: >> >> at 8:51 PM, Andy Lutomirski wrote: >> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: at 6:22 PM, Andy Lutomirski wrote: >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >> >> It is sometimes beneficial to prevent preemption for very few >> instructions, or prevent preemption for some instructions that precede >> a branch (this latter case will be introduced in the next patches). >> >> To provide such functionality on x86-64, we use an empty REX-prefix >> (opcode 0x40) as an indication that preemption is disabled for the >> following instruction. > > Nifty! > > That being said, I think you have a few bugs. First, you can’t just ignore > a rescheduling interrupt, as you introduce unbounded latency when this > happens — you’re effectively emulating preempt_enable_no_resched(), which > is not a drop-in replacement for preempt_enable(). To fix this, you may > need to jump to a slow-path trampoline that calls schedule() at the end or > consider rewinding one instruction instead. Or use TF, which is only a > little bit terrifying… Yes, I didn’t pay enough attention here. For my use-case, I think that the easiest solution would be to make synchronize_sched() ignore preemptions that happen while the prefix is detected. It would slightly change the meaning of the prefix. >> >> So thinking about it further, rewinding the instruction seems the easiest >> and most robust solution. I’ll do it. >> > You also aren’t accounting for the case where you get an exception that > is, in turn, preempted. Hmm.. Can you give me an example for such an exception in my use-case? I cannot think of an exception that might be preempted (assuming #BP, #MC cannot be preempted). >>> >>> Look for cond_local_irq_enable(). >> >> I looked at it. Yet, I still don’t see how exceptions might happen in my >> use-case, but having said that - this can be fixed too. > > I’m not totally certain there’s a case that matters. But it’s worth checking I am still checking. But, I wanted to ask you whether the existing code is correct, since it seems to me that others do the same mistake I did, unless I don’t understand the code. Consider for example do_int3(), and see my inlined comments: dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) { ... ist_enter(regs);// => preempt_disable() cond_local_irq_enable(regs);// => assume it enables IRQs ... // resched irq can be delivered here. It will not caused rescheduling // since preemption is disabled cond_local_irq_disable(regs); // => assume it disables IRQs ist_exit(regs); // => preempt_enable_no_resched() } At this point resched will not happen for unbounded length of time (unless there is another point when exiting the trap handler that checks if preemption should take place). Another example is __BPF_PROG_RUN_ARRAY(), which also uses preempt_enable_no_resched(). Am I missing something? Thanks, Nadav
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 12:54 AM, Peter Zijlstra wrote: > On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote: >>> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >>> >>> It is sometimes beneficial to prevent preemption for very few >>> instructions, or prevent preemption for some instructions that precede >>> a branch (this latter case will be introduced in the next patches). >>> >>> To provide such functionality on x86-64, we use an empty REX-prefix >>> (opcode 0x40) as an indication that preemption is disabled for the >>> following instruction. >> >> Nifty! >> >> That being said, I think you have a few bugs. > >> First, you can’t just ignore a rescheduling interrupt, as you >> introduce unbounded latency when this happens — you’re effectively >> emulating preempt_enable_no_resched(), which is not a drop-in >> replacement for preempt_enable(). > >> To fix this, you may need to jump to a slow-path trampoline that calls >> schedule() at the end or consider rewinding one instruction instead. >> Or use TF, which is only a little bit terrifying... > > At which point we're very close to in-kernel rseq. Interesting. I didn’t know about this feature. I’ll see if I can draw some ideas from there.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 10:29 AM, Andy Lutomirski wrote: > On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit wrote: >> at 10:00 AM, Andy Lutomirski wrote: >> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: at 8:51 PM, Andy Lutomirski wrote: >> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: >> at 6:22 PM, Andy Lutomirski wrote: >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: It is sometimes beneficial to prevent preemption for very few instructions, or prevent preemption for some instructions that precede a branch (this latter case will be introduced in the next patches). To provide such functionality on x86-64, we use an empty REX-prefix (opcode 0x40) as an indication that preemption is disabled for the following instruction. >>> >>> Nifty! >>> >>> That being said, I think you have a few bugs. First, you can’t just >>> ignore >>> a rescheduling interrupt, as you introduce unbounded latency when this >>> happens — you’re effectively emulating preempt_enable_no_resched(), >>> which >>> is not a drop-in replacement for preempt_enable(). To fix this, you may >>> need to jump to a slow-path trampoline that calls schedule() at the end >>> or >>> consider rewinding one instruction instead. Or use TF, which is only a >>> little bit terrifying… >> >> Yes, I didn’t pay enough attention here. For my use-case, I think that >> the >> easiest solution would be to make synchronize_sched() ignore preemptions >> that happen while the prefix is detected. It would slightly change the >> meaning of the prefix. So thinking about it further, rewinding the instruction seems the easiest and most robust solution. I’ll do it. >>> You also aren’t accounting for the case where you get an exception that >>> is, in turn, preempted. >> >> Hmm.. Can you give me an example for such an exception in my use-case? I >> cannot think of an exception that might be preempted (assuming #BP, #MC >> cannot be preempted). > > Look for cond_local_irq_enable(). I looked at it. Yet, I still don’t see how exceptions might happen in my use-case, but having said that - this can be fixed too. >>> >>> I’m not totally certain there’s a case that matters. But it’s worth >>> checking >>> To be frank, I paid relatively little attention to this subject. Any feedback about the other parts and especially on the high-level approach? Is modifying the retpolines in the proposed manner (assembly macros) acceptable? >>> >>> It’s certainly a neat idea, and it could be a real speedup. >> >> Great. So I’ll try to shape things up, and I still wait for other comments >> (from others). >> >> I’ll just mention two more patches I need to cleanup (I know I still owe you >> some >> work, so obviously it will be done later): >> >> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17 >> BPF filters on the Redis server process that are invoked on each >> system-call. Invoking each one requires an indirect branch. The patch keeps >> a per-process kernel code-page that holds trampolines for these functions. > > I wonder how many levels of branches are needed before the branches > involved exceed the retpoline cost. In this case there is no hierarchy, but a list of trampolines that are called one after the other, as the seccomp filter order is predefined. It does not work if different threads of the same process have different filters. >> 2. Binary-search for system-calls. Use the per-process kernel code-page also >> to hold multiple trampolines for the 16 common system calls of a certain >> process. 
The patch uses an indirection table and a binary-search to find the >> proper trampoline. > > Same comment applies here. Branch misprediction wastes ~7 cycles and a retpoline takes at least 30. So assuming the branch predictor is not completely stupid, 3-4 levels should not be too much.
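To make the arithmetic concrete, a toy sketch of the kind of generated dispatch being described, shown here for 4 syscall numbers rather than 16; the NR_* and handle_*() names are placeholders, not anything from the actual patch.

#include <linux/errno.h>
#include <asm/ptrace.h>

/* A fixed tree of compares and *direct* calls replaces one retpolined
 * indirect call.  For 16 sorted entries the tree is about 4 compares deep;
 * even if every level mispredicts at ~7 cycles that is ~28 cycles, still
 * below a single retpoline at 30+, and normally most levels predict. */
enum { NR_A = 0, NR_B = 1, NR_C = 14, NR_D = 23 };	/* sorted, placeholder */

static long handle_a(struct pt_regs *regs);
static long handle_b(struct pt_regs *regs);
static long handle_c(struct pt_regs *regs);
static long handle_d(struct pt_regs *regs);

static long dispatch_common_syscalls(long nr, struct pt_regs *regs)
{
	if (nr < NR_C) {
		if (nr == NR_A)
			return handle_a(regs);
		if (nr == NR_B)
			return handle_b(regs);
	} else {
		if (nr == NR_C)
			return handle_c(regs);
		if (nr == NR_D)
			return handle_d(regs);
	}
	return -ENOSYS;	/* fall back to the generic, retpolined path */
}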
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit wrote: > > at 10:00 AM, Andy Lutomirski wrote: > > > > > > >> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: > >> > >> at 8:51 PM, Andy Lutomirski wrote: > >> > On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: > at 6:22 PM, Andy Lutomirski wrote: > > >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: > >> > >> It is sometimes beneficial to prevent preemption for very few > >> instructions, or prevent preemption for some instructions that precede > >> a branch (this latter case will be introduced in the next patches). > >> > >> To provide such functionality on x86-64, we use an empty REX-prefix > >> (opcode 0x40) as an indication that preemption is disabled for the > >> following instruction. > > > > Nifty! > > > > That being said, I think you have a few bugs. First, you can’t just > > ignore > > a rescheduling interrupt, as you introduce unbounded latency when this > > happens — you’re effectively emulating preempt_enable_no_resched(), > > which > > is not a drop-in replacement for preempt_enable(). To fix this, you may > > need to jump to a slow-path trampoline that calls schedule() at the end > > or > > consider rewinding one instruction instead. Or use TF, which is only a > > little bit terrifying… > > Yes, I didn’t pay enough attention here. For my use-case, I think that > the > easiest solution would be to make synchronize_sched() ignore preemptions > that happen while the prefix is detected. It would slightly change the > meaning of the prefix. > >> > >> So thinking about it further, rewinding the instruction seems the easiest > >> and most robust solution. I’ll do it. > >> > > You also aren’t accounting for the case where you get an exception that > > is, in turn, preempted. > > Hmm.. Can you give me an example for such an exception in my use-case? I > cannot think of an exception that might be preempted (assuming #BP, #MC > cannot be preempted). > >>> > >>> Look for cond_local_irq_enable(). > >> > >> I looked at it. Yet, I still don’t see how exceptions might happen in my > >> use-case, but having said that - this can be fixed too. > > > > I’m not totally certain there’s a case that matters. But it’s worth > > checking > > > >> To be frank, I paid relatively little attention to this subject. Any > >> feedback about the other parts and especially on the high-level approach? > >> Is > >> modifying the retpolines in the proposed manner (assembly macros) > >> acceptable? > > > > It’s certainly a neat idea, and it could be a real speedup. > > Great. So I’ll try to shape things up, and I still wait for other comments > (from others). > > I’ll just mention two more patches I need to cleanup (I know I still owe you > some > work, so obviously it will be done later): > > 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17 > BPF filters on the Redis server process that are invoked on each > system-call. Invoking each one requires an indirect branch. The patch keeps > a per-process kernel code-page that holds trampolines for these functions. I wonder how many levels of branches are needed before the branches involved exceed the retpoline cost. > > 2. Binary-search for system-calls. Use the per-process kernel code-page also > to hold multiple trampolines for the 16 common system calls of a certain > process. The patch uses an indirection table and a binary-search to find the > proper trampoline. Same comment applies here. > > Thanks again, > Nadav -- Andy Lutomirski AMA Capital Management, LLC
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 10:00 AM, Andy Lutomirski wrote: > > >> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: >> >> at 8:51 PM, Andy Lutomirski wrote: >> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: at 6:22 PM, Andy Lutomirski wrote: >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >> >> It is sometimes beneficial to prevent preemption for very few >> instructions, or prevent preemption for some instructions that precede >> a branch (this latter case will be introduced in the next patches). >> >> To provide such functionality on x86-64, we use an empty REX-prefix >> (opcode 0x40) as an indication that preemption is disabled for the >> following instruction. > > Nifty! > > That being said, I think you have a few bugs. First, you can’t just ignore > a rescheduling interrupt, as you introduce unbounded latency when this > happens — you’re effectively emulating preempt_enable_no_resched(), which > is not a drop-in replacement for preempt_enable(). To fix this, you may > need to jump to a slow-path trampoline that calls schedule() at the end or > consider rewinding one instruction instead. Or use TF, which is only a > little bit terrifying… Yes, I didn’t pay enough attention here. For my use-case, I think that the easiest solution would be to make synchronize_sched() ignore preemptions that happen while the prefix is detected. It would slightly change the meaning of the prefix. >> >> So thinking about it further, rewinding the instruction seems the easiest >> and most robust solution. I’ll do it. >> > You also aren’t accounting for the case where you get an exception that > is, in turn, preempted. Hmm.. Can you give me an example for such an exception in my use-case? I cannot think of an exception that might be preempted (assuming #BP, #MC cannot be preempted). >>> >>> Look for cond_local_irq_enable(). >> >> I looked at it. Yet, I still don’t see how exceptions might happen in my >> use-case, but having said that - this can be fixed too. > > I’m not totally certain there’s a case that matters. But it’s worth checking > >> To be frank, I paid relatively little attention to this subject. Any >> feedback about the other parts and especially on the high-level approach? Is >> modifying the retpolines in the proposed manner (assembly macros) >> acceptable? > > It’s certainly a neat idea, and it could be a real speedup. Great. So I’ll try to shape things up, and I still wait for other comments (from others). I’ll just mention two more patches I need to cleanup (I know I still owe you some work, so obviously it will be done later): 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17 BPF filters on the Redis server process that are invoked on each system-call. Invoking each one requires an indirect branch. The patch keeps a per-process kernel code-page that holds trampolines for these functions. 2. Binary-search for system-calls. Use the per-process kernel code-page also to hold multiple trampolines for the 16 common system calls of a certain process. The patch uses an indirection table and a binary-search to find the proper trampoline. Thanks again, Nadav
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: > > at 8:51 PM, Andy Lutomirski wrote: > >>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: >>> at 6:22 PM, Andy Lutomirski wrote: >>> > On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: > > It is sometimes beneficial to prevent preemption for very few > instructions, or prevent preemption for some instructions that precede > a branch (this latter case will be introduced in the next patches). > > To provide such functionality on x86-64, we use an empty REX-prefix > (opcode 0x40) as an indication that preemption is disabled for the > following instruction. Nifty! That being said, I think you have a few bugs. First, you can’t just ignore a rescheduling interrupt, as you introduce unbounded latency when this happens — you’re effectively emulating preempt_enable_no_resched(), which is not a drop-in replacement for preempt_enable(). To fix this, you may need to jump to a slow-path trampoline that calls schedule() at the end or consider rewinding one instruction instead. Or use TF, which is only a little bit terrifying… >>> >>> Yes, I didn’t pay enough attention here. For my use-case, I think that the >>> easiest solution would be to make synchronize_sched() ignore preemptions >>> that happen while the prefix is detected. It would slightly change the >>> meaning of the prefix. > > So thinking about it further, rewinding the instruction seems the easiest > and most robust solution. I’ll do it. > You also aren’t accounting for the case where you get an exception that is, in turn, preempted. >>> >>> Hmm.. Can you give me an example for such an exception in my use-case? I >>> cannot think of an exception that might be preempted (assuming #BP, #MC >>> cannot be preempted). >> >> Look for cond_local_irq_enable(). > > I looked at it. Yet, I still don’t see how exceptions might happen in my > use-case, but having said that - this can be fixed too. I’m not totally certain there’s a case that matters. But it’s worth checking > > To be frank, I paid relatively little attention to this subject. Any > feedback about the other parts and especially on the high-level approach? Is > modifying the retpolines in the proposed manner (assembly macros) > acceptable? > It’s certainly a neat idea, and it could be a real speedup. > Thanks, > Nadav
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 8:51 PM, Andy Lutomirski wrote: > On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: >> at 6:22 PM, Andy Lutomirski wrote: >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: It is sometimes beneficial to prevent preemption for very few instructions, or prevent preemption for some instructions that precede a branch (this latter case will be introduced in the next patches). To provide such functionality on x86-64, we use an empty REX-prefix (opcode 0x40) as an indication that preemption is disabled for the following instruction. >>> >>> Nifty! >>> >>> That being said, I think you have a few bugs. First, you can’t just ignore >>> a rescheduling interrupt, as you introduce unbounded latency when this >>> happens — you’re effectively emulating preempt_enable_no_resched(), which >>> is not a drop-in replacement for preempt_enable(). To fix this, you may >>> need to jump to a slow-path trampoline that calls schedule() at the end or >>> consider rewinding one instruction instead. Or use TF, which is only a >>> little bit terrifying… >> >> Yes, I didn’t pay enough attention here. For my use-case, I think that the >> easiest solution would be to make synchronize_sched() ignore preemptions >> that happen while the prefix is detected. It would slightly change the >> meaning of the prefix. So thinking about it further, rewinding the instruction seems the easiest and most robust solution. I’ll do it. >>> You also aren’t accounting for the case where you get an exception that >>> is, in turn, preempted. >> >> Hmm.. Can you give me an example for such an exception in my use-case? I >> cannot think of an exception that might be preempted (assuming #BP, #MC >> cannot be preempted). > > Look for cond_local_irq_enable(). I looked at it. Yet, I still don’t see how exceptions might happen in my use-case, but having said that - this can be fixed too. To be frank, I paid relatively little attention to this subject. Any feedback about the other parts and especially on the high-level approach? Is modifying the retpolines in the proposed manner (assembly macros) acceptable? Thanks, Nadav
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote: > > > On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: > > > > It is sometimes beneficial to prevent preemption for very few > > instructions, or prevent preemption for some instructions that precede > > a branch (this latter case will be introduced in the next patches). > > > > To provide such functionality on x86-64, we use an empty REX-prefix > > (opcode 0x40) as an indication that preemption is disabled for the > > following instruction. > > Nifty! > > That being said, I think you have a few bugs. > First, you can’t just ignore a rescheduling interrupt, as you > introduce unbounded latency when this happens — you’re effectively > emulating preempt_enable_no_resched(), which is not a drop-in > replacement for preempt_enable(). > To fix this, you may need to jump to a slow-path trampoline that calls > schedule() at the end or consider rewinding one instruction instead. > Or use TF, which is only a little bit terrifying... At which point we're very close to in-kernel rseq.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: > > at 6:22 PM, Andy Lutomirski wrote: > > > > >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: > >> > >> It is sometimes beneficial to prevent preemption for very few > >> instructions, or prevent preemption for some instructions that precede > >> a branch (this latter case will be introduced in the next patches). > >> > >> To provide such functionality on x86-64, we use an empty REX-prefix > >> (opcode 0x40) as an indication that preemption is disabled for the > >> following instruction. > > > > Nifty! > > > > That being said, I think you have a few bugs. First, you can’t just ignore > > a rescheduling interrupt, as you introduce unbounded latency when this > > happens — you’re effectively emulating preempt_enable_no_resched(), which > > is not a drop-in replacement for preempt_enable(). To fix this, you may > > need to jump to a slow-path trampoline that calls schedule() at the end or > > consider rewinding one instruction instead. Or use TF, which is only a > > little bit terrifying… > > Yes, I didn’t pay enough attention here. For my use-case, I think that the > easiest solution would be to make synchronize_sched() ignore preemptions > that happen while the prefix is detected. It would slightly change the > meaning of the prefix. > > > You also aren’t accounting for the case where you get an exception that > > is, in turn, preempted. > > Hmm.. Can you give me an example for such an exception in my use-case? I > cannot think of an exception that might be preempted (assuming #BP, #MC > cannot be preempted). > Look for cond_local_irq_enable(). --Andy
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 8:11 PM, Nadav Amit wrote: > at 6:22 PM, Andy Lutomirski wrote: > >>> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >>> >>> It is sometimes beneficial to prevent preemption for very few >>> instructions, or prevent preemption for some instructions that precede >>> a branch (this latter case will be introduced in the next patches). >>> >>> To provide such functionality on x86-64, we use an empty REX-prefix >>> (opcode 0x40) as an indication that preemption is disabled for the >>> following instruction. >> >> Nifty! >> >> That being said, I think you have a few bugs. First, you can’t just ignore >> a rescheduling interrupt, as you introduce unbounded latency when this >> happens — you’re effectively emulating preempt_enable_no_resched(), which >> is not a drop-in replacement for preempt_enable(). To fix this, you may >> need to jump to a slow-path trampoline that calls schedule() at the end or >> consider rewinding one instruction instead. Or use TF, which is only a >> little bit terrifying… > > Yes, I didn’t pay enough attention here. For my use-case, I think that the > easiest solution would be to make synchronize_sched() ignore preemptions > that happen while the prefix is detected. It would slightly change the > meaning of the prefix. Ignore this nonsense that I wrote. I’ll try to come up with a decent solution. >> You also aren’t accounting for the case where you get an exception that >> is, in turn, preempted. > > Hmm.. Can you give me an example for such an exception in my use-case? I > cannot think of an exception that might be preempted (assuming #BP, #MC > cannot be preempted). > > I agree that for super-general case this might be inappropriate.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
at 6:22 PM, Andy Lutomirski wrote: > >> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >> >> It is sometimes beneficial to prevent preemption for very few >> instructions, or prevent preemption for some instructions that precede >> a branch (this latter case will be introduced in the next patches). >> >> To provide such functionality on x86-64, we use an empty REX-prefix >> (opcode 0x40) as an indication that preemption is disabled for the >> following instruction. > > Nifty! > > That being said, I think you have a few bugs. First, you can’t just ignore > a rescheduling interrupt, as you introduce unbounded latency when this > happens — you’re effectively emulating preempt_enable_no_resched(), which > is not a drop-in replacement for preempt_enable(). To fix this, you may > need to jump to a slow-path trampoline that calls schedule() at the end or > consider rewinding one instruction instead. Or use TF, which is only a > little bit terrifying… Yes, I didn’t pay enough attention here. For my use-case, I think that the easiest solution would be to make synchronize_sched() ignore preemptions that happen while the prefix is detected. It would slightly change the meaning of the prefix. > You also aren’t accounting for the case where you get an exception that > is, in turn, preempted. Hmm.. Can you give me an example for such an exception in my use-case? I cannot think of an exception that might be preempted (assuming #BP, #MC cannot be preempted). I agree that for super-general case this might be inappropriate.
Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: > > It is sometimes beneficial to prevent preemption for very few > instructions, or prevent preemption for some instructions that precede > a branch (this latter case will be introduced in the next patches). > > To provide such functionality on x86-64, we use an empty REX-prefix > (opcode 0x40) as an indication that preemption is disabled for the > following instruction. Nifty! That being said, I think you have a few bugs. First, you can’t just ignore a rescheduling interrupt, as you introduce unbounded latency when this happens — you’re effectively emulating preempt_enable_no_resched(), which is not a drop-in replacement for preempt_enable(). To fix this, you may need to jump to a slow-path trampoline that calls schedule() at the end or consider rewinding one instruction instead. Or use TF, which is only a little bit terrifying... You also aren’t accounting for the case where you get an exception that is, in turn, preempted. > > It is expected that this opcode is not in common use. > > Signed-off-by: Nadav Amit > --- > arch/x86/entry/entry_64.S| 10 ++ > arch/x86/include/asm/nospec-branch.h | 12 > 2 files changed, 22 insertions(+) > > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S > index cb8a5893fd33..31d59aad496e 100644 > --- a/arch/x86/entry/entry_64.S > +++ b/arch/x86/entry/entry_64.S > @@ -643,6 +643,16 @@ retint_kernel: >jnc1f > 0:cmpl$0, PER_CPU_VAR(__preempt_count) >jnz1f > + > +/* > + * Allow to use hint to prevent preemption on a certain instruction. > + * Consider an instruction with the first byte having REX prefix > + * without any bits set as an indication for preemption disabled. > + */ > +movqRIP(%rsp), %rax > +cmpb$PREEMPT_DISABLE_PREFIX, (%rax) > +jz1f > + >callpreempt_schedule_irq >jmp0b > 1: > diff --git a/arch/x86/include/asm/nospec-branch.h > b/arch/x86/include/asm/nospec-branch.h > index 80dc14422495..0267611eb247 100644 > --- a/arch/x86/include/asm/nospec-branch.h > +++ b/arch/x86/include/asm/nospec-branch.h > @@ -52,6 +52,12 @@ >jnz771b;\ >add$(BITS_PER_LONG/8) * nr, sp; > > +/* > + * An empty REX-prefix is an indication that preemption should not take > place on > + * this instruction. > + */ > +#define PREEMPT_DISABLE_PREFIX (0x40) > + > #ifdef __ASSEMBLY__ > > /* > @@ -148,6 +154,12 @@ > #endif > .endm > > +.macro preempt_disable_prefix > +#ifdef CONFIG_PREEMPT > +.bytePREEMPT_DISABLE_PREFIX > +#endif > +.endm > + > #else /* __ASSEMBLY__ */ > > #define ANNOTATE_NOSPEC_ALTERNATIVE\ > -- > 2.17.1 >
[RFC PATCH 1/5] x86: introduce preemption disable prefix
It is sometimes beneficial to prevent preemption for very few
instructions, or prevent preemption for some instructions that precede
a branch (this latter case will be introduced in the next patches).

To provide such functionality on x86-64, we use an empty REX-prefix
(opcode 0x40) as an indication that preemption is disabled for the
following instruction.

It is expected that this opcode is not in common use.

Signed-off-by: Nadav Amit
---
 arch/x86/entry/entry_64.S            | 10 ++++++++++
 arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cb8a5893fd33..31d59aad496e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -643,6 +643,16 @@ retint_kernel:
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
+
+	/*
+	 * Allow to use hint to prevent preemption on a certain instruction.
+	 * Consider an instruction with the first byte having REX prefix
+	 * without any bits set as an indication for preemption disabled.
+	 */
+	movq	RIP(%rsp), %rax
+	cmpb	$PREEMPT_DISABLE_PREFIX, (%rax)
+	jz	1f
+
 	call	preempt_schedule_irq
 	jmp	0b
 1:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 80dc14422495..0267611eb247 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -52,6 +52,12 @@
 	jnz	771b;				\
 	add	$(BITS_PER_LONG/8) * nr, sp;
 
+/*
+ * An empty REX-prefix is an indication that preemption should not take place on
+ * this instruction.
+ */
+#define PREEMPT_DISABLE_PREFIX (0x40)
+
 #ifdef __ASSEMBLY__
 
 /*
@@ -148,6 +154,12 @@
 #endif
 .endm
 
+.macro preempt_disable_prefix
+#ifdef CONFIG_PREEMPT
+	.byte	PREEMPT_DISABLE_PREFIX
+#endif
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_NOSPEC_ALTERNATIVE		\
--
2.17.1
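For readers who prefer C, the check the patch adds to retint_kernel can be restated roughly as follows; this is illustrative only, the actual change is the assembly above.

#include <linux/types.h>
#include <asm/ptrace.h>

/* Illustrative restatement; the real logic lives in entry_64.S. */
static bool may_preempt_schedule_irq(struct pt_regs *regs)
{
	/* regs->ip is the interrupted kernel instruction.  If its first byte
	 * is the empty REX prefix (0x40), the entry code skips
	 * preempt_schedule_irq() and returns without rescheduling; this is
	 * the preempt_enable_no_resched()-like behaviour debated in the
	 * thread above. */
	return *(u8 *)regs->ip != PREEMPT_DISABLE_PREFIX;
}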