Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-02-01 Thread Jan Beulich
>>> On 01.02.18 at 16:20,  wrote:
> On 31/01/18 11:00, Jan Beulich wrote:
> On 30.01.18 at 20:26,  wrote:
>>> On 30/01/18 08:39, Jan Beulich wrote:
>>> On 29.01.18 at 16:38,  wrote:
> +/*
> + * We are the CPU performing the patching, and might have ended up here by
> + * hitting a breakpoint.
> + *
> + * Either way, we need to complete this particular patch to make forwards
> + * progress.  This logic is safe even if executed recursively in the
> + * breakpoint handler; the worst that will happen when normal execution
> + * resumes is that we will rewrite the same bytes a second time.
> + */
> +
> +/* First, insert a breakpoint to prevent execution of the patch site. */
> +i->addr[0] = 0xcc;
> +smp_wmb();
 This is necessary, but not sufficient when replacing more than a
 single insn: The other CPU may be executing instructions _after_
 the initial one that is being replaced, and ...

> +/* Second, copy the remaining instructions into place. */
> +memcpy(i->addr + 1, i->opcode + 1, i->len - 1);
 ... this may be altering things underneath its feet.
>>> Hmm.
>>>
>>> It is completely impossible to recover if another CPU hits the middle of
>>> this memcpy().  It is impossible to synchronise a specific write against
>>> the remote CPU pipelines.
>> It is not completely impossible, but the solution would be awkward
>> to use: If the code doing the patching knew instruction boundaries,
>> it could put breakpoints onto all of them. Or maybe it isn't all that
>> bad to use - we have an insn decoder after all.
> 
> An instruction decoder doesn't help.  Yes - it allows us to avoid
> executing a spliced instruction, but we can't recover/rewind state for a
> cpu which hits one of these latter breakpoints.

Oh, true. At least doing so would be very difficult.

>>> While we can arrange and test full NMI safety, we cannot test #MC safety
>>> at all.  Any code to try and implement #MC safety is going to be
>>> complicated, and make things worse if an #MC does hit.
>>>
>>> Therefore, I agree with the Linux view that trying to do anything for
>>> #MC safety is going to be worse than doing nothing at all.
>> I agree. But as said before - the immediate goal ought to be to
>> make alternatives patching safe, and that doesn't require any
>> of these considerations.
> 
> ? Of course it requires these considerations.

I don't understand, but see below.

> If we ignore the livepatching side of things (which includes an
> alternatives pass), then we can make the boot-time alternatives pass NMI
> safe by explicitly choosing to patch in NMI context before we've booted
> the APs.
> 
> Arranging for boot time alternatives to be NMI-safe is easy.  Arranging
> for livepatching to be NMI-safe is also fairly easy.
> 
> Arranging for anything, *even at boot time* to be #MC safe, is very
> complicated, and will be a large amount of code we cannot test.  Hence
> why I'm leaning towards the Linux solution for the #MC problem.

I could live with this.

> @@ -153,7 +231,31 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
>  void init_or_livepatch noinline
>  text_poke(void *addr, const void *opcode, size_t len, bool live)
>  {
> -memcpy(addr, opcode, len);
> +if ( !live || len == 1 )
> +{
> +/*
> + * If we know *addr can't be executed, or we are patching a single
> + * byte, it is safe to use a straight memcpy().
> + */
> +memcpy(addr, opcode, len);
 Is it really worth special casing this? Whether to actually ack
 patches 2 and 3 depends on that.
>>> Yes, and even more so if we are going to use an NMI rendezvous.  We
>>> definitely don't want to have an NMI rendezvous while applying
>>> alternatives as part of livepatch preparation.
>> Well, again - live patching should be the second step here.
> 
> You keep on saying this, and I can only conclude that you are confused
> as to which actions happen when.
> 
> At boot time, we do a single alternatives pass on the live .text section.
> 
> At livepatch load time, we do an alternatives pass on the incoming
> .text, but this is safe because the text definitely isn't being executed.
> 
> At livepatch apply time, we quiesce the system and do a patching pass on
> the live .text, most of which are `jmp disp32`.
> 
> 
> We therefore have two alternatives passes (one safe, one unsafe), and
> one (unsafe) livepatch pass.

I don't think I'm confused in any way. The approach you've chosen
in the patch - to complete the patching inside the NMI or #MC
handler - is entirely sufficient to deal with NMI and - unless there
are recursive ones - #MC. You don't need NMI context to do NMI
safe patching when only a single CPU is up.

Jan
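[Editorial note: the publish-then-flag ordering discussed above (fill in
live_poke_info, write barrier, raise active_text_patching, write barrier,
patch) can be mimicked in a small userspace sketch.  This is a hypothetical
C11 simulation, not Xen code: atomic_thread_fence() stands in for smp_wmb(),
and the structure/flag names merely mirror the patch.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's globals (not the real Xen code). */
struct live_poke_info {
    uint8_t *addr;
    const uint8_t *opcode;
    size_t len;
    unsigned int cpu;
};

static struct live_poke_info live_poke_info;
static atomic_bool active_text_patching;

/*
 * Publish the patching parameters before raising the flag, so that any
 * observer which sees the flag set is guaranteed to also see complete
 * info.  The release fence plays the role of smp_wmb().
 */
static void begin_patch(uint8_t *addr, const uint8_t *opcode, size_t len,
                        unsigned int cpu)
{
    live_poke_info = (struct live_poke_info){ addr, opcode, len, cpu };
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&active_text_patching, true, memory_order_relaxed);
}

static void end_patch(void)
{
    atomic_store_explicit(&active_text_patching, false, memory_order_relaxed);
}

/* Observer side: only consult live_poke_info if the flag is set. */
static bool observer_sees_patching(void)
{
    if ( !atomic_load_explicit(&active_text_patching, memory_order_relaxed) )
        return false;
    atomic_thread_fence(memory_order_acquire);
    return live_poke_info.addr != NULL;
}
```

The point of the sketch is only the ordering: an int3 handler that observes
the flag clear falls through to normal processing and never reads stale info.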

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-02-01 Thread Andrew Cooper
On 31/01/18 11:00, Jan Beulich wrote:
 On 30.01.18 at 20:26,  wrote:
>> On 30/01/18 08:39, Jan Beulich wrote:
>> On 29.01.18 at 16:38,  wrote:
 +/*
 + * We are the CPU performing the patching, and might have ended up here by
 + * hitting a breakpoint.
 + *
 + * Either way, we need to complete this particular patch to make forwards
 + * progress.  This logic is safe even if executed recursively in the
 + * breakpoint handler; the worst that will happen when normal execution
 + * resumes is that we will rewrite the same bytes a second time.
 + */
 +
 +/* First, insert a breakpoint to prevent execution of the patch site. */
 +i->addr[0] = 0xcc;
 +smp_wmb();
>>> This is necessary, but not sufficient when replacing more than a
>>> single insn: The other CPU may be executing instructions _after_
>>> the initial one that is being replaced, and ...
>>>
 +/* Second, copy the remaining instructions into place. */
 +memcpy(i->addr + 1, i->opcode + 1, i->len - 1);
>>> ... this may be altering things underneath its feet.
>> Hmm.
>>
>> It is completely impossible to recover if another CPU hits the middle of
>> this memcpy().  It is impossible to synchronise a specific write against
>> the remote CPU pipelines.
> It is not completely impossible, but the solution would be awkward
> to use: If the code doing the patching knew instruction boundaries,
> it could put breakpoints onto all of them. Or maybe it isn't all that
> bad to use - we have an insn decoder after all.

An instruction decoder doesn't help.  Yes - it allows us to avoid
executing a spliced instruction, but we can't recover/rewind state for a
cpu which hits one of these latter breakpoints.

>
>> The only way to be safe is to guarantee that CPUs can't hit the code to
>> begin with.
>>
>> On AMD, we can use STGI/CLGI to do this sensibly, and really really
>> inhibit all interrupts.
> Not really for #MC - clearing GIF may also result in shutdown when
> an #MC would need delivering.

With GIF clear, #MC is held pending if possible, but otherwise a
shutdown will occur.

From that point of view, it is slightly better than clearing CR4.MCE,
but not by much.

>
>>  For Intel, we can fix the NMI problem by
>> rendezvousing and running the patching loop in NMI context, at which
>> point the hardware latch will prevent further NMIs.
>>
>> However, there is literally nothing we can do to prevent #MC from
>> arriving.  We can stop servicing #MC by disabling CR4.MCE, but then the
>> processor will shut down.
> Not a good idea indeed.
>
>> We can't put a big barrier at the head of #MC handler which delays
>> processing, because we have no way to ensure that processors aren't
>> already later in the #MC handler.  Furthermore, attempting to do so
>> heightens the risk of hitting a shutdown condition from a second queued #MC.
>>
>>
>> The chance of an NMI/#MC colliding with patching is already
>> minuscule.  Alternatives and livepatching have been used (by XenServer
>> alone) in this form for nearly 3 years, across millions of servers,
>> without a single report of such a collision.  The chance of an #MC
>> collision is far less likely than an NMI collision, and in the best case.
>>
>> While we can arrange and test full NMI safety, we cannot test #MC safety
>> at all.  Any code to try and implement #MC safety is going to be
>> complicated, and make things worse if an #MC does hit.
>>
>> Therefore, I agree with the Linux view that trying to do anything for
>> #MC safety is going to be worse than doing nothing at all.
> I agree. But as said before - the immediate goal ought to be to
> make alternatives patching safe, and that doesn't require any
> of these considerations.

? Of course it requires these considerations.

If we ignore the livepatching side of things (which includes an
alternatives pass), then we can make the boot-time alternatives pass NMI
safe by explicitly choosing to patch in NMI context before we've booted
the APs.

Arranging for boot time alternatives to be NMI-safe is easy.  Arranging
for livepatching to be NMI-safe is also fairly easy.

Arranging for anything, *even at boot time* to be #MC safe, is very
complicated, and will be a large amount of code we cannot test.  Hence
why I'm leaning towards the Linux solution for the #MC problem.

>
 @@ -153,7 +231,31 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
  void init_or_livepatch noinline
  text_poke(void *addr, const void *opcode, size_t len, bool live)
  {
 -memcpy(addr, opcode, len);
 +if ( !live || len == 1 )
 +{
 +/*
 + * If we know *addr can't be executed, or we are patching a single
 + * byte, it is safe to use a straight memcpy().
 + */
 +memcpy(addr, opcode, len);
>>> Is it really worth special casing this? Whether to actually ack
>>> patches 2 and 3 depends on that.

Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-01-31 Thread Jan Beulich
>>> On 30.01.18 at 20:26,  wrote:
> On 30/01/18 08:39, Jan Beulich wrote:
> On 29.01.18 at 16:38,  wrote:
>>> +/*
>>> + * We are the CPU performing the patching, and might have ended up here by
>>> + * hitting a breakpoint.
>>> + *
>>> + * Either way, we need to complete this particular patch to make forwards
>>> + * progress.  This logic is safe even if executed recursively in the
>>> + * breakpoint handler; the worst that will happen when normal execution
>>> + * resumes is that we will rewrite the same bytes a second time.
>>> + */
>>> +
>>> +/* First, insert a breakpoint to prevent execution of the patch site. */
>>> +i->addr[0] = 0xcc;
>>> +smp_wmb();
>> This is necessary, but not sufficient when replacing more than a
>> single insn: The other CPU may be executing instructions _after_
>> the initial one that is being replaced, and ...
>>
>>> +/* Second, copy the remaining instructions into place. */
>>> +memcpy(i->addr + 1, i->opcode + 1, i->len - 1);
>> ... this may be altering things underneath its feet.
> 
> Hmm.
> 
> It is completely impossible to recover if another CPU hits the middle of
> this memcpy().  It is impossible to synchronise a specific write against
> the remote CPU pipelines.

It is not completely impossible, but the solution would be awkward
to use: If the code doing the patching knew instruction boundaries,
it could put breakpoints onto all of them. Or maybe it isn't all that
bad to use - we have an insn decoder after all.

> The only way to be safe is to guarantee that CPUs can't hit the code to
> begin with.
> 
> On AMD, we can use STGI/CLGI to do this sensibly, and really really
> inhibit all interrupts.

Not really for #MC - clearing GIF may also result in shutdown when
an #MC would need delivering.

>  For Intel, we can fix the NMI problem by
> rendezvousing and running the patching loop in NMI context, at which
> point the hardware latch will prevent further NMIs.
> 
> However, there is literally nothing we can do to prevent #MC from
> arriving.  We can stop servicing #MC by disabling CR4.MCE, but then the
> processor will shut down.

Not a good idea indeed.

> We can't put a big barrier at the head of #MC handler which delays
> processing, because we have no way to ensure that processors aren't
> already later in the #MC handler.  Furthermore, attempting to do so
> heightens the risk of hitting a shutdown condition from a second queued #MC.
> 
> 
> The chance of an NMI/#MC colliding with patching is already
> minuscule.  Alternatives and livepatching have been used (by XenServer
> alone) in this form for nearly 3 years, across millions of servers,
> without a single report of such a collision.  The chance of an #MC
> collision is far less likely than an NMI collision, and in the best case.
> 
> While we can arrange and test full NMI safety, we cannot test #MC safety
> at all.  Any code to try and implement #MC safety is going to be
> complicated, and make things worse if an #MC does hit.
> 
> Therefore, I agree with the Linux view that trying to do anything for
> #MC safety is going to be worse than doing nothing at all.

I agree. But as said before - the immediate goal ought to be to
make alternatives patching safe, and that doesn't require any
of these considerations.

>>> @@ -153,7 +231,31 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
>>>  void init_or_livepatch noinline
>>>  text_poke(void *addr, const void *opcode, size_t len, bool live)
>>>  {
>>> -memcpy(addr, opcode, len);
>>> +if ( !live || len == 1 )
>>> +{
>>> +/*
>>> + * If we know *addr can't be executed, or we are patching a single
>>> + * byte, it is safe to use a straight memcpy().
>>> + */
>>> +memcpy(addr, opcode, len);
>> Is it really worth special casing this? Whether to actually ack
>> patches 2 and 3 depends on that.
> 
> Yes, and even more so if we are going to use an NMI rendezvous.  We
> definitely don't want to have an NMI rendezvous while applying
> alternatives as part of livepatch preparation.

Well, again - live patching should be the second step here.

>>> +};
>>> +smp_wmb();
>>> +active_text_patching = true;
>>> +smp_wmb();
>>> +text_poke_live(NULL);
>>> +smp_wmb();
>>> +active_text_patching = false;
>> Perhaps better to zap live_poke_info.cpu again here? That could
>> in fact replace the separate active_text_patching flag.
> 
> No - it can't, because...
> 
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -119,6 +119,8 @@ boolean_param("ler", opt_ler);
>>>  #define stack_words_per_line 4
>>>  #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)regs->rsp)
>>>  
>>> +bool active_text_patching;
>> Why here rather than in alternative.c?
> 
> ... in the !LIVEPATCH case, alternative.c is an .init.o and the build
> fails because of a nonzero bss.
> 
> We cannot have a possibly-init variable

Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-01-31 Thread Andrew Cooper
On 31/01/18 06:07, Juergen Gross wrote:
> On 30/01/18 20:26, Andrew Cooper wrote:
>> However, there is literally nothing we can do to prevent #MC from
>> arriving.  We can stop servicing #MC by disabling CR4.MCE, but then the
>> processor will shut down.
> Hmm, there is a way to avoid #MC on other processors, but this
> requires the really big hammer: stop all other cpus and restart
> them after patching.

I think that firmly goes on the pile of "worse than doing nothing" :)

Also, it doesn't solve the problem of #MC arriving on the current processor.

~Andrew


Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-01-30 Thread Juergen Gross
On 30/01/18 20:26, Andrew Cooper wrote:
> However, there is literally nothing we can do to prevent #MC from
> arriving.  We can stop servicing #MC by disabling CR4.MCE, but then the
> processor will shut down.

Hmm, there is a way to avoid #MC on other processors, but this
requires the really big hammer: stop all other cpus and restart
them after patching.


Juergen


Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-01-30 Thread Andrew Cooper
On 30/01/18 08:39, Jan Beulich wrote:
 On 29.01.18 at 16:38,  wrote:
>> +bool init_or_livepatch text_poke_live(const struct cpu_user_regs *regs)
>> +{
>> +struct live_poke_info *i = &live_poke_info;
>> +
>> +if ( unlikely(i->cpu != smp_processor_id()) )
>> +{
>> +ASSERT(regs);
>> +
>> +/*
>> + * We hit a breakpoint, on a CPU which was not performing the
>> + * patching.  This is only expected to be possible for the NMI/#MC
>> + * paths, and even then, only if we hit the tiny race window below
>> + * while patching in the NMI/#MC handlers.
>> + *
>> + * We can't safely evaluate whether we hit a transient breakpoint
>> + * because i->cpu has likely completed the patch and moved on to the
>> + * next patch site.
>> + *
>> + * Go to sleep for a bit and synchronise the pipeline as we are now in
>> + * a cross-modifying scenario.
>> + */
>> +cpu_relax();
>> +cpuid_eax(0);
>> +
>> +/*
>> + * Presume that we hit the transient breakpoint, as we can't confirm
>> + * whether we did or not.  We don't expect there to be any other
>> + * breakpoints to hit, but if we did hit another one, then in the
>> + * worst case we will only livelock until i->cpu has finished all of
>> + * its patching.
>> + */
>> +return true;
>> +}
>> +
>> +/*
>> + * We are the CPU performing the patching, and might have ended up here by
>> + * hitting a breakpoint.
>> + *
>> + * Either way, we need to complete this particular patch to make forwards
>> + * progress.  This logic is safe even if executed recursively in the
>> + * breakpoint handler; the worst that will happen when normal execution
>> + * resumes is that we will rewrite the same bytes a second time.
>> + */
>> +
>> +/* First, insert a breakpoint to prevent execution of the patch site. */
>> +i->addr[0] = 0xcc;
>> +smp_wmb();
> This is necessary, but not sufficient when replacing more than a
> single insn: The other CPU may be executing instructions _after_
> the initial one that is being replaced, and ...
>
>> +/* Second, copy the remaining instructions into place. */
>> +memcpy(i->addr + 1, i->opcode + 1, i->len - 1);
> ... this may be altering things underneath its feet.

Hmm.

It is completely impossible to recover if another CPU hits the middle of
this memcpy().  It is impossible to synchronise a specific write against
the remote CPU pipelines.

The only way to be safe is to guarantee that CPUs can't hit the code to
begin with.

On AMD, we can use STGI/CLGI to do this sensibly, and really really
inhibit all interrupts.  For Intel, we can fix the NMI problem by
rendezvousing and running the patching loop in NMI context, at which
point the hardware latch will prevent further NMIs.

However, there is literally nothing we can do to prevent #MC from
arriving.  We can stop servicing #MC by disabling CR4.MCE, but then the
processor will shut down.

We can't put a big barrier at the head of #MC handler which delays
processing, because we have no way to ensure that processors aren't
already later in the #MC handler.  Furthermore, attempting to do so
heightens the risk of hitting a shutdown condition from a second queued #MC.


The chance of an NMI/#MC colliding with patching is already
minuscule.  Alternatives and livepatching have been used (by XenServer
alone) in this form for nearly 3 years, across millions of servers,
without a single report of such a collision.  The chance of an #MC
collision is far less likely than an NMI collision, and in the best case.

While we can arrange and test full NMI safety, we cannot test #MC safety
at all.  Any code to try and implement #MC safety is going to be
complicated, and make things worse if an #MC does hit.

Therefore, I agree with the Linux view that trying to do anything for
#MC safety is going to be worse than doing nothing at all.

>> @@ -153,7 +231,31 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
>>  void init_or_livepatch noinline
>>  text_poke(void *addr, const void *opcode, size_t len, bool live)
>>  {
>> -memcpy(addr, opcode, len);
>> +if ( !live || len == 1 )
>> +{
>> +/*
>> + * If we know *addr can't be executed, or we are patching a single
>> + * byte, it is safe to use a straight memcpy().
>> + */
>> +memcpy(addr, opcode, len);
> Is it really worth special casing this? Whether to actually ack
> patches 2 and 3 depends on that.

Yes, and even more so if we are going to use an NMI rendezvous.  We
definitely don't want to have an NMI rendezvous while applying
alternatives as part of livepatch preparation.

>
>> +};
>> +smp_wmb();
>> +active_text_patching = true;
>> +smp_wmb();
>> +text_poke_live(NULL);
>> +smp_wmb();
>> +active_text_patching = false;

Re: [Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-01-30 Thread Jan Beulich
>>> On 29.01.18 at 16:38,  wrote:
> +bool init_or_livepatch text_poke_live(const struct cpu_user_regs *regs)
> +{
> +struct live_poke_info *i = &live_poke_info;
> +
> +if ( unlikely(i->cpu != smp_processor_id()) )
> +{
> +ASSERT(regs);
> +
> +/*
> + * We hit a breakpoint, on a CPU which was not performing the
> + * patching.  This is only expected to be possible for the NMI/#MC
> + * paths, and even then, only if we hit the tiny race window below
> + * while patching in the NMI/#MC handlers.
> + *
> + * We can't safely evaluate whether we hit a transient breakpoint
> + * because i->cpu has likely completed the patch and moved on to the
> + * next patch site.
> + *
> + * Go to sleep for a bit and synchronise the pipeline as we are now in
> + * a cross-modifying scenario.
> + */
> +cpu_relax();
> +cpuid_eax(0);
> +
> +/*
> + * Presume that we hit the transient breakpoint, as we can't confirm
> + * whether we did or not.  We don't expect there to be any other
> + * breakpoints to hit, but if we did hit another one, then in the
> + * worst case we will only livelock until i->cpu has finished all of
> + * its patching.
> + */
> +return true;
> +}
> +
> +/*
> + * We are the CPU performing the patching, and might have ended up here by
> + * hitting a breakpoint.
> + *
> + * Either way, we need to complete this particular patch to make forwards
> + * progress.  This logic is safe even if executed recursively in the
> + * breakpoint handler; the worst that will happen when normal execution
> + * resumes is that we will rewrite the same bytes a second time.
> + */
> +
> +/* First, insert a breakpoint to prevent execution of the patch site. */
> +i->addr[0] = 0xcc;
> +smp_wmb();

This is necessary, but not sufficient when replacing more than a
single insn: The other CPU may be executing instructions _after_
the initial one that is being replaced, and ...

> +/* Second, copy the remaining instructions into place. */
> +memcpy(i->addr + 1, i->opcode + 1, i->len - 1);

... this may be altering things underneath its feet.

> @@ -153,7 +231,31 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
>  void init_or_livepatch noinline
>  text_poke(void *addr, const void *opcode, size_t len, bool live)
>  {
> -memcpy(addr, opcode, len);
> +if ( !live || len == 1 )
> +{
> +/*
> + * If we know *addr can't be executed, or we are patching a single
> + * byte, it is safe to use a straight memcpy().
> + */
> +memcpy(addr, opcode, len);

Is it really worth special casing this? Whether to actually ack
patches 2 and 3 depends on that.

> +}
> +else
> +{
> +/*
> + * If not, arrange safe patching via the use of breakpoints.  Ordering
> + * of actions here are between this CPU, and the instruction fetch of
> + * the breakpoint exception handler on any CPU.
> + */
> +live_poke_info = (struct live_poke_info){
> +addr, opcode, len, smp_processor_id()

Better use C99 field initializers here? At which point (together with
the next comment) it may become better not to use a compound
initializer in the first place.

> +};
> +smp_wmb();
> +active_text_patching = true;
> +smp_wmb();
> +text_poke_live(NULL);
> +smp_wmb();
> +active_text_patching = false;

Perhaps better to zap live_poke_info.cpu again here? That could
in fact replace the separate active_text_patching flag.

> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -119,6 +119,8 @@ boolean_param("ler", opt_ler);
>  #define stack_words_per_line 4
>  #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)regs->rsp)
>  
> +bool active_text_patching;

Why here rather than in alternative.c?

> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -530,7 +530,10 @@ ENTRY(page_fault)
>  movl  $TRAP_page_fault,4(%rsp)
>  /* No special register assumptions. */
>  GLOBAL(handle_exception)
> -SAVE_ALL CLAC
> +SAVE_ALL
> +
> +handle_exception_gprs_saved:
> +ASM_CLAC

I'm not convinced it is a good idea to defer the CLAC here. I see
no problem doing the CLAC below in the INT3 path before jumping
here.

> @@ -686,9 +689,36 @@ ENTRY(debug)
>  jmp   handle_exception
>  
>  ENTRY(int3)
> +/* For patching-safety, there must not be any alternatives here. */
>  pushq $0
>  movl  $TRAP_int3,4(%rsp)
> -jmp   handle_exception
> +
> +/* If there is no patching active, continue normally.  */
> +cmpb  $1, active_text_patching(%rip)

I think it is better to compare against zero in cases like this. But
then - is this safe? There's no guarantee th
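[Editorial note: the flag check being questioned above comes down to a simple
dispatch in the int3 path: only when active_text_patching is set does the
handler consult text_poke_live(), and only a "true" return (the transient
breakpoint, already handled) short-circuits normal int3 processing.  A
hypothetical userspace sketch of that control flow, with regs reduced to a
bare rip field, none of it real Xen code:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; names only mirror the patch under review. */
struct fake_regs { uintptr_t rip; };

static bool active_text_patching;
static uintptr_t transient_bp_resume;   /* &i->addr[1] in the real patch */

/* Returns true iff the breakpoint hit was the transient patching one. */
static bool text_poke_live_sim(const struct fake_regs *regs)
{
    return regs && regs->rip == transient_bp_resume;
}

/*
 * Mirror of the proposed int3 entry logic: with no patching active we
 * always fall through to the normal handler; otherwise the transient
 * breakpoint is consumed and execution resumes directly (iret).
 */
static bool int3_resumes_directly(const struct fake_regs *regs)
{
    if ( !active_text_patching )
        return false;               /* continue into handle_exception */
    return text_poke_live_sim(regs);
}
```

Any unrelated breakpoint (rip not at the patch site) still reaches the normal
debug path even while patching is active.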

[Xen-devel] [PATCH 4/5] x86/alternative: Implement NMI/#MC-safe patching

2018-01-29 Thread Andrew Cooper
Patching code which is being executed is problematic, because it is impossible
to arrange an atomic update of the instruction stream outside of a few corner
cases.  Furthermore, we have no feasible way to prevent execution of the NMI
and #MC exception handlers, yet we have patch points in them.

Use a breakpoint to capture execution which hits an in-progress patch, and
have the exception handler cope with the safety.

See the code comments for exact algorithm details.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Konrad Rzeszutek Wilk 
CC: Ross Lagerwall 
---
 xen/arch/x86/alternative.c  | 104 +++-
 xen/arch/x86/traps.c|   2 +
 xen/arch/x86/x86_64/entry.S |  34 -
 xen/include/asm-x86/processor.h |   2 +
 4 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
index 0b309c3..40bfaad 100644
--- a/xen/arch/x86/alternative.c
+++ b/xen/arch/x86/alternative.c
@@ -131,6 +131,84 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
 }
 }
 
+static init_or_livepatch_data struct live_poke_info {
+uint8_t *addr;
+const uint8_t *opcode;
+size_t len;
+unsigned int cpu;
+} live_poke_info;
+
+/*
+ * text_poke_live - Update the live .text area, in an interrupt-safe way.
+ *
+ * !!! WARNING - Reentrantly called from the int3 exception handler !!!
+ *
+ * All patching data are passed via the live_poke_info structure.  @regs is
+ * non-NULL if entering from an exception handler.
+ *
+ * Returns a boolean indicating whether the transient breakpoint was hit.
+ */
+bool init_or_livepatch text_poke_live(const struct cpu_user_regs *regs)
+{
+struct live_poke_info *i = &live_poke_info;
+
+if ( unlikely(i->cpu != smp_processor_id()) )
+{
+ASSERT(regs);
+
+/*
+ * We hit a breakpoint, on a CPU which was not performing the
+ * patching.  This is only expected to be possible for the NMI/#MC
+ * paths, and even then, only if we hit the tiny race window below
+ * while patching in the NMI/#MC handlers.
+ *
+ * We can't safely evaluate whether we hit a transient breakpoint
+ * because i->cpu has likely completed the patch and moved on to the
+ * next patch site.
+ *
+ * Go to sleep for a bit and synchronise the pipeline as we are now in
+ * a cross-modifying scenario.
+ */
+cpu_relax();
+cpuid_eax(0);
+
+/*
+ * Presume that we hit the transient breakpoint, as we can't confirm
+ * whether we did or not.  We don't expect there to be any other
+ * breakpoints to hit, but if we did hit another one, then in the
 + * worst case we will only livelock until i->cpu has finished all of
+ * its patching.
+ */
+return true;
+}
+
+/*
+ * We are the CPU performing the patching, and might have ended up here by
+ * hitting a breakpoint.
+ *
 + * Either way, we need to complete this particular patch to make forwards
+ * progress.  This logic is safe even if executed recursively in the
+ * breakpoint handler; the worst that will happen when normal execution
+ * resumes is that we will rewrite the same bytes a second time.
+ */
+
+/* First, insert a breakpoint to prevent execution of the patch site. */
+i->addr[0] = 0xcc;
+smp_wmb();
+/* Second, copy the remaining instructions into place. */
+memcpy(i->addr + 1, i->opcode + 1, i->len - 1);
+smp_wmb();
+/* Third, replace the breakpoint with the real instruction byte. */
+i->addr[0] = i->opcode[0];
+
+/*
+ * Inform our caller whether we hit the transient breakpoint (in which
+ * case we iret normally, having already finished the patch), or whether
+ * we need to continue further into the debug trap handler.
+ */
+return regs && regs->rip == (unsigned long)&i->addr[1];
+}
+
 /*
  * text_poke - Update instructions on a live kernel or non-executed code.
  * @addr: address to modify
@@ -153,7 +231,31 @@ void init_or_livepatch add_nops(void *insns, unsigned int len)
 void init_or_livepatch noinline
 text_poke(void *addr, const void *opcode, size_t len, bool live)
 {
-memcpy(addr, opcode, len);
+if ( !live || len == 1 )
+{
+/*
+ * If we know *addr can't be executed, or we are patching a single
+ * byte, it is safe to use a straight memcpy().
+ */
+memcpy(addr, opcode, len);
+}
+else
+{
+/*
+ * If not, arrange safe patching via the use of breakpoints.  Ordering
+ * of actions here are between this CPU, and the instruction fetch of
+ * the breakpoint exception handler on any CPU.
+ */
+live_poke_info = (struct live_poke_info){
+addr, opcode, len, smp_processor_id()
+};
+smp_wmb();
+active_text_patching = true;
+