Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-07-28 Thread Alexander Graf


On 25.06.14 02:21, Scott Wood wrote:

On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:

On 25.06.14 01:15, Scott Wood wrote:

On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:

On 24.06.14 20:53, Scott Wood wrote:

The timer interrupt works, but I'm not fully convinced that it's a good

idea for things like MC events which we also block during critical
sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

  Machine check interrupts are typically caused by a hardware or
  memory subsystem failure or by an attempt to access an invalid
  address. They may be caused indirectly by execution of an
  instruction, but may not be recognized or reported until long
  after the processor has executed past the instruction that
  caused the machine check. As such, machine check interrupts are
  not thought of as synchronous or asynchronous nor as precise or
  imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

So what value would you set the timer to? If the value is too small, we
never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us?

Single stepping would be better, though.


Single stepping is hard enough to get right on interaction between QEMU,
KVM and the guest. I didn't really want to make that stuff any more
complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.

What if the guest is single stepping itself? How do we determine when to
unset the bit again? When we get out of the critical section? How do we
know what the value was before we set it?

Keep track of each requester of single stepping separately, and only
ever set the real bit by ORing them.


Considering that Paul started working on integrating the in-kernel 
emulator with KVM I think we're best off to just wait for that one and 
then use it :).



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-07-28 Thread Alexander Graf


On 25.06.14 02:21, Scott Wood wrote:

On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:

On 25.06.14 01:15, Scott Wood wrote:

On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:

On 24.06.14 20:53, Scott Wood wrote:

The timer interrupt works, but I'm not fully convinced that it's a good

idea for things like MC events which we also block during critical
sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

  Machine check interrupts are typically caused by a hardware or
  memory subsystem failure or by an attempt to access an invalid
  address. They may be caused indirectly by execution of an
  instruction, but may not be recognized or reported until long
  after the processor has executed past the instruction that
  caused the machine check. As such, machine check interrupts are
  not thought of as synchronous or asynchronous nor as precise or
  imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

So what value would you set the timer to? If the value is too small, we
never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us?

Single stepping would be better, though.


Single stepping is hard enough to get right on interaction between QEMU,
KVM and the guest. I didn't really want to make that stuff any more
complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.

What if the guest is single stepping itself? How do we determine when to
unset the bit again? When we get out of the critical section? How do we
know what the value was before we set it?

Keep track of each requester of single stepping separately, and only
ever set the real bit by ORing them.


Considering that Paul started working on integrating the in-kernel 
emulator with KVM I think we're best off to just wait for that one and 
then use it :).



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
 Howdy,
 
 Ben reminded me a while back that we have a nasty race in our KVM PV code.
 
 We replace a few instructions with longer streams of instructions to check
 whether it's necessary to trap out from it (like mtmsr, no need to trap if
 we only disable interrupts). During those replacement chunks we must not get
 any interrupts, because they might overwrite scratch space that we already
 used to save otherwise clobbered register state into.
 
 So we have a thing called critical sections which allows us to atomically
 get in and out of interrupt disabled modes without touching MSR. When we
 are supposed to deliver an interrupt into the guest while we are in a critical
 section, we just don't inject the interrupt yet, but leave it be until the
 next trap.
 
 However, we never really know when the next trap would be. For all we know it
 could be never. At this point we created a race that is a potential source
 for interrupt loss or at least deferral.
 
 This patch set aims at solving the race. Instead of merely deferring an
 interrupt when we see such a situation, we go into a special instruction
 interpretation mode. In this mode, we interpret all PPC assembler instructions
 that happen until we are out of the critical section again, at which point
 we can now inject the interrupt.
 
 This bug only affects KVM implementations that make use of the magic page, so
 e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called critical sections which allows us to atomically
get in and out of interrupt disabled modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?


There are a few other alternatives to this implementation:

  1) Unmap the magic page, emulate all memory access to it while in 
critical and irq pending
  2) Trigger a timer that sends a request to the vcpu to wake it from 
potential sleep and inject the irq

  3) Single step until we're beyond the critical section
  4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds 
complexity to the MMU mapping code, since we need to make sure we don't 
map it back in on demand and treat faults to it specially.


The timer interrupt works, but I'm not fully convinced that it's a good 
idea for things like MC events which we also block during critical 
sections on e500v2.


Single stepping is hard enough to get right on interaction between QEMU, 
KVM and the guest. I didn't really want to make that stuff any more 
complicated.


This approach is really just one out of many - and it's one that's 
nicely self-contained and shouldn't have any impact at all on 
implementations that don't care about it ;).



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
 On 24.06.14 20:53, Scott Wood wrote:
  On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
  Howdy,
 
  Ben reminded me a while back that we have a nasty race in our KVM PV code.
 
  We replace a few instructions with longer streams of instructions to check
  whether it's necessary to trap out from it (like mtmsr, no need to trap if
  we only disable interrupts). During those replacement chunks we must not 
  get
  any interrupts, because they might overwrite scratch space that we already
  used to save otherwise clobbered register state into.
 
  So we have a thing called critical sections which allows us to atomically
  get in and out of interrupt disabled modes without touching MSR. When we
  are supposed to deliver an interrupt into the guest while we are in a 
  critical
  section, we just don't inject the interrupt yet, but leave it be until the
  next trap.
 
  However, we never really know when the next trap would be. For all we know 
  it
  could be never. At this point we created a race that is a potential source
  for interrupt loss or at least deferral.
 
  This patch set aims at solving the race. Instead of merely deferring an
  interrupt when we see such a situation, we go into a special instruction
  interpretation mode. In this mode, we interpret all PPC assembler 
  instructions
  that happen until we are out of the critical section again, at which point
  we can now inject the interrupt.
 
  This bug only affects KVM implementations that make use of the magic page, 
  so
  e500v2, book3s_32 and book3s_64 PR KVM.
  Would it be possible to single step through the critical section
  instead?  Or set a high res timer to expire very quickly?
 
 There are a few other alternatives to this implementation:
 
1) Unmap the magic page, emulate all memory access to it while in 
 critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from 
 potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)
 
 Each has their good and bad sides. Unmapping the magic page adds 
 complexity to the MMU mapping code, since we need to make sure we don't 
 map it back in on demand and treat faults to it specially.
 
 The timer interrupt works, but I'm not fully convinced that it's a good 
 idea for things like MC events which we also block during critical 
 sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

Machine check interrupts are typically caused by a hardware or
memory subsystem failure or by an attempt to access an invalid
address. They may be caused indirectly by execution of an
instruction, but may not be recognized or reported until long
after the processor has executed past the instruction that
caused the machine check. As such, machine check interrupts are
not thought of as synchronous or asynchronous nor as precise or
imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

 Single stepping is hard enough to get right on interaction between QEMU, 
 KVM and the guest. I didn't really want to make that stuff any more 
 complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.

 This approach is really just one out of many - and it's one that's 
 nicely self-contained and shouldn't have any impact at all on 
 implementations that don't care about it ;).

Nicely self-contained is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 25.06.14 01:15, Scott Wood wrote:

On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:

On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called critical sections which allows us to atomically
get in and out of interrupt disabled modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

There are a few other alternatives to this implementation:

1) Unmap the magic page, emulate all memory access to it while in
critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from
potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds
complexity to the MMU mapping code, since we need to make sure we don't
map it back in on demand and treat faults to it specially.

The timer interrupt works, but I'm not fully convinced that it's a good
idea for things like MC events which we also block during critical
sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

 Machine check interrupts are typically caused by a hardware or
 memory subsystem failure or by an attempt to access an invalid
 address. They may be caused indirectly by execution of an
 instruction, but may not be recognized or reported until long
 after the processor has executed past the instruction that
 caused the machine check. As such, machine check interrupts are
 not thought of as synchronous or asynchronous nor as precise or
 imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.


So what value would you set the timer to? If the value is too small, we 
never finish the critical section. If it's too big, we add lots of jitter.





Single stepping is hard enough to get right on interaction between QEMU,
KVM and the guest. I didn't really want to make that stuff any more
complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.


What if the guest is single stepping itself? How do we determine when to 
unset the bit again? When we get out of the critical section? How do we 
know what the value was before we set it?





This approach is really just one out of many - and it's one that's
nicely self-contained and shouldn't have any impact at all on
implementations that don't care about it ;).

Nicely self-contained is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.


It means that there's only a single entry point for when the code gets 
executed, not that it's very little code.


Eventually this emulation code should get merged with the already 
existing in-kernel emulation code. Paul had already started work to 
merge the emulators a while ago. He even measured speedups when he sent 
all real mode and split real mode code via the interpreter rather than 
the entry/exit dance we do today.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:
 On 25.06.14 01:15, Scott Wood wrote:
  On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
  On 24.06.14 20:53, Scott Wood wrote:
  The timer interrupt works, but I'm not fully convinced that it's a good
  idea for things like MC events which we also block during critical
  sections on e500v2.
  Are you concerned about the guest seeing machine checks that are (more)
  asynchronous with the error condition?  e500v2 machine checks are always
  asynchronous.  From the core manual:
 
   Machine check interrupts are typically caused by a hardware or
   memory subsystem failure or by an attempt to access an invalid
   address. They may be caused indirectly by execution of an
   instruction, but may not be recognized or reported until long
   after the processor has executed past the instruction that
   caused the machine check. As such, machine check interrupts are
   not thought of as synchronous or asynchronous nor as precise or
   imprecise.
 
  I don't think the lag would be a problem, and certainly it's better than
  the current situation.
 
 So what value would you set the timer to? If the value is too small, we 
 never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us?

Single stepping would be better, though.

  Single stepping is hard enough to get right on interaction between QEMU,
  KVM and the guest. I didn't really want to make that stuff any more
  complicated.
  I'm not sure that it would add much complexity.  We'd just need to check
  whether any source other than the magic page turned wants DCBR0_IC on,
  to determine whether to exit to userspace or not.
 
 What if the guest is single stepping itself? How do we determine when to 
 unset the bit again? When we get out of the critical section? How do we 
 know what the value was before we set it?

Keep track of each requester of single stepping separately, and only
ever set the real bit by ORing them.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
 Howdy,
 
 Ben reminded me a while back that we have a nasty race in our KVM PV code.
 
 We replace a few instructions with longer streams of instructions to check
 whether it's necessary to trap out from it (like mtmsr, no need to trap if
 we only disable interrupts). During those replacement chunks we must not get
 any interrupts, because they might overwrite scratch space that we already
 used to save otherwise clobbered register state into.
 
 So we have a thing called critical sections which allows us to atomically
 get in and out of interrupt disabled modes without touching MSR. When we
 are supposed to deliver an interrupt into the guest while we are in a critical
 section, we just don't inject the interrupt yet, but leave it be until the
 next trap.
 
 However, we never really know when the next trap would be. For all we know it
 could be never. At this point we created a race that is a potential source
 for interrupt loss or at least deferral.
 
 This patch set aims at solving the race. Instead of merely deferring an
 interrupt when we see such a situation, we go into a special instruction
 interpretation mode. In this mode, we interpret all PPC assembler instructions
 that happen until we are out of the critical section again, at which point
 we can now inject the interrupt.
 
 This bug only affects KVM implementations that make use of the magic page, so
 e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called critical sections which allows us to atomically
get in and out of interrupt disabled modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?


There are a few other alternatives to this implementation:

  1) Unmap the magic page, emulate all memory access to it while in 
critical and irq pending
  2) Trigger a timer that sends a request to the vcpu to wake it from 
potential sleep and inject the irq

  3) Single step until we're beyond the critical section
  4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds 
complexity to the MMU mapping code, since we need to make sure we don't 
map it back in on demand and treat faults to it specially.


The timer interrupt works, but I'm not fully convinced that it's a good 
idea for things like MC events which we also block during critical 
sections on e500v2.


Single stepping is hard enough to get right on interaction between QEMU, 
KVM and the guest. I didn't really want to make that stuff any more 
complicated.


This approach is really just one out of many - and it's one that's 
nicely self-contained and shouldn't have any impact at all on 
implementations that don't care about it ;).



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
 On 24.06.14 20:53, Scott Wood wrote:
  On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
  Howdy,
 
  Ben reminded me a while back that we have a nasty race in our KVM PV code.
 
  We replace a few instructions with longer streams of instructions to check
  whether it's necessary to trap out from it (like mtmsr, no need to trap if
  we only disable interrupts). During those replacement chunks we must not 
  get
  any interrupts, because they might overwrite scratch space that we already
  used to save otherwise clobbered register state into.
 
  So we have a thing called critical sections which allows us to atomically
  get in and out of interrupt disabled modes without touching MSR. When we
  are supposed to deliver an interrupt into the guest while we are in a 
  critical
  section, we just don't inject the interrupt yet, but leave it be until the
  next trap.
 
  However, we never really know when the next trap would be. For all we know 
  it
  could be never. At this point we created a race that is a potential source
  for interrupt loss or at least deferral.
 
  This patch set aims at solving the race. Instead of merely deferring an
  interrupt when we see such a situation, we go into a special instruction
  interpretation mode. In this mode, we interpret all PPC assembler 
  instructions
  that happen until we are out of the critical section again, at which point
  we can now inject the interrupt.
 
  This bug only affects KVM implementations that make use of the magic page, 
  so
  e500v2, book3s_32 and book3s_64 PR KVM.
  Would it be possible to single step through the critical section
  instead?  Or set a high res timer to expire very quickly?
 
 There are a few other alternatives to this implementation:
 
1) Unmap the magic page, emulate all memory access to it while in 
 critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from 
 potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)
 
 Each has their good and bad sides. Unmapping the magic page adds 
 complexity to the MMU mapping code, since we need to make sure we don't 
 map it back in on demand and treat faults to it specially.
 
 The timer interrupt works, but I'm not fully convinced that it's a good 
 idea for things like MC events which we also block during critical 
 sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

Machine check interrupts are typically caused by a hardware or
memory subsystem failure or by an attempt to access an invalid
address. They may be caused indirectly by execution of an
instruction, but may not be recognized or reported until long
after the processor has executed past the instruction that
caused the machine check. As such, machine check interrupts are
not thought of as synchronous or asynchronous nor as precise or
imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

 Single stepping is hard enough to get right on interaction between QEMU, 
 KVM and the guest. I didn't really want to make that stuff any more 
 complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.

 This approach is really just one out of many - and it's one that's 
 nicely self-contained and shouldn't have any impact at all on 
 implementations that don't care about it ;).

Nicely self-contained is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 25.06.14 01:15, Scott Wood wrote:

On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:

On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called critical sections which allows us to atomically
get in and out of interrupt disabled modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

There are a few other alternatives to this implementation:

1) Unmap the magic page, emulate all memory access to it while in
critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from
potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds
complexity to the MMU mapping code, since we need to make sure we don't
map it back in on demand and treat faults to it specially.

The timer interrupt works, but I'm not fully convinced that it's a good
idea for things like MC events which we also block during critical
sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

 Machine check interrupts are typically caused by a hardware or
 memory subsystem failure or by an attempt to access an invalid
 address. They may be caused indirectly by execution of an
 instruction, but may not be recognized or reported until long
 after the processor has executed past the instruction that
 caused the machine check. As such, machine check interrupts are
 not thought of as synchronous or asynchronous nor as precise or
 imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.


So what value would you set the timer to? If the value is too small, we 
never finish the critical section. If it's too big, we add lots of jitter.





Single stepping is hard enough to get right on interaction between QEMU,
KVM and the guest. I didn't really want to make that stuff any more
complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page turned wants DCBR0_IC on,
to determine whether to exit to userspace or not.


What if the guest is single stepping itself? How do we determine when to 
unset the bit again? When we get out of the critical section? How do we 
know what the value was before we set it?





This approach is really just one out of many - and it's one that's
nicely self-contained and shouldn't have any impact at all on
implementations that don't care about it ;).

Nicely self-contained is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.


It means that there's only a single entry point for when the code gets 
executed, not that it's very little code.


Eventually this emulation code should get merged with the already 
existing in-kernel emulation code. Paul had already started work to 
merge the emulators a while ago. He even measured speedups when he sent 
all real mode and split real mode code via the interpreter rather than 
the entry/exit dance we do today.



Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:
 On 25.06.14 01:15, Scott Wood wrote:
  On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
  On 24.06.14 20:53, Scott Wood wrote:
  The timer interrupt works, but I'm not fully convinced that it's a good
  idea for things like MC events which we also block during critical
  sections on e500v2.
  Are you concerned about the guest seeing machine checks that are (more)
  asynchronous with the error condition?  e500v2 machine checks are always
  asynchronous.  From the core manual:
 
   Machine check interrupts are typically caused by a hardware or
   memory subsystem failure or by an attempt to access an invalid
   address. They may be caused indirectly by execution of an
   instruction, but may not be recognized or reported until long
   after the processor has executed past the instruction that
   caused the machine check. As such, machine check interrupts are
   not thought of as synchronous or asynchronous nor as precise or
   imprecise.
 
  I don't think the lag would be a problem, and certainly it's better than
  the current situation.
 
 So what value would you set the timer to? If the value is too small, we 
 never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us?

Single stepping would be better, though.

  Single stepping is hard enough to get right on interaction between QEMU,
  KVM and the guest. I didn't really want to make that stuff any more
  complicated.
  I'm not sure that it would add much complexity.  We'd just need to check
  whether any source other than the magic page turned wants DCBR0_IC on,
  to determine whether to exit to userspace or not.
 
 What if the guest is single stepping itself? How do we determine when to 
 unset the bit again? When we get out of the critical section? How do we 
 know what the value was before we set it?

Keep track of each requester of single stepping separately, and only
ever set the real bit by ORing them.

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html