Re: KVM: x86: fix kvmclock write race (v2)

2015-04-20 Thread Radim Krčmář
2015-04-17 21:23-0300, Marcelo Tosatti:
 
 From: Radim Krčmář rkrc...@redhat.com
 
 As noted by Andy Lutomirski, kvm does not follow the documented version
 protocol. Fix it.
 
 Note: this bug results in a race which can occur if the following three
 conditions are met:
 
 1) There is KVM guest time update (there is one every 5 minutes).
 
 2) Which races with a thread in the guest in the following way:
 The execution of these 29 instructions has to take at _least_
 2 seconds (rebalance interval is 1 second).
 
 lsl%r9w,%esi
 mov%esi,%r8d
 and$0x3f,%esi
 and$0xfff,%r8d
 test   $0xfc0,%r8d
 jne0xa12 vread_pvclock+210
 shl$0x6,%rsi
 mov-0xa01000(%rsi),%r10d
 data32 xchg %ax,%ax
 data32 xchg %ax,%ax
 rdtsc  
 shl$0x20,%rdx
 mov%eax,%eax
 movsbl -0xa00fe4(%rsi),%ecx
 or %rax,%rdx
 sub-0xa00ff8(%rsi),%rdx
 mov-0xa00fe8(%rsi),%r11d
 mov%rdx,%rax
 shl%cl,%rax
 test   %ecx,%ecx
 js 0xa08 vread_pvclock+200
 mov%r11d,%edx
 movzbl -0xa00fe3(%rsi),%ecx
 mov-0xa00ff0(%rsi),%r11
 mul%rdx
 shrd   $0x20,%rdx,%rax
 data32 xchg %ax,%ax
 data32 xchg %ax,%ax
 lsl%r9w,%edx
 
 3) Scheduler moves the task, while executing these 29 instructions, to a
 destination processor, then back to the source processor.
 
 4) Source processor, after has been moved back from destination,
 perceives data out of order as written by processor performing guest
 time update (item 1), with string mov.
 
 Given the rarity of this condition, and the fact it was never observed
 or reported, reverting pvclock vsyscall on systems whose host is
 susceptible to the race, seems an excessive measure.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Cc: sta...@kernel.org

Thanks.

Reviewed-or-Signed-off-by: Radim Krčmář rkrc...@redhat.com

Like most code, I would have written it differently now :)

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 + kvm_write_guest_cached(v-kvm, vcpu-pv_time,
 + guest_hv_clock,
 + sizeof(guest_hv_clock));

The easiest optimization is replacing sizeof(guest_hv_clock) with
  offsetof(typeof(guest_hv_clock), version) + sizeof(guest_hv_clock.version)
because kvm_write_guest_cached() allows writing of prefixes.
This still won't get optimized to a simple MOV at compile time, but
saves few mov bytes.

(Offset of version is 0 now, so using 'sizeof guest_hv_clock.version' is
 just a minor offence sand saves some hard to read code.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM: x86: fix kvmclock write race (v2)

2015-04-17 Thread Marcelo Tosatti

From: Radim Krčmář rkrc...@redhat.com

As noted by Andy Lutomirski, kvm does not follow the documented version
protocol. Fix it.

Note: this bug results in a race which can occur if the following three
conditions are met:

1) There is KVM guest time update (there is one every 5 minutes).

2) Which races with a thread in the guest in the following way:
The execution of these 29 instructions has to take at _least_
2 seconds (rebalance interval is 1 second).

lsl%r9w,%esi
mov%esi,%r8d
and$0x3f,%esi
and$0xfff,%r8d
test   $0xfc0,%r8d
jne0xa12 vread_pvclock+210
shl$0x6,%rsi
mov-0xa01000(%rsi),%r10d
data32 xchg %ax,%ax
data32 xchg %ax,%ax
rdtsc  
shl$0x20,%rdx
mov%eax,%eax
movsbl -0xa00fe4(%rsi),%ecx
or %rax,%rdx
sub-0xa00ff8(%rsi),%rdx
mov-0xa00fe8(%rsi),%r11d
mov%rdx,%rax
shl%cl,%rax
test   %ecx,%ecx
js 0xa08 vread_pvclock+200
mov%r11d,%edx
movzbl -0xa00fe3(%rsi),%ecx
mov-0xa00ff0(%rsi),%r11
mul%rdx
shrd   $0x20,%rdx,%rax
data32 xchg %ax,%ax
data32 xchg %ax,%ax
lsl%r9w,%edx

3) Scheduler moves the task, while executing these 29 instructions, to a
destination processor, then back to the source processor.

4) Source processor, after has been moved back from destination,
perceives data out of order as written by processor performing guest
time update (item 1), with string mov.

Given the rarity of this condition, and the fact it was never observed
or reported, reverting pvclock vsyscall on systems whose host is
susceptible to the race, seems an excessive measure.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Cc: sta...@kernel.org

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cc2c759..e75fafd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1658,12 +1658,20 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
guest_hv_clock, sizeof(guest_hv_clock
return 0;
 
-   /*
-* The interface expects us to write an even number signaling that the
-* update is finished. Since the guest won't see the intermediate
-* state, we just increase by 2 at the end.
+   /* A guest can read other VCPU's kvmclock; specification says that
+* version is odd if data is being modified and even after it is
+* consistent.
+* We write three times to be sure.
+*  1) update version to odd number
+*  2) write modified data (version is still odd)
+*  3) update version to even number
 */
-   vcpu-hv_clock.version = guest_hv_clock.version + 2;
+   guest_hv_clock.version += 1;
+   kvm_write_guest_cached(v-kvm, vcpu-pv_time,
+   guest_hv_clock,
+   sizeof(guest_hv_clock));
+
+   vcpu-hv_clock.version = guest_hv_clock.version;
 
/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
pvclock_flags = (guest_hv_clock.flags  PVCLOCK_GUEST_STOPPED);
@@ -1684,6 +1692,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
kvm_write_guest_cached(v-kvm, vcpu-pv_time,
vcpu-hv_clock,
sizeof(vcpu-hv_clock));
+
+   vcpu-hv_clock.version += 1;
+   kvm_write_guest_cached(v-kvm, vcpu-pv_time,
+   vcpu-hv_clock,
+   sizeof(vcpu-hv_clock));
return 0;
 }
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html