subject:"\[Qemu\-devel\] \[Bug 1732959\] Re\: \[regression\] Clock jump on source VM after migration"

[Qemu-devel] [Bug 1732959] Re: [regression] Clock jump on source VM after migration

2018-06-21 Thread Neil Skrypuch

Actually, migration isn't required to reproduce this issue at all, it is
the stop/cont involved in the migration here that triggers the bug. It
is significantly easier to reproduce the bug with the following steps:

1) on host, adjtimex -f 1000
2) start guest
3) wait 20 minutes
4) stop and immediately after cont guest

At this point you will observe a time jump in the guest proportional to
the amount of time drift accumulated on the host since the last
stop/cont cycle (or guest startup if none). The easiest way to notice
the time jump in the guest is by running ntpd in the guest.

Step one above is optional, but if omitted, you must wait substantially
longer in step 3 (~1 day or more, depending on the amount of natural
clock drift on the host).

One upshot of this is that we now have a (very ugly and hacky)
workaround, which is to regularly (several times a day) issue a
stop/cont to every running VM, which smears the time jump out into
enough smaller pieces that it's not as bad as one bigger jump. It would
still be nice to have this bug fixed though.

** Summary changed:

- [regression] Clock jump on source VM after migration
+ [regression] stop/cont triggers clock jump proportional to host clock drift

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1732959

Title:
  [regression] stop/cont triggers clock jump proportional to host clock
  drift

Status in QEMU:
  New

Bug description:
  We (ab)use migration + block mirroring to perform transparent zero
  downtime VM backups. Basically:

  1) do a block mirror of the source VM's disk
  2) migrate the source VM to a destination VM using the disk copy
  3) cancel the block mirroring
  4) resume the source VM
  5) shut down the destination VM gracefully and move the disk to backup

  Relatively recently, the source VM's clock started jumping after step
  #4. More specifically, the clock jumps an amount of time proportional
  to the time since it was last migrated. With a week between
  migrations, clock jumps between ~2.5s and ~12s have been observed. For
  a particular host, the amount of clock jump is fairly consistent, but
  there is a large variation from one host to the next (this is likely
  down to hardware variations and the amount of NTP adjusted clock drift
  on the host).

  This is caused by a kernel regression which I was able to bisect. The
  result of the bisect was:

  108b249c453dd7132599ab6dc7e435a7036c193f is the first bad commit
  commit 108b249c453dd7132599ab6dc7e435a7036c193f
  Author: Paolo Bonzini 
  Date:   Thu Sep 1 14:21:03 2016 +0200

  KVM: x86: introduce get_kvmclock_ns
  
  Introduce a function that reads the exact nanoseconds value that is
  provided to the guest in kvmclock.  This crystallizes the notion of
  kvmclock as a thin veneer over a stable TSC, that the guest will
  (hopefully) convert with NTP.  In other words, kvmclock is *not* a
  paravirtualized host-to-guest NTP.
  
  Drop the get_kernel_ns() function, that was used both to get the base
  value of the master clock and to get the current value of kvmclock.
  The former use is replaced by ktime_get_boot_ns(), the latter is
  the purpose of get_kernel_ns().
  
  This also allows KVM to provide a Hyper-V time reference counter that
  is synchronized with the time that is computed from the TSC page.
  
  Reviewed-by: Roman Kagan 
  Signed-off-by: Paolo Bonzini 

  I am able to reproduce the issue with much newer kernels as well,
  including 4.12.5 and 4.9.6.

  Reliably reproducing the problem in isolation is difficult, as one
  must run a VM for many hours before the clock jump from this bug is
  noticeable over the clock jump inherent with a pause and resume of the
  VM. The reproducer I am including is set to run the VM for 18 hours
  before migration and looks for >= 150 ms of clock jump. On different
  hardware, you may need to let the VM run for more than 18 hours to
  reliably reproduce the issue.

  To reproduce the issue, please see the attached reproducer. The host
  needs to have perl, screen and socat installed for the backup script
  to work. Both the host and guest need to be running NTP (and NTP must
  autostart at boot in the guest). The host needs to be able to SSH into
  the guest using SSH keys (to measure the clock jump), so you will need
  to configure the network and SSH keys appropriately, then change the
  hardcoded IP address in checktime.sh and test.sh. I have only tested
  with CentOS 7 guests.

  The qemu command that gets run is in .kvmscreen (the destination VM's
  command line is programmatically constructed from this command as
  well), you may need to tweak the bridge configuration. Also, although
  the reproducer is relatively self contained, it has several built in
  assumptions that will break if the image file is not in the
  /var/lib/kvm directory or if the

[Qemu-devel] [Bug 1732959] Re: [regression] Clock jump on source VM after migration

2018-02-21 Thread Neil Skrypuch

As a further test, I disabled ntpd on the host and ran ntpdate via cron
every 12 hours, so that the clock would be relatively accurate, but no
clock skew would be involved. This also reproduced the failure as
initially described.

This is interesting as it means that a much simpler and faster
reproduction case is probably feasible, something like:

1) start guest
2) jump host clock by a few seconds
3) migrate + cont src guest
4) check clock in src guest

Though, I haven't tried this yet.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1732959

Title:
  [regression] Clock jump on source VM after migration

Status in QEMU:
  New

Bug description:
  We (ab)use migration + block mirroring to perform transparent zero
  downtime VM backups. Basically:

  1) do a block mirror of the source VM's disk
  2) migrate the source VM to a destination VM using the disk copy
  3) cancel the block mirroring
  4) resume the source VM
  5) shut down the destination VM gracefully and move the disk to backup

  Relatively recently, the source VM's clock started jumping after step
  #4. More specifically, the clock jumps an amount of time proportional
  to the time since it was last migrated. With a week between
  migrations, clock jumps between ~2.5s and ~12s have been observed. For
  a particular host, the amount of clock jump is fairly consistent, but
  there is a large variation from one host to the next (this is likely
  down to hardware variations and the amount of NTP adjusted clock drift
  on the host).

  This is caused by a kernel regression which I was able to bisect. The
  result of the bisect was:

  108b249c453dd7132599ab6dc7e435a7036c193f is the first bad commit
  commit 108b249c453dd7132599ab6dc7e435a7036c193f
  Author: Paolo Bonzini 
  Date:   Thu Sep 1 14:21:03 2016 +0200

  KVM: x86: introduce get_kvmclock_ns
  
  Introduce a function that reads the exact nanoseconds value that is
  provided to the guest in kvmclock.  This crystallizes the notion of
  kvmclock as a thin veneer over a stable TSC, that the guest will
  (hopefully) convert with NTP.  In other words, kvmclock is *not* a
  paravirtualized host-to-guest NTP.
  
  Drop the get_kernel_ns() function, that was used both to get the base
  value of the master clock and to get the current value of kvmclock.
  The former use is replaced by ktime_get_boot_ns(), the latter is
  the purpose of get_kernel_ns().
  
  This also allows KVM to provide a Hyper-V time reference counter that
  is synchronized with the time that is computed from the TSC page.
  
  Reviewed-by: Roman Kagan 
  Signed-off-by: Paolo Bonzini 

  I am able to reproduce the issue with much newer kernels as well,
  including 4.12.5 and 4.9.6.

  Reliably reproducing the problem in isolation is difficult, as one
  must run a VM for many hours before the clock jump from this bug is
  noticeable over the clock jump inherent with a pause and resume of the
  VM. The reproducer I am including is set to run the VM for 18 hours
  before migration and looks for >= 150 ms of clock jump. On different
  hardware, you may need to let the VM run for more than 18 hours to
  reliably reproduce the issue.

  To reproduce the issue, please see the attached reproducer. The host
  needs to have perl, screen and socat installed for the backup script
  to work. Both the host and guest need to be running NTP (and NTP must
  autostart at boot in the guest). The host needs to be able to SSH into
  the guest using SSH keys (to measure the clock jump), so you will need
  to configure the network and SSH keys appropriately, then change the
  hardcoded IP address in checktime.sh and test.sh. I have only tested
  with CentOS 7 guests.

  The qemu command that gets run is in .kvmscreen (the destination VM's
  command line is programmatically constructed from this command as
  well), you may need to tweak the bridge configuration. Also, although
  the reproducer is relatively self contained, it has several built in
  assumptions that will break if the image file is not in the
  /var/lib/kvm directory or if the monitor file is not in the
  /var/lib/kvm/monitor directory, or if the /backup directory does not
  exist. Finally, if you change the process name or socket name in
  .kvmscreen, you'll need to adjust the cleanup section in test.sh.

  With all of the above in place, run test.sh and check back in a little
  over 18 hours, part of the output should include something along these
  lines:

  Target not found (wanted 150, at 10)

  - or -

  Target found (wanted 150, found 340)

  If the target is reported as found, that means that we have probably
  reproduced the described issue.

  The version of QEMU in use does not appear to matter. At one point I
  tested every major version from 2.4 to

[Qemu-devel] [Bug 1732959] Re: [regression] Clock jump on source VM after migration

2018-02-09 Thread Neil Skrypuch

Two important findings:

1) If I disable ntpd on the host, this issue goes away.
2) If I forcefully induce substantial clock skew on the host (with adjtimex -f 
1000), it becomes much less time intensive to reproduce this issue. 
Using the attached reproducer but replacing the 18h sleep with a 20m sleep can 
still reliably reproduce the issue in this case.

So, this issue is definitely related to clock skew.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1732959

Title:
  [regression] Clock jump on source VM after migration

Status in QEMU:
  New

Bug description:
  We (ab)use migration + block mirroring to perform transparent zero
  downtime VM backups. Basically:

  1) do a block mirror of the source VM's disk
  2) migrate the source VM to a destination VM using the disk copy
  3) cancel the block mirroring
  4) resume the source VM
  5) shut down the destination VM gracefully and move the disk to backup

  Relatively recently, the source VM's clock started jumping after step
  #4. More specifically, the clock jumps an amount of time proportional
  to the time since it was last migrated. With a week between
  migrations, clock jumps between ~2.5s and ~12s have been observed. For
  a particular host, the amount of clock jump is fairly consistent, but
  there is a large variation from one host to the next (this is likely
  down to hardware variations and the amount of NTP adjusted clock drift
  on the host).

  This is caused by a kernel regression which I was able to bisect. The
  result of the bisect was:

  108b249c453dd7132599ab6dc7e435a7036c193f is the first bad commit
  commit 108b249c453dd7132599ab6dc7e435a7036c193f
  Author: Paolo Bonzini 
  Date:   Thu Sep 1 14:21:03 2016 +0200

  KVM: x86: introduce get_kvmclock_ns
  
  Introduce a function that reads the exact nanoseconds value that is
  provided to the guest in kvmclock.  This crystallizes the notion of
  kvmclock as a thin veneer over a stable TSC, that the guest will
  (hopefully) convert with NTP.  In other words, kvmclock is *not* a
  paravirtualized host-to-guest NTP.
  
  Drop the get_kernel_ns() function, that was used both to get the base
  value of the master clock and to get the current value of kvmclock.
  The former use is replaced by ktime_get_boot_ns(), the latter is
  the purpose of get_kernel_ns().
  
  This also allows KVM to provide a Hyper-V time reference counter that
  is synchronized with the time that is computed from the TSC page.
  
  Reviewed-by: Roman Kagan 
  Signed-off-by: Paolo Bonzini 

  I am able to reproduce the issue with much newer kernels as well,
  including 4.12.5 and 4.9.6.

  Reliably reproducing the problem in isolation is difficult, as one
  must run a VM for many hours before the clock jump from this bug is
  noticeable over the clock jump inherent with a pause and resume of the
  VM. The reproducer I am including is set to run the VM for 18 hours
  before migration and looks for >= 150 ms of clock jump. On different
  hardware, you may need to let the VM run for more than 18 hours to
  reliably reproduce the issue.

  To reproduce the issue, please see the attached reproducer. The host
  needs to have perl, screen and socat installed for the backup script
  to work. Both the host and guest need to be running NTP (and NTP must
  autostart at boot in the guest). The host needs to be able to SSH into
  the guest using SSH keys (to measure the clock jump), so you will need
  to configure the network and SSH keys appropriately, then change the
  hardcoded IP address in checktime.sh and test.sh. I have only tested
  with CentOS 7 guests.

  The qemu command that gets run is in .kvmscreen (the destination VM's
  command line is programmatically constructed from this command as
  well), you may need to tweak the bridge configuration. Also, although
  the reproducer is relatively self contained, it has several built in
  assumptions that will break if the image file is not in the
  /var/lib/kvm directory or if the monitor file is not in the
  /var/lib/kvm/monitor directory, or if the /backup directory does not
  exist. Finally, if you change the process name or socket name in
  .kvmscreen, you'll need to adjust the cleanup section in test.sh.

  With all of the above in place, run test.sh and check back in a little
  over 18 hours, part of the output should include something along these
  lines:

  Target not found (wanted 150, at 10)

  - or -

  Target found (wanted 150, found 340)

  If the target is reported as found, that means that we have probably
  reproduced the described issue.

  The version of QEMU in use does not appear to matter. At one point I
  tested every major version from 2.4 to 2.9 (inclusive) and reproduced
  the issue in all of them.

  This was

[Qemu-devel] [Bug 1732959] Re: [regression] Clock jump on source VM after migration

[Qemu-devel] [Bug 1732959] Re: [regression] Clock jump on source VM after migration

[Qemu-devel] [Bug 1732959] Re: [regression] Clock jump on source VM after migration

3 matches

Site Navigation

Mail list logo

Footer information