On Mon, 16 Dec 2019 at 16:44, Andrew Jones <drjo...@redhat.com> wrote:
>
> On Mon, Dec 16, 2019 at 03:40:16PM +0000, Peter Maydell wrote:
> > On Mon, 16 Dec 2019 at 15:14, Peter Maydell <peter.mayd...@linaro.org>
> > wrote:
> > > How does this interact with the usual register sync to/from
> > > KVM (ie kvm_arch_get_registers(), which I think will do a
> > > GET_ONE_REG read of the TIMER_CNT register the way it does
> > > any other sysreg, inside write_kvmstate_to_list(), plus
> > > kvm_arch_set_registers() which does the write back to the
> > > kernel in write_list_to_kvmstate()) ? Presumably we want this
> > > version to take precedence by the set_virtual_time call
> > > happening after the kvm_arch_set_registers, but is this
> > > guaranteed ?
> >
> > ...you might also want to look at the effects of simply
> > removing the KVM_REG_ARM_TIMER_CNT entry from the
> > 'non_runtime_cpregs[]' array -- in commit 4b7a6bf402bd064
> > we explicitly stopped reading/writing this register's value
> > to/from the kernel except for inbound migration, and it
> > feels like this patchset is now rolling back that approach,
> > so maybe we should also be (configurably) rolling back some
> > of its implementation rather than just leaving it in place.
>
> I feel like I already considered that, maybe even tried it, a few months
> ago when I first looked at this. I must have decided against it for some
> reason at the time, but I don't recall what. Now I can say the reason is
> because we only do this save/restore when we transition to/from paused
> state, though.
I found the thread which discussed the bug that originally caused us
to add commit 4b7a6bf402bd064:
https://lists.cs.columbia.edu/pipermail/kvmarm/2015-July/015665.html
-- there are some codepaths which cause us to do a sync from/to KVM
for one VCPU while the others are still running. If we do a
read-CNT-and-write-back then we effectively cause time to jump
backwards for the other still-running CPUs. So we do still want to
have TIMER_CNT listed as being KVM_PUT_FULL_STATE regardless, or we
re-introduce that bug.

Your approach in this patchset reads and writes on vm-paused, so it
won't have the pre-2015 problems. It still feels odd that we're
storing this bit of guest state in two places now, though -- in
kvm_vtime, and also in its usual place in the cpreg_array data
structures (we write back the value from kvm_vtime when the VM starts
running, and we write back the value from the cpreg_array for a
PUT_FULL_STATE, which the comments claim happens only on startup or
when we have just loaded migration state (and also, undocumented but
reasonable, on cpu-hotplug, which arm doesn't have yet)).

I've just spent a little while digging through the code, and haven't
been able to satisfy myself on the ordering of which writeback wins:
for a loadvm I think we first do a cpu_synchronize_all_post_init()
(writing back the counter value from the migration data) and then
after that we unpause the VM -- why doesn't this overwrite the
correct value with the wrong value from kvm_vtime ?

I also noticed that the logic used in this patch doesn't match what
other architectures do in their vm_state_change function -- eg
cpu_ppc_clock_vm_state_change() has an "if (running) { load } else
{ save }", and kvmclock_vm_state_change() for i386 also has
"if (running) { ... } else { ... }", though it has an extra wrinkle
where it captures "are we PAUSED?" to use in the pre_save function;
the comment above kvmclock_pre_save() suggests maybe that would be
useful for other than x86, too. kvm_s390_tod_vm_state_change() has
logic that's a slightly more complicated variation on just testing
the 'running' flag, but it doesn't look at the specific new state.

> > I note also that the commit message there had a remark
> > about inconsistencies between VCPUs -- is the right thing
> > to handle this per-VM rather than per-VCPU somehow?
>
> per-VM would make sense, because the counters should be synchronized
> among the VCPUs. KVM does that for us, though, so whichever VCPU last
> restores its counter is the one that will determine the final value.
>
> Maybe we should have a VM ioctl instead, but ATM we only have VCPU ioctls.

I meant more "only do the save/load once per VM in QEMU but do it by
working with just one VCPU". But I guess since migration works on all
the VCPUs we're ok to do pause-resume the same way (and I've now
tracked down what the 'inconsistencies between VCPUs' were: they're
what we got when we were syncing the CNT value for one vCPU while the
others were still running.)

thanks
-- PMM
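
PS: for concreteness, the "if (running) { load } else { save }" shape
I mean above is roughly the following. This is only a sketch: the
ExampleVtimeState type and the example_vtime_get()/example_vtime_put()
helpers are made-up names for illustration, not the patchset's or
kvmclock's actual code, and the handler signature assumed here is the
current VMChangeStateHandler one (an int 'running' plus the RunState);
headers are omitted.

/*
 * Hypothetical sketch only: type and helper names are illustrative,
 * not the real patchset code.
 */
typedef struct ExampleVtimeState {
    uint64_t cnt;   /* counter value last read out of KVM while stopped */
} ExampleVtimeState;

static void example_vtime_get(ExampleVtimeState *s);  /* KVM -> s->cnt */
static void example_vtime_put(ExampleVtimeState *s);  /* s->cnt -> KVM */

static void example_vtime_vm_state_change(void *opaque, int running,
                                          RunState state)
{
    ExampleVtimeState *s = opaque;

    if (running) {
        /* VM is (re)starting: write the stashed counter back into KVM */
        example_vtime_put(s);
    } else {
        /* VM is stopping, for whatever reason: stash the current counter */
        example_vtime_get(s);
    }
}

static void example_vtime_init(ExampleVtimeState *s)
{
    /* register the handler once at init time */
    qemu_add_vm_change_state_handler(example_vtime_vm_state_change, s);
}

The useful property of that shape is that the save arm runs on every
stop, whatever the reason (user pause, migration, gdbstub), without
needing to look at the specific new RunState.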