On 17.01.2013, at 00:01, Marcelo Tosatti wrote: > On Wed, Jan 16, 2013 at 09:41:54PM +0100, Christian Borntraeger wrote: >> On 16/01/13 21:21, Marcelo Tosatti wrote: >>> On Wed, Jan 16, 2013 at 09:03:20PM +0100, Christian Borntraeger wrote: >>>> On 16/01/13 17:05, Marcelo Tosatti wrote: >>>> >>>>> The S/390 problem, from >>>>> http://lists.nongnu.org/archive/html/qemu-devel/2012-11/msg02213.html: >>>>> >>>>> ">>> The kvm register sync needs to happen in the kvm register sync >>>>>>>> function :) >>>>>>> That would eliminate the whole purpose of sync regs and forces us to >>>>>>> have an >>>>>>> expensive ioctl on lots of exits (again). I would prefer to sync the >>>>>>> registers >>>>>>> that we never need in qemu just here. >>>>>> >>>>>> That's why the register sync has different stages. >>>>> >>>>> Not the get_register. Which is called on every synchronize_state. Which >>>>> happen >>>>> quite often >>>>> on s390." >>>>> >>>>> But wait: on these S/390 codepaths, you do GET_REGS already, via >>>>> cpu_synchronize_state. >>>>> >>>>> So on S/390 >>>>> >>>>> - cpu_synchronize_state(env) >>>>> - read any register from env >>>>> >>>>> Is not valid? This is what generic code assumes. >>>> >>>> TO recap the motiviation: >>>> >>>> cpu_synchronize_state on s390 currently updates any register in env that is >>>> used by qemu (general purpose, prefix, psw, control and access) in the >>>> normal >>>> runtime. it turns out we have all of these regs in kvm_run, so we can do >>>> synchronize states without doing an additional ioctl call. >>>> Now, for life migration and dump we need some additional registers (which >>>> are >>>> only accessable via onereg interface). So synchronize_state would need to >>>> do 3 or 4 additional system calls on the hot path, only to take care of >>>> something that is not on the hot path at all. >>>> For historic reasons, we have one exit code for almost all exits. >>>> Therefore, >>>> we need to call synchronize_states almost always. >>>> We could now start to have a poor mans synchronize_state in arch code, but >>>> that would collide with common code synchronize_state if done at the wrong >>>> time. Thus we want to make common code capable of having only a subset of >>>> the register synched - by making it possible to sync the other regs later >>>> on if needed without wiping the former sync. >>>> >>>> Makes sense? >>>> >>>> Christian >>> >>> Yes. As noted in the last email on the thread, runtime/reset/full are to >>> serapate sets of registers when writing _to_ kernel. When reading _from_ >>> kernel, reset and full distinctions are not appropriate (any register >>> can change, as far as knowledge goes). >> >> Hmm, I probably did not understood your point, so I will try to explain mine >> and see what you respond :-) >> >> The point of the patch set, is to allow this distinction when reading. >> In other words it allows code to state: I am only interested in regxy and >> dont >> care if the other regs in env are out of sync. > > Fine. > >> If a full sync is necessary later on the other regs are synched as well. >> If a full sync was already done before a partial get becomes a no-op. > > -> FULL is the set of registers written when loadvm/initialization is > performed. > -> RESET, a subset of full, is a set of registers written on SYSTEM > RESET. > -> RUNTIME, a subset of RESET, is a set of registers written during > RUNTIME. > > To write both the RESET and FULL set of registers during runtime, > contradicts the description above for both RESET and FULL. > > Two examples from i386: > > if (level == KVM_PUT_FULL_STATE) { > /* > * KVM is yet unable to synchronize TSC values of multiple VCPUs > * on > * writeback. Until this is fixed, we only write the offset to > * SMP > * guests after migration, desynchronizing the VCPUs, but > * avoiding > * huge jump-backs that would occur without any writeback at > * all. > */ > ... > } > > And: > > /* > * The following paravirtual MSRs have side effects on the guest or > * are > * too heavy for normal writeback. Limit them to reset or full state > * updates. > */ > >> Why should that be not possible. > > It should, but separately from FULL/RESET/RUNTIME distinction. > This sequence > > get_regs(FULLSTATE) > put_regs(FULLSTATE) > > During runtime is not allowed. And only syncing the RUNTIME set of > registers during and leaving the FULL set of registers marked as > dirty is confusing also. > > So perhaps what you'd want is selective read/write of RUNTIME registers > as suggested. > > > Date: Fri, 4 Jan 2013 23:49:42 -0200 > From: Marcelo Tosatti <mtosa...@redhat.com> > To: "Jason J. Herne" <jjhe...@linux.vnet.ibm.com> > Cc: Alexander Graf <ag...@suse.de>, > Bhushan Bharat-R65777 <r65...@freescale.com>, > Christian Borntraeger <borntrae...@de.ibm.com>, > Anthony Liguori <aligu...@us.ibm.com>, > "qemu-devel@nongnu.org qemu-devel" <qemu-devel@nongnu.org> > Subject: Re: [Qemu-devel] [PATCH 7/7] KVM regsync: Fix > do_kvm_cpu_synchronize_state data integrity issue > > On Fri, Jan 04, 2013 at 10:25:45AM -0500, Jason J. Herne wrote: >> If I've followed the conversation correctly this is what needs to be done: >> >> 1. Remove the level parameters from kvm_arch_get_registers and >> kvm_arch_put_registers. >> >> 2. Add a new bitmap parameter to kvm_arch_get_registers and >> kvm_arch_put_registers. >> >> 3. Define a bit that correlates to our current notion of "all >> runtime registers". This bit, and all bits in this bitmap, would be >> architecture specific. >> >> 4. Remove the cpustate->kvm_sync_dirty field. Replace it with a >> bitmap that tracks which bits are dirty and need to be synced back >> to KVM-land. >> >> 5. As we do today, we'll assume registers are dirty and turn on >> their corresponding bit in this new bitmap whenever we "get" the >> registers from KVM. >> >> 6. Add other bits as needed on a case by case basis. >> >> Does this seem to match what was discussed, and what we want to do? >> >> >> -- >> -- Jason J. Herne (jjhe...@linux.vnet.ibm.com) > > The s390 change to use runtime_state is self-contained and could be > integrated now from my pov. > > The suggestion was based on the observation that the ppc guys are trying > to fix their timer problem and that should be a generic solution, > per-register granularity. > > Perhaps something along the lines of > > init > all registers marked as cached (read/write accessors don't > synchronize state) > > Around ioctl(KVM_RUN) have: > > if regs dirty > writeback > r = ioctl(KVM_RUN) > mark runstate registers as not cached > > Then > > read_reg(x) > if x not cached > arch_get_regs(RUNTIME_STATE) (*) > > write_reg(x, val) > read_reg(x) > cpustate->x = val; > mark_dirty(x) > > Which is basically the pattern used in KVM x86 (but instead of > ioctl(KVM_RUN) there is VMENTRY).
But that would mean that any code in QEMU that accesses registers can't access env-> ,but instead needs to go through an accessor function. That's a lot of potential for subtile error, no? I think for now the best choice for get_regs() would be to ignore the FULL/RESET bits and always keep the syncing as it happens today under the RUNTIME umbrella only. So all of get_regs() only checks for RUNTIME. Whenever get_xxx() happens, a bit gets set for set_xxx(). Up to this point, only the RUNTIME bit is ever set, because that's what cpu_synchronize_registers() sets. Then s390 can add special separate bits for "sync GPRs" and "sync CRs". SYNC_RUNTIME would include those bits. The kvm hypercall exit calls a new synchronize_registers() function with a parameter telling it to only sync GPRs. This marks GPRs dirty, but not RUNTIME. The set_registers() function in s390 specific code could handle this particular case specially. That way everything's solved and scalable, no? Alex