On 31 January 2018 at 19:12, Christoffer Dall <christoffer.d...@linaro.org> wrote:
> On Wed, Jan 31, 2018 at 7:00 PM, Ard Biesheuvel
> <ard.biesheu...@linaro.org> wrote:
>> On 31 January 2018 at 17:39, Christoffer Dall
>> <christoffer.d...@linaro.org> wrote:
>>> On Wed, Jan 31, 2018 at 5:59 PM, Ard Biesheuvel
>>> <ard.biesheu...@linaro.org> wrote:
>>>> On 31 January 2018 at 16:53, Christoffer Dall
>>>> <christoffer.d...@linaro.org> wrote:
>>>>> On Wed, Jan 31, 2018 at 4:18 PM, Ard Biesheuvel
>>>>> <ard.biesheu...@linaro.org> wrote:
>>>>>> On 31 January 2018 at 09:53, Christoffer Dall
>>>>>> <christoffer.d...@linaro.org> wrote:
>>>>>>> On Mon, Jan 29, 2018 at 10:32:12AM +0000, Marc Zyngier wrote:
>>>>>>>> On 29/01/18 10:04, Peter Maydell wrote:
>>>>>>>> > On 29 January 2018 at 09:53, Dr. David Alan Gilbert
>>>>>>>> > <dgilb...@redhat.com> wrote:
>>>>>>>> >> * Peter Maydell (peter.mayd...@linaro.org) wrote:
>>>>>>>> >>> On 26 January 2018 at 19:46, Dr. David Alan Gilbert
>>>>>>>> >>> <dgilb...@redhat.com> wrote:
>>>>>>>> >>>> * Peter Maydell (peter.mayd...@linaro.org) wrote:
>>>>>>>> >>>>> I think the correct fix here is that your test code should turn
>>>>>>>> >>>>> its MMU on. Trying to treat guest RAM as uncacheable doesn't work
>>>>>>>> >>>>> for Arm KVM guests (for the same reason that VGA device video memory
>>>>>>>> >>>>> doesn't work). If it's RAM your guest has to arrange to map it as
>>>>>>>> >>>>> Normal Cacheable, and then everything should work fine.
>>>>>>>> >>>>
>>>>>>>> >>>> Does this cause problems with migrating at just the wrong point
>>>>>>>> >>>> during a VM boot?
>>>>>>>> >>>
>>>>>>>> >>> It wouldn't surprise me if it did, but I don't think I've ever
>>>>>>>> >>> tried to provoke that problem...
>>>>>>>> >>
>>>>>>>> >> If you think it'll get the RAM contents wrong, it might be best to
>>>>>>>> >> fail the migration if you can detect the cache is disabled in the guest.
>>>>>>>> >
>>>>>>>> > I guess QEMU could look at the value of the "MMU disabled/enabled" bit
>>>>>>>> > in the guest's system registers, and refuse migration if it's off...
>>>>>>>> >
>>>>>>>> > (cc'd Marc, Christoffer to check that I don't have the wrong end
>>>>>>>> > of the stick about how thin the ice is in the period before the
>>>>>>>> > guest turns on its MMU...)
>>>>>>>>
>>>>>>>> Once MMU and caches are on, we should be in a reasonable place for QEMU
>>>>>>>> to have a consistent view of the memory. The trick is to prevent the
>>>>>>>> vcpus from changing that. A guest could perfectly turn off its MMU at
>>>>>>>> any given time if it needs to (and it is actually required on some HW
>>>>>>>> if you want to mitigate headlining CVEs), and KVM won't know about that.
>>>>>>>>
>>>>>>>
>>>>>>> (Clarification: KVM can detect this if it bothers to check the VCPU's
>>>>>>> system registers, but we don't trap to KVM when the VCPU turns off its
>>>>>>> caches, right?)
>>>>>>>
>>>>>>>> You may have to pause the vcpus before starting the migration, or
>>>>>>>> introduce a new KVM feature that would automatically pause a vcpu that
>>>>>>>> is trying to disable its MMU while the migration is on. This would
>>>>>>>> involve trapping all the virtual memory related system registers, with
>>>>>>>> an obvious cost. But that cost would be limited to the time it takes to
>>>>>>>> migrate the memory, so maybe that's acceptable.
>>>>>>>>
>>>>>>> Is that even sufficient?
>>>>>>>
>>>>>>> What if the following happened.
>>>>>>> (1) guest turns off MMU, (2) guest writes some data directly to ram,
>>>>>>> (3) qemu stops the vcpu, (4) qemu reads guest ram. QEMU's view of
>>>>>>> guest ram is now incorrect (stale, incoherent, ...).
>>>>>>>
>>>>>>> I'm also not really sure if pausing one VCPU because it turned off its
>>>>>>> MMU will go very well when trying to migrate a large VM (wouldn't this
>>>>>>> ask for all the other VCPUs beginning to complain that the stopped VCPU
>>>>>>> appears to be dead?). As a short-term 'fix' it's probably better to
>>>>>>> refuse migration if you detect that a VCPU had begun turning off its
>>>>>>> MMU.
>>>>>>>
>>>>>>> On the larger scale of things: this appears to me to be another case of
>>>>>>> us really needing some way to coherently access memory between QEMU and
>>>>>>> the VM, but in the case of the VCPU turning off the MMU prior to
>>>>>>> migration, we don't even know where it may have written data, and I'm
>>>>>>> therefore not really sure what the 'proper' solution would be.
>>>>>>>
>>>>>>> (cc'ing Ard who has thought about this problem before in the context
>>>>>>> of UEFI and VGA.)
>>>>>>>
>>>>>>
>>>>>> Actually, the VGA case is much simpler because the host is not
>>>>>> expected to write to the framebuffer, only read from it, and the guest
>>>>>> is not expected to create a cacheable mapping for it, so any
>>>>>> incoherency can be trivially solved by cache invalidation on the host
>>>>>> side. (Note that this has nothing to do with DMA coherency, but only
>>>>>> with PCI MMIO BARs that are backed by DRAM in the host)
>>>>>
>>>>> In the case of the running guest, the host will also only read from the
>>>>> cached mapping. Of course, at restore, the host will also write
>>>>> through a cached mapping, but shouldn't the latter case be solvable by
>>>>> having KVM clean the cache lines when faulting in any page?
>>>>>
>>>>
>>>> We are still talking about the contents of the framebuffer, right? In
>>>> that case, yes, afaict
>>>>
>>>
>>> I was talking about normal RAM actually... not sure if that changes
>>> anything?
>>>
>>
>> The main difference is that with a framebuffer BAR, it is pointless
>> for the guest to map it cacheable, given that the purpose of a
>> framebuffer is its side effects, which are not guaranteed to occur
>> timely if the mapping is cacheable.
>>
>> If we are talking about normal RAM, then why are we discussing it here
>> and not down there?
>>
>
> Because I was trying to figure out how the challenge of accessing the
> VGA framebuffer differs from the challenge of accessing guest RAM
> which may have been written by the guest with the MMU off.
>
> To a first approximation, they are extremely similar because the guest is
> writing uncached to memory, which the host now has to access via a
> cached mapping.
>
> But I'm guessing that a "clean+invalidate before read on the host"
> solution will result in terrible performance for a framebuffer and
> therefore isn't a good solution for that problem...
>
That highly depends on where 'not working' resides on the performance
scale. Currently, VGA on KVM simply does not work at all, and so working
but slow would be a huge improvement over the current situation. Also, the
performance hit is caused by the fact that the data needs to make a round
trip to memory, and the invalidation (without cleaning) performed by the
host shouldn't make that much worse than it fundamentally is to begin with.
A paravirtualized framebuffer (as was proposed recently by Gerd, I think?)
would solve this, since the guest can just map it as cacheable.

>>
>>
>>>>>>
>>>>>> In the migration case, it is much more complicated, and I think
>>>>>> capturing the state of the VM in a way that takes incoherency between
>>>>>> caches and main memory into account is simply infeasible (i.e., the
>>>>>> act of recording the state of guest RAM via a cached mapping may evict
>>>>>> clean cachelines that are out of sync, and so it is impossible to
>>>>>> record both the cached *and* the delta with the uncached state)
>>>>>
>>>>> This may be an incredibly stupid question (and I may have asked it
>>>>> before), but why can't we clean+invalidate the guest page before
>>>>> reading it and thereby obtain a coherent view of a page?
>>>>>
>>>>
>>>> Because cleaning from the host will clobber whatever the guest wrote
>>>> directly to memory with the MMU off, if there is a dirty cacheline
>>>> shadowing that memory.
>>>
>>> If the host never wrote anything to that memory (it shouldn't mess
>>> with the guest's memory) there will only be clean cache lines (even if
>>> they contain content shadowing the memory) and cleaning them would be
>>> equivalent to an invalidate. Am I misremembering how this works?
>>>
>>
>> Cleaning doesn't actually invalidate, but it should be a no-op for
>> clean cachelines.
>>
>>>> However, that same cacheline could be dirty
>>>> because the guest itself wrote to memory with the MMU on.
>>>
>>> Yes, but the guest would have no control over when such a cache line
>>> gets flushed to main memory by the hardware, and can have no
>>> reasonable expectation that the cache lines don't get cleaned behind
>>> its back. The fact that a migration triggers this is reasonable. A
>>> guest that wants hand-off from main memory that it's accessing with the
>>> MMU off must invalidate the appropriate cache lines or ensure they're
>>> clean. There's very likely some subtle aspect to all of this that I'm
>>> forgetting.
>>>
>>
>> OK, so if the only way cachelines covering guest memory could be dirty
>> is after the guest wrote to that memory itself via a cacheable
>> mapping, I guess it would be reasonable to do clean+invalidate before
>> reading the memory. Then, the only way for the guest to lose anything
>> is in cases where it could not reasonably expect it to be retained
>> anyway.
>
> Right, that's what I'm thinking.
>
>>
>> However, that does leave a window, between the invalidate and the
>> read, where the guest could modify memory without it being visible to
>> the host.
>
> Is that a problem specific to the coherency challenge? I thought this
> problem was already addressed by dirty page tracking, but there's likely
> some interaction with the cache maintenance that we'd have to figure
> out.
>

I don't know how dirty page tracking works exactly, but if it can track
direct writes to memory as easily as cached writes, it would probably
cover this as well.
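
For illustration, the "look at the MMU disabled/enabled bit in the guest's
system registers" idea above could be done from the VMM side with the
existing KVM_GET_ONE_REG interface. The sketch below is only an assumption
about how such a check might look, not QEMU's actual code; it presumes the
VCPU in question is already paused (reading registers of a running VCPU is
not reliable) and that vcpu_fd is its file descriptor:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: read SCTLR_EL1 and report whether the stage-1 MMU (M, bit 0)
   * and data cache (C, bit 2) are currently enabled for this VCPU.
   * ARM64_SYS_REG(3, 0, 1, 0, 0) is the sysreg encoding of SCTLR_EL1. */
  static int vcpu_mmu_and_dcache_enabled(int vcpu_fd)
  {
      uint64_t sctlr = 0;
      struct kvm_one_reg reg = {
          .id   = ARM64_SYS_REG(3, 0, 1, 0, 0),
          .addr = (uint64_t)&sctlr,
      };

      if (ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
          return -1;

      return (sctlr & (1ULL << 0)) && (sctlr & (1ULL << 2));
  }

A migration path could refuse to start (or pause) when this returns 0, but
as noted above that alone does not catch a guest that turns its MMU off
later, unless KVM also traps the relevant system register writes.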
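The "clean+invalidate before read on the host" step being discussed is just
data cache maintenance by VA over the range the host is about to read
through its cached mapping. As a sketch (assuming an AArch64 Linux host,
where the kernel sets SCTLR_EL1.UCI so DC CIVAC may be issued from
userspace), it might look like this:

  #include <stddef.h>
  #include <stdint.h>

  /* Sketch: clean+invalidate the data cache by VA over [buf, buf + len)
   * so that a subsequent read through this cached mapping observes what
   * is actually in DRAM, including data the guest wrote with its MMU off.
   * As discussed above, the "clean" half is only harmless if no dirty
   * lines can be shadowing memory the guest wrote uncached. */
  static void dcache_clean_invalidate(void *buf, size_t len)
  {
      uint64_t ctr;
      uintptr_t p, start, end;
      size_t line;

      /* CTR_EL0.DminLine (bits [19:16]) is log2 of the smallest D-cache
       * line size, expressed in 4-byte words. */
      asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
      line = 4UL << ((ctr >> 16) & 0xf);

      start = (uintptr_t)buf & ~(uintptr_t)(line - 1);
      end = (uintptr_t)buf + len;
      for (p = start; p < end; p += line)
          asm volatile("dc civac, %0" : : "r"(p) : "memory");

      asm volatile("dsb ish" : : : "memory");
  }

Doing this per page on the migration read path is the performance cost the
thread is weighing; for the VGA case, the analogous operation is the
host-side invalidation before each framebuffer read mentioned above.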
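On the dirty page tracking question at the end: KVM tracks dirty pages by
write-protecting the stage-2 mappings of any memslot that has
KVM_MEM_LOG_DIRTY_PAGES set, and stage-2 permissions apply to guest stores
whether or not the guest's own MMU and caches are on, so direct (uncached)
writes should be tracked just like cached ones. The VMM then pulls a
per-slot bitmap, roughly as below (the slot number and size are
placeholders the VMM would already know from its memslot bookkeeping):

  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: fetch the dirty bitmap for one memory slot; one bit per page
   * of the slot, set if the guest wrote that page since the last call. */
  static void *get_dirty_bitmap(int vm_fd, uint32_t slot, size_t npages)
  {
      size_t bitmap_bytes = ((npages + 63) / 64) * 8;
      void *bitmap = calloc(1, bitmap_bytes);
      struct kvm_dirty_log log = {
          .slot = slot,
          .dirty_bitmap = bitmap,
      };

      if (bitmap == NULL || ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
          free(bitmap);
          return NULL;
      }
      return bitmap;
  }

What dirty tracking does not solve by itself is the coherency problem: it
says which pages changed, not whether the host's cached view of those pages
matches DRAM, which is why the cache maintenance discussed above would
still be needed.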