On 1 February 2018 at 09:17, Christoffer Dall <christoffer.d...@linaro.org> wrote: > On Wed, Jan 31, 2018 at 9:15 PM, Ard Biesheuvel > <ard.biesheu...@linaro.org> wrote: >> On 31 January 2018 at 19:12, Christoffer Dall >> <christoffer.d...@linaro.org> wrote: >>> On Wed, Jan 31, 2018 at 7:00 PM, Ard Biesheuvel >>> <ard.biesheu...@linaro.org> wrote: >>>> On 31 January 2018 at 17:39, Christoffer Dall >>>> <christoffer.d...@linaro.org> wrote: >>>>> On Wed, Jan 31, 2018 at 5:59 PM, Ard Biesheuvel >>>>> <ard.biesheu...@linaro.org> wrote: >>>>>> On 31 January 2018 at 16:53, Christoffer Dall >>>>>> <christoffer.d...@linaro.org> wrote: >>>>>>> On Wed, Jan 31, 2018 at 4:18 PM, Ard Biesheuvel >>>>>>> <ard.biesheu...@linaro.org> wrote: >>>>>>>> On 31 January 2018 at 09:53, Christoffer Dall >>>>>>>> <christoffer.d...@linaro.org> wrote: >>>>>>>>> On Mon, Jan 29, 2018 at 10:32:12AM +0000, Marc Zyngier wrote: >>>>>>>>>> On 29/01/18 10:04, Peter Maydell wrote: >>>>>>>>>> > On 29 January 2018 at 09:53, Dr. David Alan Gilbert >>>>>>>>>> > <dgilb...@redhat.com> wrote: >>>>>>>>>> >> * Peter Maydell (peter.mayd...@linaro.org) wrote: >>>>>>>>>> >>> On 26 January 2018 at 19:46, Dr. David Alan Gilbert >>>>>>>>>> >>> <dgilb...@redhat.com> wrote: >>>>>>>>>> >>>> * Peter Maydell (peter.mayd...@linaro.org) wrote: >>>>>>>>>> >>>>> I think the correct fix here is that your test code should turn >>>>>>>>>> >>>>> its MMU on. Trying to treat guest RAM as uncacheable doesn't >>>>>>>>>> >>>>> work >>>>>>>>>> >>>>> for Arm KVM guests (for the same reason that VGA device video >>>>>>>>>> >>>>> memory >>>>>>>>>> >>>>> doesn't work). If it's RAM your guest has to arrange to map it >>>>>>>>>> >>>>> as >>>>>>>>>> >>>>> Normal Cacheable, and then everything should work fine. >>>>>>>>>> >>>> >>>>>>>>>> >>>> Does this cause problems with migrating at just the wrong point >>>>>>>>>> >>>> during >>>>>>>>>> >>>> a VM boot? >>>>>>>>>> >>> >>>>>>>>>> >>> It wouldn't surprise me if it did, but I don't think I've ever >>>>>>>>>> >>> tried to provoke that problem... >>>>>>>>>> >> >>>>>>>>>> >> If you think it'll get the RAM contents wrong, it might be best >>>>>>>>>> >> to fail >>>>>>>>>> >> the migration if you can detect the cache is disabled in the >>>>>>>>>> >> guest. >>>>>>>>>> > >>>>>>>>>> > I guess QEMU could look at the value of the "MMU disabled/enabled" >>>>>>>>>> > bit >>>>>>>>>> > in the guest's system registers, and refuse migration if it's >>>>>>>>>> > off... >>>>>>>>>> > >>>>>>>>>> > (cc'd Marc, Christoffer to check that I don't have the wrong end >>>>>>>>>> > of the stick about how thin the ice is in the period before the >>>>>>>>>> > guest turns on its MMU...) >>>>>>>>>> >>>>>>>>>> Once MMU and caches are on, we should be in a reasonable place for >>>>>>>>>> QEMU >>>>>>>>>> to have a consistent view of the memory. The trick is to prevent the >>>>>>>>>> vcpus from changing that. A guest could perfectly turn off its MMU at >>>>>>>>>> any given time if it needs to (and it is actually required on some >>>>>>>>>> HW if >>>>>>>>>> you want to mitigate headlining CVEs), and KVM won't know about that. >>>>>>>>>> >>>>>>>>> >>>>>>>>> (Clarification: KVM can detect this is it bother to check the VCPU's >>>>>>>>> system registers, but we don't trap to KVM when the VCPU turns off its >>>>>>>>> caches, right?) >>>>>>>>> >>>>>>>>>> You may have to pause the vcpus before starting the migration, or >>>>>>>>>> introduce a new KVM feature that would automatically pause a vcpu >>>>>>>>>> that >>>>>>>>>> is trying to disable its MMU while the migration is on. This would >>>>>>>>>> involve trapping all the virtual memory related system registers, >>>>>>>>>> with >>>>>>>>>> an obvious cost. But that cost would be limited to the time it takes >>>>>>>>>> to >>>>>>>>>> migrate the memory, so maybe that's acceptable. >>>>>>>>>> >>>>>>>>> Is that even sufficient? >>>>>>>>> >>>>>>>>> What if the following happened. (1) guest turns off MMU, (2) guest >>>>>>>>> writes some data directly to ram (3) qemu stops the vcpu (4) qemu >>>>>>>>> reads >>>>>>>>> guest ram. QEMU's view of guest ram is now incorrect (stale, >>>>>>>>> incoherent, ...). >>>>>>>>> >>>>>>>>> I'm also not really sure if pausing one VCPU because it turned off its >>>>>>>>> MMU will go very well when trying to migrate a large VM (wouldn't this >>>>>>>>> ask for all the other VCPUs beginning to complain that the stopped >>>>>>>>> VCPU >>>>>>>>> appears to be dead?). As a short-term 'fix' it's probably better to >>>>>>>>> refuse migration if you detect that a VCPU had begun turning off its >>>>>>>>> MMU. >>>>>>>>> >>>>>>>>> On the larger scale of thins; this appears to me to be another case of >>>>>>>>> us really needing some way to coherently access memory between QEMU >>>>>>>>> and >>>>>>>>> the VM, but in the case of the VCPU turning off the MMU prior to >>>>>>>>> migration, we don't even know where it may have written data, and I'm >>>>>>>>> therefore not really sure what the 'proper' solution would be. >>>>>>>>> >>>>>>>>> (cc'ing Ard who has has thought about this problem before in the >>>>>>>>> context >>>>>>>>> of UEFI and VGA.) >>>>>>>>> >>>>>>>> >>>>>>>> Actually, the VGA case is much simpler because the host is not >>>>>>>> expected to write to the framebuffer, only read from it, and the guest >>>>>>>> is not expected to create a cacheable mapping for it, so any >>>>>>>> incoherency can be trivially solved by cache invalidation on the host >>>>>>>> side. (Note that this has nothing to do with DMA coherency, but only >>>>>>>> with PCI MMIO BARs that are backed by DRAM in the host) >>>>>>> >>>>>>> In case of the running guest, the host will also only read from the >>>>>>> cached mapping. Of course, at restore, the host will also write >>>>>>> through a cached mapping, but shouldn't the latter case be solvable by >>>>>>> having KVM clean the cache lines when faulting in any page? >>>>>>> >>>>>> >>>>>> We are still talking about the contents of the framebuffer, right? In >>>>>> that case, yes, afaict >>>>>> >>>>> >>>>> I was talking about normal RAM actually... not sure if that changes >>>>> anything? >>>>> >>>> >>>> The main difference is that with a framebuffer BAR, it is pointless >>>> for the guest to map it cacheable, given that the purpose of a >>>> framebuffer is its side effects, which are not guaranteed to occur >>>> timely if the mapping is cacheable. >>>> >>>> If we are talking about normal RAM, then why are we discussing it here >>>> and not down there? >>>> >>> >>> Because I was trying to figure out how the challenge of accessing the >>> VGA framebuffer differs from the challenge of accessing guest RAM >>> which may have been written by the guest with the MMU off. >>> >>> First approximation, they are extremely similar because the guest is >>> writing uncached to memory, which the host now has to access via a >>> cached mapping. >>> >>> But I'm guessing that a "clean+invalidate before read on the host" >>> solution will result in terrible performance for a framebuffer and >>> therefore isn't a good solution for that problem... >>> >> >> That highly depends on where 'not working' resides on the performance >> scale. Currently, VGA on KVM simply does not work at all, and so >> working but slow would be a huge improvement over the current >> situation. >> >> Also, the performance hit is caused by the fact that the data needs to >> make a round trip to memory, and the invalidation (without cleaning) >> performed by the host shouldn't make that much worse than it >> fundamentally is to begin with. >> > > True, but Marc pointed out (on IRC) that the cache maintenance > instructions are broadcast across all CPUs and may therefore adversely > affect the performance of your entire system by running an emulated > VGA framebuffer on a subset of the host CPUs. However, he also > pointed out that the necessary cache maintenance can be done in EL0 > and we wouldn't even need a new ioctl or anything else from KVM or the > kernel to do this. I think we should measure the effect of this > before dismissing it though. >
If you could use dirty page tracking to only perform the cache invalidation when the framebuffer memory has been updated, you can at least limit the impact to cases where the framebuffer is actually used, rather than sitting idle with a nice wallpaper image. >> A paravirtualized framebuffer (as was proposed recently by Gerd I >> think?) would solve this, since the guest can just map it as >> cacheable. >> > > Yes, but it seems to me that the problem of incoherent views of memory > between the guest and QEMU keeps coming up, and we therefore need to > have a fix for this in software. > Agreed. >>>> >>>> >>>>>>>> >>>>>>>> In the migration case, it is much more complicated, and I think >>>>>>>> capturing the state of the VM in a way that takes incoherency between >>>>>>>> caches and main memory into account is simply infeasible (i.e., the >>>>>>>> act of recording the state of guest RAM via a cached mapping may evict >>>>>>>> clean cachelines that are out of sync, and so it is impossible to >>>>>>>> record both the cached *and* the delta with the uncached state) >>>>>>> >>>>>>> This may be an incredibly stupid question (and I may have asked it >>>>>>> before), but why can't we clean+invalidate the guest page before >>>>>>> reading it and thereby obtain a coherent view of a page? >>>>>>> >>>>>> >>>>>> Because cleaning from the host will clobber whatever the guest wrote >>>>>> directly to memory with the MMU off, if there is a dirty cacheline >>>>>> shadowing that memory. >>>>> >>>>> If the host never wrote anything to that memory (it shouldn't mess >>>>> with the guest's memory) there will only be clean cache lines (even if >>>>> they contain content shadowing the memory) and cleaning them would be >>>>> equivalent to an invalidate. Am I misremembering how this works? >>>>> >>>> >>>> Cleaning doesn't actually invalidate, but it should be a no-op for >>>> clean cachelines. >>>> >>>>>> However, that same cacheline could be dirty >>>>>> because the guest itself wrote to memory with the MMU on. >>>>> >>>>> Yes, but the guest would have no control over when such a cache line >>>>> gets flushed to main memory by the hardware, and can have no >>>>> reasonable expectation that the cache lines don't get cleaned behind >>>>> its back. The fact that a migration triggers this, is reasonable. A >>>>> guest that wants hand-off from main memory that its accessing with the >>>>> MMU off, must invalidate the appropriate cache lines or ensure they're >>>>> clean. There's very likely some subtle aspect to all of this that I'm >>>>> forgetting. >>>>> >>>> >>>> OK, so if the only way cachelines covering guest memory could be dirty >>>> is after the guest wrote to that memory itself via a cacheable >>>> mapping, I guess it would be reasonable to do clean+invalidate before >>>> reading the memory. Then, the only way for the guest to lose anything >>>> is in cases where it could not reasonably expect it to be retained >>>> anyway. >>> >>> Right, that's what I'm thinking. >>> >>>> >>>> However, that does leave a window, between the invalidate and the >>>> read, where the guest could modify memory without it being visible to >>>> the host. >>> >>> Is that a problem specific to the coherency challenge? I thought this >>> problem was already addressed by dirty page tracking, but there's like >>> some interaction with the cache maintenance that we'd have to figure >>> out. >>> >> >> I don't know how dirty page tracking works exactly, but if it that can >> track direct writes to memory as easily as cached writes, it would >> probably cover this as well. > > I don't see why it shouldn't. Guest pages get mapped as read-only in > stage 2, and their memory attributes shouldn't be able to affect > permission checks (one can only hope). > > Unless I'm missing something fundamental, I think we should add > functionality in QEMU to clean+invalidate pages on aarch64 hosts and > see if that solves both VGA and migration, and if so, what the > potential performance impacts are. > Indeed. Identifying the framebuffer region is easy for QEMU, so there we can invalidate (without clean) selectively. However, although we'd probably need to capture the state of the MMU bit anyway when handling the stage 2 fault for the dirty page tracking, I don't think we can generally infer whether a faulting access was made via an uncached mapping while the MMU was on, and so this would require clean+invalidate on all dirty pages when doing a migration.