Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit
On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:

> This new KVM exit allows userspace to handle memory-related errors. It indicates an error happens in KVM at guest memory range [gpa, gpa+size). The flags includes additional information for userspace to handle the error. Currently bit 0 is defined as 'private memory' where '1' indicates error happens due to private memory access and '0' indicates error happens due to shared memory access.
>
> When private memory is enabled, this new exit will be used for KVM to exit to userspace for shared <-> private memory conversion in memory encryption usage. In such usage, typically there are two kind of memory conversions:
>   - explicit conversion: happens when guest explicitly calls into KVM to map a range (as private or shared), KVM then exits to userspace to perform the map/unmap operations.
>   - implicit conversion: happens in KVM page fault handler where KVM exits to userspace for an implicit conversion when the page is in a different state than requested (private or shared).
>
> Suggested-by: Sean Christopherson
> Co-developed-by: Yu Zhang
> Signed-off-by: Yu Zhang
> Signed-off-by: Chao Peng
> ---
>  Documentation/virt/kvm/api.rst | 23 +++
>  include/uapi/linux/kvm.h       |  9 +
>  2 files changed, 32 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f3fa75649a78..975688912b8c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
> +			__u32 flags;
> +			__u32 padding;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +

Would it make sense to also have a field for the access type (read, write, execute, etc)? I realize that shared <-> private conversion doesn't strictly need this, but it seems like it could be useful for logging failures and also for avoiding a second immediate fault if the type gets converted but doesn't have the right protection yet.

(Obviously, if this were changed, KVM would need the ability to report that it doesn't actually know the mode.)

--Andy
Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
(please excuse any formatting disasters. my internet went out as I was composing this, and i did my best to rescue it.)

On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:

> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
>> On 15.09.22 16:29, Chao Peng wrote:
>>> From: "Kirill A. Shutemov"
>>>
>>> KVM can use memfd-provided memory for guest memory. For normal userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its virtual address space and then tells KVM to use the virtual address to setup the mapping in the secondary page table (e.g. EPT).
>>>
>>> With confidential computing technologies like Intel TDX, the memfd-provided memory may be encrypted with special key for special software domain (e.g. KVM guest) and is not expected to be directly accessed by userspace. Precisely, userspace access to such encrypted memory may lead to host crash so it should be prevented.
>>
>> Initially my thought was that this whole inaccessible thing is TDX specific and there is no need to force that on other mechanisms. That's why I suggested to not expose this to user space but handle the notifier requirements internally.
>>
>> IIUC now, protected KVM has similar demands. Either access (read/write) of guest RAM would result in a fault and possibly crash the hypervisor (at least not the whole machine IIUC).
>
> Yep. The missing piece for pKVM is the ability to convert from shared to private while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB) for processing in the protected VM. Thoughts on this at the bottom.
>
>>> This patch introduces userspace inaccessible memfd (created with MFD_INACCESSIBLE). Its memory is inaccessible from userspace through ordinary MMU access (e.g. read/write/mmap) but can be accessed via in-kernel interface so KVM can directly interact with core-mm without the need to map the memory into KVM userspace.
>>
>> With secretmem we decided to not add such "concept switch" flags and instead use a dedicated syscall.
>
> I have no personal preference whatsoever between a flag and a dedicated syscall, but a dedicated syscall does seem like it would give the kernel a bit more flexibility.

The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem or similar. But if we need flags or other details in the future, maybe this isn't ideal.

>> What about memfd_inaccessible()? Especially, sealing and hugetlb are not even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory needs to come sooner than later. "inaccessible" in quotes because we might want to choose a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done by allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly control usage without taking on too much complexity in the kernel, but working through things, routing the behavior through the shim itself might not be all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" fd because things got too complex, but with the shim it doesn't seem _that_ bad.

What's the exact use case? Is it just to pre-populate the memory?

> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e. mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in one shot.
>
>   5. Require that there are no outstanding references at munmap(). Or if this can't be guaranteed by userspace, maybe add some way for userspace to wait until it's ok to convert to private? E.g. so that get_pfn() doesn't need to do an expensive check every time.

Hmm. I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure. When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning). Then have an ioctl to switch to private mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.

As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Sep 9, 2022, at 7:32 AM, Kirill A. Shutemov wrote:

> On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
>> On 8/19/22 17:27, Kirill A. Shutemov wrote:
>>> On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>>>> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>>>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>>>
>>>>>> If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
>>>>>>
>>>>>> If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
>>>>>
>>>>> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now.
>>>>>
>>>>> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>>>>
>>>> I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
>>>>
>>>> But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
>>>
>>> TDX 1.5 brings both.
>>>
>>> In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.
>>
>> This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
>
> Hm. From what I see we have all necessary infrastructure in place.
>
> Unmaping is NOP for inaccessible pages as it is never mapped and we have mapping->a_ops->migrate_folio() callback that allows to replace software copying with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
>
> What do I miss?

Hmm, maybe this isn't as bad as I thought.

Right now, unless I've missed something, the migration workflow is to unmap (via try_to_migrate) all mappings, then migrate the backing store (with ->migrate_folio(), although it seems like most callers expect the actual copy to happen outside of ->migrate_folio()), and then make new mappings. With the *current* (vma-based, not fd-based) model for KVM memory, this won't work -- we can't unmap before calling TDH.MEM.PAGE.RELOCATE.

But maybe it's actually okay with some care or maybe mild modifications with the fd-based model. We don't have any mmaps, per se, to unmap for secret / INACCESSIBLE memory. So maybe we can get all the way to ->migrate_folio() without zapping anything in the secure EPT and just call TDH.MEM.PAGE.RELOCATE from inside migrate_folio(). And there will be nothing to fault back in. From the core code's perspective, it's like migrating a memfd that doesn't happen to have any mappings at the time.

--Andy
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/24/22 02:41, Chao Peng wrote:
> On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
>> On Tue, Aug 23, 2022, David Hildenbrand wrote:
>>> On 19.08.22 05:38, Hugh Dickins wrote:
>>>> On Fri, 19 Aug 2022, Sean Christopherson wrote:
>>>>> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
>>>>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>>>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>>>>>>
>>>>>>> But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs. Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>>>>
>>>>> But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
>>>>
>>>> Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory.
>>>
>>> As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifer (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap).
>>>
>>> So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
>>
>> I agree that there's no need to expose the inaccessible behavior via uAPI. Making it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other flags that get added in the future).
>>
>> AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide any unique functionality.
>
> That's also what I'm thinking. And I don't see problem immediately if user has populated the fd at the binding time. Actually that looks an advantage for previously discussed guest payload pre-loading.

I think this gets awkward. Trying to define sensible semantics for what happens if a shmem or similar fd gets used as secret guest memory and that fd isn't utterly and completely empty can get quite nasty. For example:

If there are already mmaps, then TDX (much more so than SEV) really doesn't want to also use it as guest memory.

If there is already data in the fd, then maybe some technologies can use this for pre-population, but TDX needs explicit instructions in order to get the guest's hash right.

In general, it seems like it will be much more likely to actually work well if the user (uAPI) is required to declare to the kernel exactly what the fd is for (e.g. TDX secret memory, software-only secret memory, etc) before doing anything at all with it other than binding it to KVM. INACCESSIBLE is a way to achieve this. Maybe it's not the prettiest in the world -- I personally would rather see an explicit request for, say, TDX or SEV memory or maybe the memory that works for a particular KVM instance instead of something generic like INACCESSIBLE, but this is a pretty weak preference. But I think that just starting with a plain memfd is a can of worms.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/19/22 17:27, Kirill A. Shutemov wrote:
> On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
>> On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
>>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>>>>
>>>> If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
>>>>
>>>> If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
>>>
>>> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now.
>>>
>>> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
>>
>> I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
>>
>> But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
>
> TDX 1.5 brings both.
>
> In TDX speak, mm migration called relocation. See TDH.MEM.PAGE.RELOCATE.

This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On 8/18/22 06:24, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
>> On Wed, 6 Jul 2022, Chao Peng wrote:
>>>
>>> This is the v7 of this series which tries to implement the fd-based KVM guest private memory.
>>
>> Here at last are my reluctant thoughts on this patchset.
>>
>> fd-based approach for supporting KVM guest private memory: fine.
>>
>> Use or abuse of memfd and shmem.c: mistaken.
>>
>> memfd_create() was an excellent way to put together the initial prototype. But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
>>
>> Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
>>
>> What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere). You don't need shmem.c or a filesystem for that!
>>
>> If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
>>
>> If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
>
> Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now.
>
> [1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html

This thing? https://cdrdv2.intel.com/v1/dl/getContent/733578

That looks like migration between computers, not between NUMA nodes. Or am I missing something?
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On 7/21/22 14:19, Sean Christopherson wrote:
> On Thu, Jul 21, 2022, Gupta, Pankaj wrote:
>>> I view it as a performance problem because nothing stops KVM from copying from userspace into the private fd during the SEV ioctl(). What's missing is the ability for userspace to directly initialize the private fd, which may or may not avoid an extra memcpy() depending on how clever userspace is.
>>
>> Can you please elaborate more what you see as a performance problem? And possible ways to solve it?
>
> Oh, I'm not saying there actually _is_ a performance problem. What I'm saying is that in-place encryption is not a functional requirement, which means it's purely an optimization, and thus we should only bother supporting in-place encryption _if_ it would solve a performance bottleneck.

Even if we end up having a performance problem, I think we need to understand the workloads that we want to optimize before getting too excited about designing a speedup.

In particular, there's (depending on the specific technology, perhaps, and also architecture) a possible tradeoff between trying to reduce copying and trying to reduce unmapping and the associated flushes. If a user program maps an fd, populates it, and then converts it in place into private memory (especially if it doesn't do it in a single shot), then that memory needs to get unmapped both from the user mm and probably from the kernel direct map. On the flip side, it's possible to imagine an ioctl that does copy-and-add-to-private-fd that uses a private mm and doesn't need any TLB IPIs.

All of this is to say that trying to optimize right now seems quite premature to me.
Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote:
>>>> This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
>>>>
>>>>   b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
>>>>
>>>> Introduction
>>>>
>>>> In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
>>>
>>> Thinking a bit, As host side fd on tmpfs or shmem will store memory on host page cache instead of mapping pages into userspace address space. Can we hit double (un-coordinated) page cache problem with this when guest page cache is also used?
>>
>> This is my understanding: in host it will be indeed in page cache (in current shmem implementation) but that's just the way it allocates and provides the physical memory for the guest. In guest, guest OS will not see this fd (absolutely), it only sees guest memory, on top of which it can build its own page cache system for its own file-mapped content but that is unrelated to host page cache.
>
> yes. If guest fills its page cache with file backed memory, this at host side (on shmem fd backend) will also fill the host page cache fast. This can have an impact on performance of guest VM's if host goes to memory pressure situation sooner. Or else we end up utilizing way less System RAM.

Is this in any meaningful way different from a regular VM?

--Andy
Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson wrote:
>
> On Tue, Jun 14, 2022, Andy Lutomirski wrote:
>> On Tue, Jun 14, 2022 at 12:32 AM Chao Peng wrote:
>>>
>>> On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
>>>> On Wed, Jun 08, 2022, Vishal Annapurve wrote:
>>>>
>>>> One argument is that userspace can simply rely on cgroups to detect misbehaving guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM kill from the host is typically considered a _host_ issue and will be treated as a missed SLO.
>>>>
>>>> An idea for handling this in the kernel without too much complexity would be to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from allocating pages, i.e. holes can only be filled by an explicit fallocate(). Minor faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would still work, but writes to previously unreserved/unallocated memory would get a SIGSEGV on something it has mapped. That would allow the userspace VMM to prevent unintentional allocations without having to coordinate unmapping/remapping across multiple processes.
>>>
>>> Since this is mainly for shared memory and the motivation is catching misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark those range backed by private fd as PROT_NONE during the conversion so subsequence misbehaved accesses will be blocked instead of causing double allocation silently.
>
> PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().
>
>> This patch series is fairly close to implementing a rather more efficient solution. I'm not familiar enough with hypervisor userspace to really know if this would work, but:
>>
>> What if shared guest memory could also be file-backed, either in the same fd or with a second fd covering the shared portion of a memslot? This would allow changes to the backing store (punching holes, etc) to be done without mmap_lock or host-userspace TLB flushes? Depending on what the guest is doing with its shared memory, userspace might need the memory mapped or it might not.
>
> That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea. The issue, unless I'm misreading code, is that punching a hole in the shared memory backing store doesn't prevent reallocating that hole on fault, i.e. a helper process that keeps a valid mapping of guest shared memory can silently fill the hole.
>
> What we're hoping to achieve is a way to prevent allocating memory without a very explicit action from userspace, e.g. fallocate().

Ah, I misunderstood. I thought your goal was to mmap it and prevent page faults from allocating.

It is indeed the case (and has been since before quite a few of us were born) that a hole in a sparse file is logically just a bunch of zeros. A way to make a file for which a hole is an actual hole seems like it would solve this problem nicely.

It could also be solved more specifically for KVM by making sure that the private/shared mode that userspace programs is strict enough to prevent accidental allocations -- if a GPA is definitively private, shared, neither, or (potentially, on TDX only) both, then a page that *isn't* shared will never be accidentally allocated by KVM. If the shared backing is not mmapped, it also won't be accidentally allocated by host userspace on a stray or careless write.

--Andy
Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Jun 14, 2022 at 12:32 AM Chao Peng wrote:
>
> On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
>> On Wed, Jun 08, 2022, Vishal Annapurve wrote:
>>
>> One argument is that userspace can simply rely on cgroups to detect misbehaving guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM kill from the host is typically considered a _host_ issue and will be treated as a missed SLO.
>>
>> An idea for handling this in the kernel without too much complexity would be to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from allocating pages, i.e. holes can only be filled by an explicit fallocate(). Minor faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would still work, but writes to previously unreserved/unallocated memory would get a SIGSEGV on something it has mapped. That would allow the userspace VMM to prevent unintentional allocations without having to coordinate unmapping/remapping across multiple processes.
>
> Since this is mainly for shared memory and the motivation is catching misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark those range backed by private fd as PROT_NONE during the conversion so subsequence misbehaved accesses will be blocked instead of causing double allocation silently.

This patch series is fairly close to implementing a rather more efficient solution. I'm not familiar enough with hypervisor userspace to really know if this would work, but:

What if shared guest memory could also be file-backed, either in the same fd or with a second fd covering the shared portion of a memslot? This would allow changes to the backing store (punching holes, etc) to be done without mmap_lock or host-userspace TLB flushes? Depending on what the guest is doing with its shared memory, userspace might need the memory mapped or it might not.

--Andy
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 25, 2022 at 1:31 PM Sean Christopherson wrote:
>
> On Mon, Apr 25, 2022, Andy Lutomirski wrote:
>>
>> On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
>>> On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
>>>>
>>>> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's in the initial state appropriate for that VM.
>>>>
>>>> For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.
>
> I believe TDX can handle this cleanly, TDH.MEM.PAGE.ADD doesn't require that the source and destination have different HPAs. There's just no pressing need to support such behavior because userspace is highly motivated to keep the initial image small for performance reasons, i.e. burning a few extra pages while building the guest is a non-issue.

Following up on this, rather belatedly. After re-reading the docs, TDX can populate guest memory using TDH.MEM.PAGE.ADD, but see Intel® TDX Module Base Spec v1.5, section 2.3, step D.4 substeps 1 and 2 here:

https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1.5-base-spec-348549001.pdf

  For each TD page:

  1. The host VMM specifies a TDR as a parameter and calls the TDH.MEM.PAGE.ADD function. It copies the contents from the TD image page into the target TD page which is encrypted with the TD ephemeral key. TDH.MEM.PAGE.ADD also extends the TD measurement with the page GPA.

  2. The host VMM extends the TD measurement with the contents of the new page by calling the TDH.MR.EXTEND function on each 256-byte chunk of the new TD page.

So this is a bit like SGX. There is a specific series of operations that have to be done in precisely the right order to reproduce the intended TD measurement. Otherwise the guest will boot and run until it tries to get a report and then it will have a hard time getting anyone to believe its report.

So I don't think the host kernel can get away with host userspace just providing pre-populated memory. Userspace needs to tell the host kernel exactly what sequence of adds, extends, etc to perform and in what order, and the host kernel needs to do precisely what userspace asks it to do. "Here's the contents of memory" doesn't cut it unless the tooling that builds the guest image matches the exact semantics that the host kernel provides.

--Andy
Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
On Fri, May 20, 2022, at 11:31 AM, Sean Christopherson wrote:
> But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement and wouldn't necessarily even need to interact with the memslots. It could be a consumer of memslots, e.g. if we wanted to disallow registering regions without an associated memslot, but I think we'd want to avoid even that because things will get messy during memslot updates, e.g. if dirty logging is toggled or a shared memory region is temporarily removed then we wouldn't want to destroy the tracking.
>
> I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray should be far more efficient.
>
> One benefit to explicitly tracking this in KVM is that it might be useful for software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending" based on guest hypercalls to share/unshare memory, and then complete the transaction when userspace invokes the ioctl() to complete the share/unshare.

That makes sense.

If KVM goes this route, perhaps the allowed states for a GPA should include private, shared, and also private-and-shared. Then anyone who wanted to use the same masked GPA for shared and private on TDX could do so if they wanted to.
Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
On 5/19/22 08:37, Chao Peng wrote: Extend the memslot definition to provide guest private memory through a file descriptor(fd) instead of userspace_addr(hva). Such guest private memory(fd) may never be mapped into userspace so no userspace_addr(hva) can be used. Instead add another two new fields (private_fd/private_offset), plus the existing memory_size to represent the private memory range. Such memslot can still have the existing userspace_addr(hva). When use, a single memslot can maintain both private memory through private fd(private_fd/private_offset) and shared memory through hva(userspace_addr). A GPA is considered private by KVM if the memslot has private fd and that corresponding page in the private fd is populated, otherwise, it's shared. So this is a strange API and, IMO, a layering violation. I want to make sure that we're all actually on board with making this a permanent part of the Linux API. Specifically, we end up with a multiplexing situation as you have described. For a given GPA, there are *two* possible host backings: an fd-backed one (from the fd, which is private for now might might end up potentially shared depending on future extensions) and a VMA-backed one. The selection of which one backs the address is made internally by whatever backs the fd. This, IMO, a clear layering violation. Normally, an fd has an associated address space, and pages in that address space can have contents, can be holes that appear to contain all zeros, or could have holes that are inaccessible. If you try to access a hole, you get whatever is in the hole. But now, with this patchset, the fd is more of an overlay and you get *something else* if you try to access through the hole. This results in operations on the fd bubbling up to the KVM mapping in what is, IMO, a strange way. If the user punches a hole, KVM has to modify its mappings such that the GPA goes to whatever VMA may be there. 
(And update the RMP, the hypervisor's tables, or whatever else might actually control privateness.) Conversely, if the user does fallocate to fill a hole, the guest mapping *to an unrelated page* has to be zapped so that the fd's page shows up. And the RMP needs updating, etc.

I am lukewarm on this for a few reasons.

1. This is weird. AFAIK nothing else works like this. Obviously this is subjective, but "weird" and "layering violation" sometimes translate to "problematic locking".

2. fd-backed private memory can't have normal holes. If I make a memfd, punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that reads as zero. If I write to it, the page gets allocated. But with this new mechanism, if I punch a hole and put it in a memslot, reads and writes go somewhere else. So what if I actually wanted lazily allocated private zeros?

2b. For a hypothetical future extension in which an fd can also have shared pages (for conversion, for example, or simply because the fd backing might actually be more efficient than indirecting through VMAs and therefore get used for shared memory or entirely-non-confidential VMs), lazy fd-backed zeros sound genuinely useful.

3. TDX hardware capability is not fully exposed. TDX can have a private page and a shared page at GPAs that differ only by the private bit. Sure, no one plans to use this today, but baking this into the user ABI throws away half the potential address space.

3b. Any software solution that works like TDX (which IMO is an eminently reasonable design) has the same issue.

The alternative would be to have some kind of separate table or bitmap (part of the memslot?) that tells KVM whether a GPA should map to the fd.

What do you all think?
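To make the objection concrete, here is a rough sketch (in C, with invented names) of the extended memslot from the quoted patch and of the multiplexing rule Andy is pushing back on; the real series may lay the struct out differently, and `resolve_backing` is purely illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rough shape of the extended memslot described in the quoted commit
 * message (private_fd/private_offset alongside userspace_addr); field
 * names and layout are an approximation, not the actual uAPI. */
struct kvm_userspace_memory_region_ext {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;   /* hva: backs shared accesses */
    uint32_t private_fd;       /* fd: backs private accesses */
    uint32_t pad;
    uint64_t private_offset;
};

/* The multiplexing rule under discussion: a GPA is private iff the fd
 * has a populated (non-hole) page at the corresponding offset, and
 * otherwise falls through to the VMA at userspace_addr.  This is the
 * "overlay" behavior Andy calls a layering violation. */
enum backing { BACKING_PRIVATE_FD, BACKING_SHARED_VMA };

static enum backing resolve_backing(bool fd_page_populated)
{
    return fd_page_populated ? BACKING_PRIVATE_FD : BACKING_SHARED_VMA;
}
```

Punching a hole in the fd thus flips a GPA from BACKING_PRIVATE_FD to BACKING_SHARED_VMA, which is exactly the "operations on the fd bubbling up to the KVM mapping" effect described above.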
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
>>
>> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's in
>> the initial state appropriate for that VM.
>>
>> For TDX, this completely bypasses the cases where the data is prepopulated
>> and TDX can't handle it cleanly. For SEV, it bypasses a situation in which
>> data might be written to the memory before we find out whether that data
>> will be unreclaimable or unmovable.
>
> This sounds like a stricter rule to avoid unclear semantics.
>
> So userspace needs to know what exactly happens for a 'bind' operation.
> This is different when binding to different technologies. E.g. for SEV, it
> may imply that after this call, the memfile can be accessed (through mmap or
> whatever) from userspace, while for current TDX this should not be allowed.

I think this is actually a good thing. While SEV, TDX, pKVM, etc achieve similar goals and have broadly similar ways of achieving them, they really are different, and having userspace be aware of the differences seems okay to me. (Although I don't think that allowing userspace to mmap SEV shared pages is particularly wise -- it will result in faults or cache incoherence depending on the variant of SEV in use.)

> And I feel we still need a third flow/operation to indicate the
> completion of the initialization on the memfile before the guest's
> first-time launch. SEV needs to check that previously mmap-ed areas are
> munmap-ed and prevent future userspace access. After this point, the
> memfile becomes a truly private fd.

Even that is technology-dependent. For TDX, this operation doesn't really exist. For SEV, I'm not sure (I haven't read the specs in nearly enough detail). For pKVM, I guess it does exist and isn't quite the same as a shared->private conversion.
Maybe this could be generalized a bit as an operation "measure and make private" that would be supported by the technologies for which it's useful.

>>
>> --
>>
>> Now I have a question, since I don't think anyone has really answered it:
>> how does this all work with SEV- or pKVM-like technologies in which private
>> and shared pages share the same address space? It sounds like you're
>> proposing to have a big memfile that contains private and shared pages and
>> to use that same memfile as pages are converted back and forth. IO and even
>> real physical DMA could be done on that memfile. Am I understanding
>> correctly?
>
> For TDX case, and probably SEV as well, this memfile contains private memory
> only. But this design at least makes it possible for usage cases like
> pKVM which wants both private/shared memory in the same memfile and relies
> on other ways like mmap/munmap or mprotect to toggle private/shared instead
> of fallocate/hole punching.

Hmm. Then we still need some way to get KVM to generate the correct SEV pagetables.

For TDX, there are private memslots and shared memslots, and they can overlap. If they overlap and both contain valid pages at the same address, then the results may not be what the guest-side ABI expects, but everything will work. So, when a single logical guest page transitions between shared and private, no change to the memslots is needed.

For SEV, this is not the case: everything is in one set of pagetables, and there isn't a natural way to resolve overlaps. If the memslot code becomes efficient enough, then the memslots could be fragmented. Or the memfile could support private and shared data in the same memslot. And if pKVM does this, I don't see why SEV couldn't also do it and hopefully reuse the same code.

>> If so, I think this makes sense, but I'm wondering if the actual memslot
>> setup should be different. For TDX, private memory lives in a logically
>> separate memslot space. For SEV and pKVM, it doesn't.
>> I assume the API can
>> reflect this straightforwardly.
>
> I believe so. The flow should be similar but we do need to pass different
> flags during the 'bind' to the backing store for different usages. That
> should be some new flags for pKVM but the callbacks (API here) between
> memfile_notifier and its consumers can be reused.

And also some different flag in the operation that installs the fd as a memslot?

>> And the corresponding TDX question: is the intent still that shared pages
>> aren't allowed at all in a TDX memfile? If so, that would be the most
>> direct mapping to what the hardware actually does.
>
> Exactly. TDX will still use fallocate/hole punching to turn on/off the
> private page. Once off, the tra
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> Only when the register succeeds, the fd is
> converted into a private fd, before that, the fd is just a normal (shared)
> one. During this conversion, the previous data is preserved so you can put
> some initial data in guest pages (whether the architecture allows this is
> architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward. On TDX, for example, what exactly are the semantics supposed to be? An error code if the memory isn't all zero? An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these memfiles results in a lightweight API but is awkward for the intended use cases. You're proposing, roughly:

1. Create a memfile. Now it's in a shared state with an unknown virt technology. It can be read and written. Let's call this state BRAND_NEW.

2. Bind to a VM. Now it's in a bound state. For TDX, for example, let's call the new state BOUND_TDX. In this state, the TDX rules are followed (private memory can't be converted, etc).

The problem here is that the BRAND_NEW state allows things that are nonsensical in TDX, and the binding step needs to invent some kind of semantics for what happens when binding a nonempty memfile.

So I would propose a somewhat different order:

1. Create a memfile. It's in the UNBOUND state and no operations whatsoever are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology). Now it's in the initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and TDX can't handle it cleanly.
For SEV, it bypasses a situation in which data might be written to the memory before we find out whether that data will be unreclaimable or unmovable.

--

Now I have a question, since I don't think anyone has really answered it: how does this all work with SEV- or pKVM-like technologies in which private and shared pages share the same address space? It sounds like you're proposing to have a big memfile that contains private and shared pages and to use that same memfile as pages are converted back and forth. IO and even real physical DMA could be done on that memfile. Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup should be different. For TDX, private memory lives in a logically separate memslot space. For SEV and pKVM, it doesn't. I assume the API can reflect this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages aren't allowed at all in a TDX memfile? If so, that would be the most direct mapping to what the hardware actually does.

--Andy
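The two-step lifecycle Andy proposes in the message above can be sketched as a tiny state machine. This is a sketch only; the state and function names (MF_UNBOUND, memfile_bind, etc.) are invented here, and per-technology access policy is simplified to "TDX forbids host read/write after binding":

```c
#include <assert.h>

/* Lifecycle per Andy's proposal: a memfile starts UNBOUND, where nothing
 * but bind/close is allowed, so the binding step never has to invent
 * semantics for pre-existing contents. */
enum memfile_state { MF_UNBOUND, MF_BOUND_TDX, MF_BOUND_SEV, MF_BOUND_PKVM };

/* Binding is one-shot: it moves UNBOUND to the initial state appropriate
 * for the chosen VM technology, and fails otherwise. */
static int memfile_bind(enum memfile_state *st, enum memfile_state tech)
{
    if (*st != MF_UNBOUND)
        return -1;
    *st = tech;
    return 0;
}

/* Host read/write is never allowed before binding; afterwards it is
 * technology policy (TDX: no; other backends: possibly yes). */
static int memfile_rw_allowed(enum memfile_state st)
{
    return st != MF_UNBOUND && st != MF_BOUND_TDX;
}
```

Compare this with the BRAND_NEW-first ordering criticized above: there, the pre-bind read/write window is exactly what forces the bind operation to define semantics for nonempty memfiles.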
Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worse than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations of how
> MEMLOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> iouring is per-user and uses a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open for alternatives to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert secretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable. It might even be swappable (in the potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

memory.unmoveable
memory.locked

These can be set to an actual number or they can be set to the special value ROOT_CAP. If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate unmoveable or locked memory with this (and potentially other) new APIs. If it's 0, then they can't. If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed.

The default at boot is ROOT_CAP. Anyone who wants to configure it differently is free to do so. This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups. Nothing is charged per mm.
To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.
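The proposed check could look roughly like the sketch below. Everything here is invented for illustration (the ROOT_CAP sentinel value, the function name, the flat parameters standing in for real cgroup state); it just encodes the three cases Andy describes: ROOT_CAP gates on CAP_SYS_RESOURCE, zero forbids, and any other value is a plain charged limit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sentinel for the special "ROOT_CAP" limit value. */
#define ROOT_CAP UINT64_MAX

/* Can this cgroup allocate `want` more bytes of unmoveable/locked memory?
 * - limit == ROOT_CAP: allowed only with global CAP_SYS_RESOURCE
 * - limit == 0:        never allowed (ROOT_CAP path aside)
 * - otherwise:         allowed up to the limit, charged, no capability needed
 */
static bool may_alloc_unmoveable(uint64_t limit, uint64_t charged,
                                 uint64_t want, bool has_cap_sys_resource)
{
    if (limit == ROOT_CAP)
        return has_cap_sys_resource;
    return charged + want <= limit;
}
```

Note that `limit == 0` falls out of the last line naturally (nothing fits under a zero limit), matching "If it's 0, then they can't."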
Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>> memory behave like longterm pinned pages and thus should be accounted to
>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>>
>> Signed-off-by: Chao Peng
>> ---
>> mm/shmem.c | 25 ++++++++++++++++++++++++-
>> 1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>> static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>> pgoff_t start, pgoff_t end)
>> {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>> struct shmem_inode_info *info = SHMEM_I(inode);
>>
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>> start = max(start, folio->index);
>> end = min(end, folio->index + folio_nr_pages(folio));
>>
>> memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>> #endif
>>
>> +if (info->xflags & SHM_F_INACCESSIBLE)
>> +atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm may
> be NULL. Or it may be a completely different mm, e.g. AFAICT there's nothing that
> prevents a different process from punching hole in the shmem backing.
>

How about just not charging the mm in the first place? There’s precedent: ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current status).

In any case, for an administrator to try to assemble the various rlimits into a coherent policy is, and always has been, quite messy. ISTM cgroup limits, which can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren’t in a filesystem and are thus available by default, I’m not convinced that this accounting is useful or necessary.
Maybe we could just have some switch required to enable creation of private memory in the first place, and anyone who flips that switch without configuring cgroups is subject to DoS.
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Apr 5, 2022, at 11:30 AM, Sean Christopherson wrote: > On Tue, Apr 05, 2022, Andy Lutomirski wrote: > >> resume guest >> *** host -> hypervisor -> guest *** >> Guest unshares the page. >> *** guest -> hypervisor *** >> Hypervisor removes PTE. TLBI. >> *** hypervisor -> guest *** >> >> Obviously considerable cleverness is needed to make a virt IOMMU like this >> work well, but still. >> >> Anyway, my suggestion is that the fd backing proposal get slightly modified >> to get it ready for multiple subtypes of backing object, which should be a >> pretty minimal change. Then, if someone actually needs any of this >> cleverness, it can be added later. In the mean time, the >> pread()/pwrite()/splice() scheme is pretty good. > > Tangentially related to getting private-fd ready for multiple things, > what about > implementing the pread()/pwrite()/splice() scheme in pKVM itself? I.e. > read() on > the VM fd, with the offset corresponding to gfn in some way. > Hmm, could make sense.
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Tue, Apr 5, 2022, at 3:36 AM, Quentin Perret wrote: > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote: >> >> >> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: >> > On Mon, Apr 04, 2022, Quentin Perret wrote: >> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: >> >> FWIW, there are a couple of reasons why I'd like to have in-place >> >> conversions: >> >> >> >> - one goal of pKVM is to migrate some things away from the Arm >> >>Trustzone environment (e.g. DRM and the likes) and into protected VMs >> >>instead. This will give Linux a fighting chance to defend itself >> >>against these things -- they currently have access to _all_ memory. >> >>And transitioning pages between Linux and Trustzone (donations and >> >>shares) is fast and non-destructive, so we really do not want pKVM to >> >>regress by requiring the hypervisor to memcpy things; >> > >> > Is there actually a _need_ for the conversion to be non-destructive? >> > E.g. I assume >> > the "trusted" side of things will need to be reworked to run as a pKVM >> > guest, at >> > which point reworking its logic to understand that conversions are >> > destructive and >> > slow-ish doesn't seem too onerous. >> > >> >> - it can be very useful for protected VMs to do shared=>private >> >>conversions. Think of a VM receiving some data from the host in a >> >>shared buffer, and then it wants to operate on that buffer without >> >>risking to leak confidential informations in a transient state. In >> >>that case the most logical thing to do is to convert the buffer back >> >>to private, do whatever needs to be done on that buffer (decrypting a >> >>frame, ...), and then share it back with the host to consume it; >> > >> > If performance is a motivation, why would the guest want to do two >> > conversions >> > instead of just doing internal memcpy() to/from a private page? 
I >> > would be quite >> > surprised if multiple exits and TLB shootdowns is actually faster, >> > especially at >> > any kind of scale where zapping stage-2 PTEs will cause lock contention >> > and IPIs. >> >> I don't know the numbers or all the details, but this is arm64, which is a >> rather better architecture than x86 in this regard. So maybe it's not so >> bad, at least in very simple cases, ignoring all implementation details. >> (But see below.) Also the systems in question tend to have fewer CPUs than >> some of the massive x86 systems out there. > > Yep. I can try and do some measurements if that's really necessary, but > I'm really convinced the cost of the TLBI for the shared->private > conversion is going to be significantly smaller than the cost of memcpy > the buffer twice in the guest for us. To be fair, although the cost for > the CPU update is going to be low, the cost for IOMMU updates _might_ be > higher, but that very much depends on the hardware. On systems that use > e.g. the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and > the iotlb invalidation is done on the back of the CPU invalidation. So, > on systems with sane hardware the overhead is *really* quite small. > > Also, memcpy requires double the memory, it is pretty bad for power, and > it causes memory traffic which can't be a good thing for things running > concurrently. > >> If we actually wanted to support transitioning the same page between shared >> and private, though, we have a bit of an awkward situation. Private to >> shared is conceptually easy -- do some bookkeeping, reconstitute the direct >> map entry, and it's done. The other direction is a mess: all existing uses >> of the page need to be torn down. If the page has been recently used for >> DMA, this includes IOMMU entries. >> >> Quentin: let's ignore any API issues for now. Do you have a concept of how >> a nondestructive shared -> private transition could work well, even in >> principle? 
> > I had a high level idea for the workflow, but I haven't looked into the > implementation details. > > The idea would be to allow KVM *or* userspace to take a reference > to a page in the fd in an exclusive manner. KVM could take a reference > on a page (which would be necessary before to donating it to a guest) > using some kind of memfile_notifier as proposed in this series, and > userspace could do the same some other way (mmap
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote: > On Mon, Apr 04, 2022, Quentin Perret wrote: >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote: >> FWIW, there are a couple of reasons why I'd like to have in-place >> conversions: >> >> - one goal of pKVM is to migrate some things away from the Arm >>Trustzone environment (e.g. DRM and the likes) and into protected VMs >>instead. This will give Linux a fighting chance to defend itself >>against these things -- they currently have access to _all_ memory. >>And transitioning pages between Linux and Trustzone (donations and >>shares) is fast and non-destructive, so we really do not want pKVM to >>regress by requiring the hypervisor to memcpy things; > > Is there actually a _need_ for the conversion to be non-destructive? > E.g. I assume > the "trusted" side of things will need to be reworked to run as a pKVM > guest, at > which point reworking its logic to understand that conversions are > destructive and > slow-ish doesn't seem too onerous. > >> - it can be very useful for protected VMs to do shared=>private >>conversions. Think of a VM receiving some data from the host in a >>shared buffer, and then it wants to operate on that buffer without >>risking to leak confidential informations in a transient state. In >>that case the most logical thing to do is to convert the buffer back >>to private, do whatever needs to be done on that buffer (decrypting a >>frame, ...), and then share it back with the host to consume it; > > If performance is a motivation, why would the guest want to do two > conversions > instead of just doing internal memcpy() to/from a private page? I > would be quite > surprised if multiple exits and TLB shootdowns is actually faster, > especially at > any kind of scale where zapping stage-2 PTEs will cause lock contention > and IPIs. I don't know the numbers or all the details, but this is arm64, which is a rather better architecture than x86 in this regard. 
So maybe it's not so bad, at least in very simple cases, ignoring all implementation details. (But see below.) Also the systems in question tend to have fewer CPUs than some of the massive x86 systems out there. If we actually wanted to support transitioning the same page between shared and private, though, we have a bit of an awkward situation. Private to shared is conceptually easy -- do some bookkeeping, reconstitute the direct map entry, and it's done. The other direction is a mess: all existing uses of the page need to be torn down. If the page has been recently used for DMA, this includes IOMMU entries. Quentin: let's ignore any API issues for now. Do you have a concept of how a nondestructive shared -> private transition could work well, even in principle? The best I can come up with is a special type of shared page that is not GUP-able and maybe not even mmappable, having a clear option for transitions to fail, and generally preventing the nasty cases from happening in the first place. Maybe there could be a special mode for the private memory fds in which specific pages are marked as "managed by this fd but actually shared". pread() and pwrite() would work on those pages, but not mmap(). (Or maybe mmap() but the resulting mappings would not permit GUP.) And transitioning them would be a special operation on the fd that is specific to pKVM and wouldn't work on TDX or SEV. Hmm. Sean and Chao, are we making a bit of a mistake by making these fds technology-agnostic? That is, would we want to distinguish between a TDX backing fd, a SEV backing fd, a software-based backing fd, etc? API-wise this could work by requiring the fd to be bound to a KVM VM instance and possibly even configured a bit before any other operations would be allowed. (Destructive transitions nicely avoid all the nasty cases. 
If something is still pinning a shared page when it's "transitioned" to private (really just replaced with a new page), then the old page continues existing for as long as needed as a separate object.)
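The parenthetical above is the crux of why destructive transitions are easy: the private page is a *new* object, and the old shared page simply lives on until its last pinner drops it. A refcount sketch (all names invented here; real kernel pages use `struct page`/folio refcounting, not this toy):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted page, standing in for struct page. */
struct page { int refcount; };

static struct page *page_get(struct page *p) { p->refcount++; return p; }
static void page_put(struct page *p) { if (--p->refcount == 0) free(p); }

/* "Convert" a GPA to private destructively: replace the backing page
 * with a fresh one rather than mutating it.  The old page may still be
 * pinned (e.g. for DMA) and stays valid for its pinners; the caller
 * eventually drops the mapping's reference to it. */
static struct page *convert_destructive(struct page **backing)
{
    struct page *old = *backing;
    struct page *fresh = calloc(1, sizeof(*fresh));
    fresh->refcount = 1;
    *backing = fresh;   /* guest now sees the new (private) page */
    return old;         /* caller: page_put() when the mapping ref drops */
}
```

A nondestructive in-place transition would instead have to hunt down and revoke every existing reference to the same page, which is exactly the mess described above.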
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Fri, Apr 1, 2022, at 7:59 AM, Quentin Perret wrote: > On Thursday 31 Mar 2022 at 09:04:56 (-0700), Andy Lutomirski wrote: > To answer your original question about memory 'conversion', the key > thing is that the pKVM hypervisor controls the stage-2 page-tables for > everyone in the system, all guests as well as the host. As such, a page > 'conversion' is nothing more than a permission change in the relevant > page-tables. > So I can see two different ways to approach this. One is that you split the whole address space in half and, just like SEV and TDX, allocate one bit to indicate the shared/private status of a page. This makes it work a lot like SEV and TDX. The other is to have shared and private pages be distinguished only by their hypercall history and the (protected) page tables. This saves some address space and some page table allocations, but it opens some cans of worms too. In particular, the guest and the hypervisor need to coordinate, in a way that the guest can trust, to ensure that the guest's idea of which pages are private match the host's. This model seems a bit harder to support nicely with the private memory fd model, but not necessarily impossible. Also, what are you trying to accomplish by having the host userspace mmap private pages? Is the idea that multiple guest could share the same page until such time as one of them tries to write to it? That would be kind of like having a third kind of memory that's visible to host and guests but is read-only for everyone. TDX and SEV can't support this at all (a private page belongs to one guest and one guest only, at least in SEV and in the current TDX SEAM spec). I imagine that this could be supported with private memory fds with some care without mmap, though -- the host could still populate the page with memcpy. Or I suppose a memslot could support using MAP_PRIVATE fds and have approximately the right semantics. --Andy
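The first option above (splitting the address space, SEV/TDX-style) amounts to stealing one IPA bit as a shared/private flag. A minimal sketch; bit 55 is an arbitrary example chosen here, not anything pKVM, SEV (C-bit), or TDX actually specifies:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Example only: one stolen IPA bit marks an access as shared.  The
 * real bit position is architecture- and configuration-specific. */
#define GPA_SHARED_BIT (1ULL << 55)

static bool gpa_is_shared(uint64_t gpa)
{
    return (gpa & GPA_SHARED_BIT) != 0;
}

/* Strip the flag bit to recover the "real" address both aliases share. */
static uint64_t gpa_strip_flag(uint64_t gpa)
{
    return gpa & ~GPA_SHARED_BIT;
}
```

This makes the shared/private decision stateless and self-describing at fault time, at the cost of half the guest physical address space, which is the trade-off against the hypercall-history model discussed above.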
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Wed, Mar 30, 2022, at 10:58 AM, Sean Christopherson wrote: > On Wed, Mar 30, 2022, Quentin Perret wrote: >> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote: >> > On 29/03/2022 18:01, Quentin Perret wrote: >> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in >> > > the shared gpa range at an address that doesn't have a backing memslot, >> > > will KVM check whether there is a corresponding private memslot at the >> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or >> > > would that just generate an MMIO exit as usual? >> > >> > My understanding is that the guest needs some way of tagging whether a >> > page is expected to be shared or private. On the architectures I'm aware >> > of this is done by effectively stealing a bit from the IPA space and >> > pretending it's a flag bit. >> >> Right, and that is in fact the main point of divergence we have I think. >> While I understand this might be necessary for TDX and the likes, this >> makes little sense for pKVM. This would effectively embed into the IPA a >> purely software-defined non-architectural property/protocol although we >> don't actually need to: we (pKVM) can reasonably expect the guest to >> explicitly issue hypercalls to share pages in-place. So I'd be really >> keen to avoid baking in assumptions about that model too deep in the >> host mm bits if at all possible. > > There is no assumption about stealing PA bits baked into this API. Even > within > x86 KVM, I consider it a hard requirement that the common flows not assume the > private vs. shared information is communicated through the PA. Quentin, I think we might need a clarification. The API in this patchset indeed has no requirement that a PA bit distinguish between private and shared, but I think it makes at least a weak assumption that *something*, a priori, distinguishes them. 
In particular, there are private memslots and shared memslots, so the logical flow of resolving a guest memory access looks like: 1. guest accesses a GVA 2. read guest paging structures 3. determine whether this is a shared or private access 4. read host (KVM memslots and anything else, EPT, NPT, RMP, etc) structures accordingly. In particular, the memslot to reference is different depending on the access type. For TDX, this maps on to the fd-based model perfectly: the host-side paging structures for the shared and private slots are completely separate. For SEV, the structures are shared and KVM will need to figure out what to do in case a private and shared memslot overlap. Presumably it's sufficient to declare that one of them wins, although actually determining which one is active for a given GPA may involve checking whether the backing store for a given page actually exists. But I don't understand pKVM well enough to understand how it fits in. Quentin, how is the shared vs private mode of a memory access determined? How do the paging structures work? Can a guest switch between shared and private by issuing a hypercall without changing any guest-side paging structures or anything else? It's plausible that SEV and (maybe) pKVM would be better served if memslots could be sparse or if there was otherwise a direct way for host userspace to indicate to KVM which address ranges are actually active (not hole-punched) in a given memslot or to otherwise be able to make a rule that two different memslots (one shared and one private) can't claim the same address.
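The resolution flow in steps 1-4 above can be sketched as a lookup in which the access kind, determined *before* the memslot search, selects which memslot space is consulted. All types and names below are invented for illustration; real KVM memslot lookup is of course more involved:

```c
#include <assert.h>
#include <stdint.h>

/* Step 3's outcome: is this a shared or private access? */
enum access_kind { ACCESS_SHARED, ACCESS_PRIVATE };

/* Toy memslot: a GPA range tagged with the space it belongs to. */
struct slot {
    uint64_t base, size;
    enum access_kind kind;
};

/* Step 4: search only the memslot space matching the access kind, so a
 * shared slot and a private slot may overlap (the TDX model) without
 * ambiguity -- the other-kind slot is simply never considered. */
static const struct slot *find_slot(const struct slot *slots, int n,
                                    uint64_t gpa, enum access_kind kind)
{
    for (int i = 0; i < n; i++)
        if (slots[i].kind == kind &&
            gpa >= slots[i].base && gpa < slots[i].base + slots[i].size)
            return &slots[i];
    return 0;
}
```

For SEV-like designs with a single set of pagetables, `kind` cannot be derived from the access alone, which is why overlapping slots there need a tie-break rule (or the sparse/non-overlapping memslot constraint floated at the end of the message above).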
Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On Thu, Mar 10, 2022 at 6:09 AM Chao Peng wrote: > > This is the v5 of this series which tries to implement the fd-based KVM > guest private memory. The patches are based on latest kvm/queue branch > commit: > > d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 Can this series be run and a VM booted without TDX? A feature like that might help push it forward. --Andy
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On 2/23/22 04:05, Steven Price wrote: On 23/02/2022 11:49, Chao Peng wrote: On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote: On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote: On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote: On 1/18/22 05:21, Chao Peng wrote: From: "Kirill A. Shutemov"

Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal.

I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that creates the correct type of object, or there should be a clear justification for why it's a two-step process.

Now one justification maybe from Steven's comment to patch-00: for ARM usage it can be used with creating a normal memfd, (partially) populate it with initial guest memory content (e.g. firmware), and then F_SEAL_INACCESSIBLE it just before the first-time launch of the guest in KVM (definitely the current code needs to be changed to support that).

Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work.

Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will need to make sure access to existing mmap-ed area should be prevented, but that is hard.
In any case, the whole confidential VM initialization story is a bit muddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism.

Yes, TDX requires an ioctl. Steven may comment on the ARM part.

The Arm story is evolving so I can't give a definite answer yet. Our current prototyping works by creating the initial VM content in a memslot as with a normal VM and then calling an ioctl which throws the big switch and converts all the (populated) pages to be protected. At this point the RMM performs a measurement of the data that the VM is being populated with.

The above (in our prototype) suffers from all the expected problems with a malicious VMM being able to trick the host kernel into accessing those pages after they have been protected (causing a fault detected by the hardware).

The ideal (from our perspective) approach would be to follow the same flow but where the VMM populates a memfd rather than normal anonymous pages. The memfd could then be sealed and the pages converted to protected ones (with the RMM measuring them in the process). The question becomes how is that memfd populated? It would be nice if that could be done using normal operations on a memfd (i.e. using mmap()) and therefore this code could be (relatively) portable. This would mean that any pages mapped from the memfd would either need to block the sealing or be revoked at the time of sealing.
The other approach is we could of course implement a special ioctl which effectively does a memcpy into the (created empty and sealed) memfd and does the necessary dance with the RMM to measure the contents. This would match the "transcript of the series of operations" described above - but seems much less ideal from the viewpoint of the VMM. A VMM that supports Other Vendors will need to understand this sort of model regardless. I don't particularly mind the idea of having the kernel consume a normal memfd and spit out a new object, but I find the concept of changing the type of the object in place, even if it has other references, and trying to control all the resulting races to be somewhat alarming. In pseudo-Rust, this is the difference between: fn convert_to_private(in: Memfd) and fn convert_to_private(in: Memfd) -> PrivateMemoryFd This doesn't map particularly nicely to the kernel, though. --Andy
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote: > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote: >> On 1/18/22 05:21, Chao Peng wrote: >> > From: "Kirill A. Shutemov" >> > >> > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of >> > the file is inaccessible from userspace through ordinary MMU access >> > (e.g., read/write/mmap). However, the file content can be accessed >> > via a different mechanism (e.g. KVM MMU) indirectly. >> > >> > It provides semantics required for KVM guest private memory support >> > that a file descriptor with this seal set is going to be used as the >> > source of guest memory in confidential computing environments such >> > as Intel TDX/AMD SEV but may not be accessible from host userspace. >> > >> > At this time only shmem implements this seal. >> > >> >> I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE >> essentially transmutes a memfd into a different type of object. While this >> can apparently be done successfully and without races (as in this code), >> it's at least awkward. I think that either creating a special inaccessible >> memfd should be a single operation that creates the correct type of object or >> there should be a clear justification for why it's a two-step process. > > Now one justification may be from Steven's comment to patch-00: for ARM > usage it can be used with creating a normal memfd, (partially) populating > it with initial guest memory content (e.g. firmware), and then > F_SEAL_INACCESSIBLE-ing it just before the first launch of the guest in > KVM (definitely the current code needs to be changed to support that). Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right? So this won't work. In any case, the whole confidential VM initialization story is a bit muddy. From the earlier emails, it sounds like ARM expects the host to fill in guest memory and measure it. 
From my recollection of Intel's scheme (which may well be wrong, and I could easily be confusing it with SGX), TDX instead measures what is essentially a transcript of the series of operations that initializes the VM. These are fundamentally not the same thing even if they accomplish the same end goal. For TDX, we unavoidably need an operation (ioctl or similar) that initializes things according to the VM's instructions, and ARM ought to be able to use roughly the same mechanism. Also, if we ever get fancy and teach the page allocator about memory with reduced directmap permissions, it may well be more efficient for userspace to shove data into a memfd via ioctl than it is to mmap it and write the data.
Re: [PATCH v4 04/12] mm/shmem: Support memfile_notifier
On 1/18/22 05:21, Chao Peng wrote: It maintains a memfile_notifier list in shmem_inode_info structure and implements memfile_pfn_ops callbacks defined by memfile_notifier. It then exposes them to memfile_notifier via shmem_get_memfile_notifier_info. We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be allocated by userspace for private memory. If there are no pages allocated at the offset then error should be returned so KVM knows that the memory is not private memory. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng static int memfile_get_notifier_info(struct inode *inode, struct memfile_notifier_list **list, struct memfile_pfn_ops **ops) { - return -EOPNOTSUPP; + int ret = -EOPNOTSUPP; +#ifdef CONFIG_SHMEM + ret = shmem_get_memfile_notifier_info(inode, list, ops); +#endif + return ret; } +int shmem_get_memfile_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + struct shmem_inode_info *info; + + if (!shmem_mapping(inode->i_mapping)) + return -EINVAL; + + info = SHMEM_I(inode); + *list = &info->memfile_notifiers; + if (ops) + *ops = &shmem_pfn_ops; + + return 0; I can't wrap my head around exactly who is supposed to call these functions and when, but there appears to be a missing check that the inode is actually a shmem inode. What is this code trying to do? It's very abstract.
Re: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE
On 1/18/22 05:21, Chao Peng wrote: From: "Kirill A. Shutemov" Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal. I don't dislike this *that* much, but I do dislike this. F_SEAL_INACCESSIBLE essentially transmutes a memfd into a different type of object. While this can apparently be done successfully and without races (as in this code), it's at least awkward. I think that either creating a special inaccessible memfd should be a single operation that creates the correct type of object or there should be a clear justification for why it's a two-step process. (Imagine if the way to create an eventfd were to call timerfd_create() and then do a special fcntl to turn it into an eventfd but only if it's not currently armed. This would be weird.)
Re: [RFC v2 PATCH 01/13] mm/shmem: Introduce F_SEAL_GUEST
On 11/19/21 05:47, Chao Peng wrote: From: "Kirill A. Shutemov" The new seal type provides semantics required for KVM guest private memory support. A file descriptor with the seal set is going to be used as source of guest memory in confidential computing environments such as Intel TDX and AMD SEV. F_SEAL_GUEST can only be set on empty memfd. After the seal is set userspace cannot read, write or mmap the memfd. I don't have a strong objection here, but, given that you're only supporting it for memfd, would a memfd_create() flag be more straightforward? If nothing else, it would avoid any possible locking issue. I'm also very very slightly nervous about a situation in which one program sends a memfd to an untrusted other process and that process truncates the memfd and then F_SEAL_GUESTs it. This could be mostly mitigated by also requiring that no other seals be set when F_SEAL_GUEST happens, but the alternative MFD_GUEST would eliminate this issue too.
Re: [RFC v2 PATCH 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
On 11/19/21 05:47, Chao Peng wrote: This RFC series try to implement the fd-based KVM guest private memory proposal described at [1] and an improved 'New Proposal' described at [2]. I generally like this. Thanks!
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin wrote: > > On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote: > > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote: > > > > > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > > > 4c. The guest kernel maintains an array of physical addresses that are > > > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > > > location through whatever protocol, and before resuming a > > > > moved/snapshotted/duplicated VM, it takes the responsibility for > > > > memzeroing this memory. The huge pro here would be that this > > > > eliminates all races, and reduces complexity quite a bit, because the > > > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > > > with this, and it can even optimize things like on-disk memory > > > > snapshots to simply not write out those pages to disk. > > > > > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > > > userspace API to deal with, and it'd be race free, and eliminate a lot > > > > of kernel complexity. > > > > > > Clearly this has a chance to break applications, right? > > > If there's an app that uses this as a non-system-calls way > > > to find out whether there was a fork, it will break > > > when wipe triggers without a fork ... > > > For example, imagine: > > > > > > MADV_WIPEONFORK > > > copy secret data to MADV_DONTFORK > > > fork > > > > > > > > > used to work, with this change it gets 0s instead of the secret data. > > > > > > > > > I am also not sure it's wise to expose each guest process > > > to the hypervisor like this. E.g. each process needs a > > > guest physical address of its own then. This is a finite resource. > > > > > > > > > The mmap interface proposed here is somewhat baroque, but it is > > > certainly simple to implement ... 
> > > > Wipe of fork/vmgenid/whatever could end up being much more problematic > > than it naively appears -- it could be wiped in the middle of a read. > > Either the API needs to handle this cleanly, or we need something more > > aggressive like signal-on-fork. > > > > --Andy > > > Right, it's not on fork, it's actually when process is snapshotted. > > If we assume it's CRIU we care about, then I > wonder what's wrong with something like > MADV_CHANGEONPTRACE_SEIZE > and basically say it's X bytes which change the value... I feel like we may be approaching this from the wrong end. Rather than saying "what data structure can the kernel expose that might plausibly be useful", how about we try identifying some specific userspace needs and see what a good solution could look like. I can identify two major cryptographic use cases: 1. A userspace RNG. The API exposed by the userspace end is a function that generates random numbers. The userspace code in turn wants to know some things from the kernel: it wants some best-quality-available random seed data from the kernel (and possibly an indication of how good it is) as well as an indication of whether the userspace memory may have been cloned or rolled back, or, failing that, an indication of whether a reseed is needed. Userspace could implement a wide variety of algorithms on top depending on its goals and compliance requirements, but the end goal is for the userspace part to be very, very fast. 2. A userspace crypto stack that wants to avoid shooting itself in the foot due to inadvertently doing the same thing twice. For example, an AES-GCM stack does not want to reuse an IV, *especially* if there is even the slightest chance that it might reuse the IV for different data. This use case doesn't necessarily involve random numbers, but, if anything, it needs to be even faster than #1. The threats here are not really the same. 
For #1, a userspace RNG should be able to recover from a scenario in which an adversary clones the entire process *and gets to own the clone*. For example, in Android, an adversary can often gain complete control of a fork of the zygote -- this shouldn't adversely affect the security properties of other forks. Similarly, a server farm could operate by having one booted server that is cloned to create more workers. Those clones could be provisioned with secrets and permissions post-clone, and an attacker gaining control of a fresh clone could be considered acceptable. For #2, in contrast, if an adversary gains control of a clone of an AES-GCM session, they learn the key outright -- the relevant attack scenario is that the adversary gets to interact with two clones without compromising either clone per se. It's worth noting that, in both cases, there could possibly be more than one instance of an RNG or an AES-GCM session in the same process. This means that using signals is awkward but not necessarily impossible. (This is an area in which Linux, and POSIX in general, is much weaker than Windows.)
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote: > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote: > > 4c. The guest kernel maintains an array of physical addresses that are > > MADV_WIPEONFORK. The hypervisor knows about this array and its > > location through whatever protocol, and before resuming a > > moved/snapshotted/duplicated VM, it takes the responsibility for > > memzeroing this memory. The huge pro here would be that this > > eliminates all races, and reduces complexity quite a bit, because the > > hypervisor can perfectly synchronize its bringup (and SMP bringup) > > with this, and it can even optimize things like on-disk memory > > snapshots to simply not write out those pages to disk. > > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new > > userspace API to deal with, and it'd be race free, and eliminate a lot > > of kernel complexity. > > Clearly this has a chance to break applications, right? > If there's an app that uses this as a non-system-calls way > to find out whether there was a fork, it will break > when wipe triggers without a fork ... > For example, imagine: > > MADV_WIPEONFORK > copy secret data to MADV_DONTFORK > fork > > > used to work, with this change it gets 0s instead of the secret data. > > > I am also not sure it's wise to expose each guest process > to the hypervisor like this. E.g. each process needs a > guest physical address of its own then. This is a finite resource. > > > The mmap interface proposed here is somewhat baroque, but it is > certainly simple to implement ... Wipe of fork/vmgenid/whatever could end up being much more problematic than it naively appears -- it could be wiped in the middle of a read. Either the API needs to handle this cleanly, or we need something more aggressive like signal-on-fork. --Andy
Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
On Fri, Oct 16, 2020 at 6:40 PM Jann Horn wrote: > > [adding some more people who are interested in RNG stuff: Andy, Jason, > Theodore, Willy Tarreau, Eric Biggers. also linux-api@, because this > concerns some pretty fundamental API stuff related to RNG usage] > > On Fri, Oct 16, 2020 at 4:33 PM Catangiu, Adrian Costin > wrote: > > - Background > > > > The VM Generation ID is a feature defined by Microsoft (paper: > > http://go.microsoft.com/fwlink/?LinkId=260709) and supported by > > multiple hypervisor vendors. > > > > The feature is required in virtualized environments by apps that work > > with local copies/caches of world-unique data such as random values, > > uuids, monotonically increasing counters, etc. > > Such apps can be negatively affected by VM snapshotting when the VM > > is either cloned or returned to an earlier point in time. > > > > The VM Generation ID is a simple concept meant to alleviate the issue > > by providing a unique ID that changes each time the VM is restored > > from a snapshot. The hw provided UUID value can be used to > > differentiate between VMs or different generations of the same VM. > > > > - Problem > > > > The VM Generation ID is exposed through an ACPI device by multiple > > hypervisor vendors but neither the vendors nor upstream Linux has a > > default driver for it, leaving users to fend for themselves. > > > > Furthermore, simply finding out about a VM generation change is only > > the starting point of a process to renew internal states of possibly > > multiple applications across the system. This process could benefit > > from a driver that provides an interface through which orchestration > > can be easily done. 
> > > > - Solution > > > > This patch is a driver which exposes the Virtual Machine Generation ID > > via a char-dev FS interface that provides ID update sync and async > > notification, retrieval and confirmation mechanisms: > > > > When the device is 'open()'ed a copy of the current vm UUID is > > associated with the file handle. 'read()' operations block until the > > associated UUID is no longer up to date - until HW vm gen id changes - > > at which point the new UUID is provided/returned. Nonblocking 'read()' > > uses EWOULDBLOCK to signal that there is no _new_ UUID available. > > > > 'poll()' is implemented to allow polling for UUID updates. Such > > updates result in 'EPOLLIN' events. > > > > Subsequent read()s following a UUID update no longer block, but return > > the updated UUID. The application needs to acknowledge the UUID update > > by confirming it through a 'write()'. > > Only on writing back to the driver the right/latest UUID, will the > > driver mark this "watcher" as up to date and remove EPOLLIN status. > > > > 'mmap()' support allows mapping a single read-only shared page which > > will always contain the latest UUID value at offset 0. > > It would be nicer if that page just contained an incrementing counter, > instead of a UUID. It's not like the application cares *what* the UUID > changed to, just that it *did* change and all RNGs state now needs to > be reseeded from the kernel, right? And an application can't reliably > read the entire UUID from the memory mapping anyway, because the VM > might be forked in the middle. > > So I think your kernel driver should detect UUID changes and then turn > those into a monotonically incrementing counter. (Probably 64 bits > wide?) (That's probably also a little bit faster than comparing an > entire UUID.) > > An option might be to put that counter into the vDSO, instead of a > separate VMA; but I don't know how the other folks feel about that. > Andy, do you have opinions on this? 
That way, normal userspace code > that uses this infrastructure wouldn't have to mess around with a > special device at all. And it'd be usable in seccomp sandboxes and so > on without needing special plumbing. And libraries wouldn't have to > call open() and mess with file descriptor numbers. The vDSO might be annoyingly slow for this. Something like the rseq page might make sense. It could be a generic indication of "system went through some form of suspend".
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
> On Dec 28, 2018, at 6:54 PM, Matthew Wilcox wrote: > >> On Sat, Dec 29, 2018 at 12:12:27AM +, Peter Maydell wrote: >> On Fri, 28 Dec 2018 at 23:16, Andreas Dilger wrote: >>> On Dec 28, 2018, at 4:18 AM, Peter Maydell wrote: The problem is that there is no 32-bit API in some cases (unless I have misunderstood the kernel code) -- not all host architectures implement compat syscalls or allow them to be called from 64-bit processes or implement all the older syscall variants that had smaller offsets. If there was a guaranteed "this syscall always exists and always gives me 32-bit offsets" we could use it. >>> >>> The "32bitapi" mount option would use 32-bit hash for seekdir >>> and telldir, regardless of what kernel API was used. That would >>> just set the FMODE_32BITHASH flag in the file->f_mode for all files. >> >> A mount option wouldn't be much use to QEMU -- we can't tell >> our users how to mount their filesystems, which they're >> often doing lots of other things with besides running QEMU. >> (Otherwise we could just tell them "don't use ext4", which >> would also solve the problem :-)) We need something we can >> use at the individual-syscall level. > > Could you use a prctl to set whether you were running in 32 or 64 bit > mode? Or do you change which kind of task you're emulating too often > to make this a good idea? How would this work? We already have the separate COMPAT_SYSCALL_DEFINE entries *and* in_compat_syscall(). Now we’d have a third degree of freedom. Either the arches people care about should add reasonable ways to issue 32-bit syscalls from 64-bit mode or there should be an explicit way to ask for the 32-bit directory offsets.
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
[sending again, slightly edited, due to email client issues] On Thu, Dec 27, 2018 at 9:25 AM Florian Weimer wrote: > > We have a bit of an interesting problem with respect to the d_off > field in struct dirent. > > When running a 64-bit kernel on certain file systems, notably ext4, > this field uses the full 63 bits even for small directories (strace -v > output, wrapped here for readability): > > getdents(3, [ > {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, > d_name="authorized_keys", d_type=DT_REG}, > {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", > d_type=DT_DIR}, > {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", > d_type=DT_DIR} > ], 32768) = 88 > > When running in 32-bit compat mode, this value is somehow truncated to > 31 bits, for both the getdents and the getdents64 (!) system call (at > least on i386). > ... > > However, both qemu-user and the 9p file system can run in such a way > that the kernel is entered from a 64-bit process, but the actual usage > is from a 32-bit process: I imagine that at least some of the problems you're seeing are due to this bug: https://lkml.org/lkml/2018/10/18/859 Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations. I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.
Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
> On Dec 27, 2018, at 10:18 AM, Florian Weimer wrote: > > We have a bit of an interesting problem with respect to the d_off > field in struct dirent. > > When running a 64-bit kernel on certain file systems, notably ext4, > this field uses the full 63 bits even for small directories (strace -v > output, wrapped here for readability): > > getdents(3, [ > {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, > d_name="authorized_keys", d_type=DT_REG}, > {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", > d_type=DT_DIR}, > {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", > d_type=DT_DIR} > ], 32768) = 88 > > When running in 32-bit compat mode, this value is somehow truncated to > 31 bits, for both the getdents and the getdents64 (!) system call (at > least on i386). I imagine you’re encountering this bug: https://lkml.org/lkml/2018/10/18/859 Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations. I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:54 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Wed, Apr 27, 2016 at 07:43:07AM -0700, Andy Lutomirski wrote: >> On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote: >> >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote: >> >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> >> >> One correction: it's a feature of the device in the system. >> >> >> There could be a mix of devices bypassing and not >> >> >> bypassing the IOMMU. >> >> > >> >> > No, it really is not. A device can't chose to bypass the IOMMU. But the >> >> > IOMMU can chose to let the device bypass. So any fix here belongs >> >> > into the platform/iommu code too and not into some driver. >> >> > >> >> >> Sounds good. And a way to detect appropriate devices could >> >> >> be by looking at the feature flag, perhaps? >> >> > >> >> > Again, no! The way to detect that is to look into the iommu description >> >> > structures provided by the firmware. They provide everything necessary >> >> > to tell the iommu code which devices are not translated. >> >> > >> >> >> >> Except on PPC and SPARC. As far as I know, those are the only >> >> problematic platforms. >> >> >> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be >> >> fixed to report correct data in the DMAR tables? >> >> >> >> --Andy >> > >> > Meaning virtio or assigned devices? >> > For virtio - it's way too late since these are working configurations. >> > For assigned devices - they don't work on x86 so it doesn't have >> > to be disabled, it's safe to ignore. >> >> I mean actually prevent QEMU from running in q35-iommu mode with any >> virtio devices attached or maybe even turn off q35-iommu mode entirely >> [1]. Doesn't it require that the user literally pass the word >> "experimental" into QEMU right now? It did at some point IIRC. 
>> >> The reason I'm asking is that, other than q35-iommu, QEMU's virtio >> devices *don't* bypass the IOMMU except on PPC and SPARC, simply >> because there is no other configuration AFAICT that has virtio and an >> IOMMU. So maybe the right solution is to fix q35-iommu to use DMAR >> correctly (thus breaking q35-iommu users with older guest kernels, >> which hopefully don't actually exist) and to come up with a PPC- and >> SPARC-specific solution, or maybe OpenFirmware-specific solution, to >> handle PPC and SPARC down the road. >> >> [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever >> showed up in a release asking the QEMU team to please not do that >> until this issue was resolved. Sadly, that email was ignored :( >> >> --Andy > > Sorry, I didn't make myself clear. > Point is, QEMU is not the only virtio implementation out there. > So we can't know no virtio implementations have an IOMMU as long as > linux supports this IOMMU. > virtio always used physical addresses since it was born and if it > changes that it must do this in a way that does not break existing > users. Is there any non-QEMU virtio implementation that can provide an IOMMU-bypassing virtio device on a platform that has a nontrivial IOMMU? --Andy
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:38 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Wed, Apr 27, 2016 at 07:31:43AM -0700, Andy Lutomirski wrote: >> On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel <j...@8bytes.org> wrote: >> > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> >> One correction: it's a feature of the device in the system. >> >> There could be a mix of devices bypassing and not >> >> bypassing the IOMMU. >> > >> > No, it really is not. A device can't choose to bypass the IOMMU. But the >> > IOMMU can choose to let the device bypass. So any fix here belongs >> > into the platform/iommu code too and not into some driver. >> > >> >> Sounds good. And a way to detect appropriate devices could >> >> be by looking at the feature flag, perhaps? >> > >> > Again, no! The way to detect that is to look into the iommu description >> > structures provided by the firmware. They provide everything necessary >> > to tell the iommu code which devices are not translated. >> > >> >> Except on PPC and SPARC. As far as I know, those are the only >> problematic platforms. >> >> Is it too late to *disable* QEMU's q35-iommu thingy until it can be >> fixed to report correct data in the DMAR tables? >> >> --Andy > > Meaning virtio or assigned devices? > For virtio - it's way too late since these are working configurations. > For assigned devices - they don't work on x86 so it doesn't have > to be disabled, it's safe to ignore. I mean actually prevent QEMU from running in q35-iommu mode with any virtio devices attached or maybe even turn off q35-iommu mode entirely [1]. Doesn't it require that the user literally pass the word "experimental" into QEMU right now? It did at some point IIRC. The reason I'm asking is that, other than q35-iommu, QEMU's virtio devices *don't* bypass the IOMMU except on PPC and SPARC, simply because there is no other configuration AFAICT that has virtio and an IOMMU. 
So maybe the right solution is to fix q35-iommu to use DMAR correctly (thus breaking q35-iommu users with older guest kernels, which hopefully don't actually exist) and to come up with a PPC- and SPARC-specific solution, or maybe OpenFirmware-specific solution, to handle PPC and SPARC down the road. [1] I'm pretty sure I emailed the QEMU list before q35-iommu ever showed up in a release asking the QEMU team to please not do that until this issue was resolved. Sadly, that email was ignored :( --Andy
Re: [Qemu-devel] [PATCH V2 RFC] fixup! virtio: convert to use DMA api
On Wed, Apr 27, 2016 at 7:23 AM, Joerg Roedel wrote: > On Wed, Apr 27, 2016 at 04:37:04PM +0300, Michael S. Tsirkin wrote: >> One correction: it's a feature of the device in the system. >> There could be a mix of devices bypassing and not >> bypassing the IOMMU. > > No, it really is not. A device can't choose to bypass the IOMMU. But the > IOMMU can choose to let the device bypass. So any fix here belongs > into the platform/iommu code too and not into some driver. > >> Sounds good. And a way to detect appropriate devices could >> be by looking at the feature flag, perhaps? > > Again, no! The way to detect that is to look into the iommu description > structures provided by the firmware. They provide everything necessary > to tell the iommu code which devices are not translated. > Except on PPC and SPARC. As far as I know, those are the only problematic platforms. Is it too late to *disable* QEMU's q35-iommu thingy until it can be fixed to report correct data in the DMAR tables? --Andy
[Qemu-devel] [Bug 1574346] [NEW] TCG: mov to segment register is incorrectly emulated for AMD CPUs
Public bug reported: In TCG mode, the effect of: xorl %eax, %eax movl %eax, %gs is to mark the GS segment unusable and set its base to zero. After doing this, reading MSR_GS_BASE will return zero and using a GS prefix in long mode will treat the GS base as zero. This is correct for Intel CPUs but is incorrect for AMD CPUs. On an AMD CPU, writing 0 to %gs using mov, pop, or (I think) lgs will leave the base unchanged. To make it easier to use TCG to validate behavior on different CPUs, please consider changing the TCG behavior to match actual CPU behavior when emulating an AMD CPU. ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1574346 Title: TCG: mov to segment register is incorrectly emulated for AMD CPUs Status in QEMU: New To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1574346/+subscriptions
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" <m...@redhat.com> wrote: > > On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote: > > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <m...@redhat.com> wrote: > > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote: > > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> > > >> wrote: > > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: > > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin > > >> >> <m...@redhat.com> wrote: > > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: > > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: > > >> >> >> > > > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host > > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, > > >> >> >> > > the host > > >> >> >> > > device would skip translation? Or is that problematic for > > >> >> >> > > vfio? > > >> >> >> > > > >> >> >> > Exactly that's problematic for security. > > >> >> >> > You can't allow guest driver to decide whether device skips > > >> >> >> > security. > > >> >> >> > > >> >> >> Right. Because fundamentally, this *isn't* a property of the > > >> >> >> endpoint > > >> >> >> device, and doesn't live in virtio itself. > > >> >> >> > > >> >> >> It's a property of the platform IOMMU, and lives there. > > >> >> > > > >> >> > It's a property of the hypervisor virtio implementation, and lives > > >> >> > there. > > >> >> > > >> >> It is now, but QEMU could, in principle, change the way it thinks > > >> >> about it so that virtio devices would use the QEMU DMA API but ask > > >> >> QEMU to pass everything through 1:1. This would be entirely invisible > > >> >> to guests but would make it be a property of the IOMMU implementation. > > >> >> At that point, maybe QEMU could find a (platform dependent) way to > > >> >> tell the guest what's going on. 
> > >> >> > > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, > > >> >> set up 1:1 mappings in the guest so that the virtio devices would work > > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the > > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set > > >> >> up with an offset. I don't know too much about those platforms, but > > >> >> presumably the layout could be changed so that 1:1 really was 1:1. > > >> >> > > >> >> --Andy > > >> > > > >> > Sure. Do you see any reason why the decision to do this can't be > > >> > keyed off the virtio feature bit? > > >> > > >> I can think of three types of virtio host: > > >> > > >> a) virtio always bypasses the IOMMU. > > >> > > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say > > >> it does) -- i.e. virtio works like any other device. > > >> > > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do. > > > > > > d) some virtio devices bypass the IOMMU and some don't, > > > e.g. it's harder to support IOMMU with vhost. > > > > > > > > >> If this is keyed off a virtio feature bit and anyone tries to > > >> implement (c), the vfio is going to have a problem. And, if it's > > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar > > >> setups unless the Xen hypervisor adds a giant and probably unreliable > > >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a > > >> default x86 QEMU configuration, and I'd really like to keep it that > > >> way. > > >> > > >> What could plausibly work using a virtio feature bit is for a device > > >> to say "hey, I'm a new device and I support the platform-defined IOMMU > > >> mechanism". This bit would be *set* on default IOMMU-less QEMU > > >> configurations and on physical virtio PCI cards. > > > > > > And clear on xen. > > > > How? QEMU has no idea that the guest is running Xen. 
> > I was under the impression xen_enabled() is true in QEMU. > Am I wrong? I'd be rather surprised, given that QEMU would have to inspect the guest kernel to figure it out. I'm talking about Xen under QEMU. For example, if you feed QEMU a guest disk image that contains Fedora with the Xen packages installed, you can boot it and get a grub menu. If you ask grub to boot Xen, you get Xen. If you ask grub to boot Linux directly, you don't get Xen. I assume xen_enabled is for QEMU under Xen, i.e. QEMU, running under Xen, supplying emulated devices to a Xen domU guest. Since QEMU is seeing the guest address space directly, this should be much the same as QEMU !xen_enabled -- if you boot plain Linux, everything works, but if you do Xen -> QEMU -> HVM guest running Xen PV -> Linux, then virtio drivers in the Xen PV Linux guest need to translate addresses. --Andy > > -- > MST
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> >> >> wrote: >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> >> >> > >> >> >> > > I thought that PLATFORM served that purpose. Woudn't the host >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the >> >> >> > > host >> >> >> > > device would skip translation? Or is that problematic for vfio? >> >> >> > >> >> >> > Exactly that's problematic for security. >> >> >> > You can't allow guest driver to decide whether device skips security. >> >> >> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> >> >> device, and doesn't live in virtio itself. >> >> >> >> >> >> It's a property of the platform IOMMU, and lives there. >> >> > >> >> > It's a property of the hypervisor virtio implementation, and lives >> >> > there. >> >> >> >> It is now, but QEMU could, in principle, change the way it thinks >> >> about it so that virtio devices would use the QEMU DMA API but ask >> >> QEMU to pass everything through 1:1. This would be entirely invisible >> >> to guests but would make it be a property of the IOMMU implementation. >> >> At that point, maybe QEMU could find a (platform dependent) way to >> >> tell the guest what's going on. >> >> >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, >> >> set up 1:1 mappings in the guest so that the virtio devices would work >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set >> >> up with an offset. 
I don't know too much about those platforms, but >> >> presumably the layout could be changed so that 1:1 really was 1:1. >> >> >> >> --Andy >> > >> > Sure. Do you see any reason why the decision to do this can't be >> > keyed off the virtio feature bit? >> >> I can think of three types of virtio host: >> >> a) virtio always bypasses the IOMMU. >> >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say >> it does) -- i.e. virtio works like any other device. >> >> c) virtio may bypass the IOMMU depending on what the guest asks it to do. > > d) some virtio devices bypass the IOMMU and some don't, > e.g. it's harder to support IOMMU with vhost. > > >> If this is keyed off a virtio feature bit and anyone tries to >> implement (c), the vfio is going to have a problem. And, if it's >> keyed off a virtio feature bit, then (a) won't work on Xen or similar >> setups unless the Xen hypervisor adds a giant and probably unreliable >> kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a >> default x86 QEMU configuration, and I'd really like to keep it that >> way. >> >> What could plausibly work using a virtio feature bit is for a device >> to say "hey, I'm a new device and I support the platform-defined IOMMU >> mechanism". This bit would be *set* on default IOMMU-less QEMU >> configurations and on physical virtio PCI cards. > > And clear on xen. How? QEMU has no idea that the guest is running Xen.
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> >> > >> >> > > I thought that PLATFORM served that purpose. Woudn't the host >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host >> >> > > device would skip translation? Or is that problematic for vfio? >> >> > >> >> > Exactly that's problematic for security. >> >> > You can't allow guest driver to decide whether device skips security. >> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> >> device, and doesn't live in virtio itself. >> >> >> >> It's a property of the platform IOMMU, and lives there. >> > >> > It's a property of the hypervisor virtio implementation, and lives there. >> >> It is now, but QEMU could, in principle, change the way it thinks >> about it so that virtio devices would use the QEMU DMA API but ask >> QEMU to pass everything through 1:1. This would be entirely invisible >> to guests but would make it be a property of the IOMMU implementation. >> At that point, maybe QEMU could find a (platform dependent) way to >> tell the guest what's going on. >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle, >> set up 1:1 mappings in the guest so that the virtio devices would work >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the >> only obstacle is that the PPC and SPARC 1:1 mappings are currectly set >> up with an offset. I don't know too much about those platforms, but >> presumably the layout could be changed so that 1:1 really was 1:1. >> >> --Andy > > Sure. Do you see any reason why the decision to do this can't be > keyed off the virtio feature bit? 
I can think of three types of virtio host:

a) virtio always bypasses the IOMMU.

b) virtio never bypasses the IOMMU (unless DMAR tables or similar say it does) -- i.e. virtio works like any other device.

c) virtio may bypass the IOMMU depending on what the guest asks it to do.

If this is keyed off a virtio feature bit and anyone tries to implement (c), then vfio is going to have a problem. And, if it's keyed off a virtio feature bit, then (a) won't work on Xen or similar setups unless the Xen hypervisor adds a giant and probably unreliable kludge to support it. Meanwhile, 4.6-rc works fine under Xen on a default x86 QEMU configuration, and I'd really like to keep it that way.

What could plausibly work using a virtio feature bit is for a device to say "hey, I'm a new device and I support the platform-defined IOMMU mechanism". This bit would be *set* on default IOMMU-less QEMU configurations and on physical virtio PCI cards. The guest could operate accordingly. I'm not sure I see a good way for feature negotiation to work in the other direction, though. PPC and SPARC could only set this bit on emulated devices if they know that new guest kernels are in use.

--Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin wrote: > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote: >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote: >> > >> > > I thought that PLATFORM served that purpose. Wouldn't the host >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host >> > > device would skip translation? Or is that problematic for vfio? >> > >> > Exactly that's problematic for security. >> > You can't allow guest driver to decide whether device skips security. >> >> Right. Because fundamentally, this *isn't* a property of the endpoint >> device, and doesn't live in virtio itself. >> >> It's a property of the platform IOMMU, and lives there. > > It's a property of the hypervisor virtio implementation, and lives there. It is now, but QEMU could, in principle, change the way it thinks about it so that virtio devices would use the QEMU DMA API but ask QEMU to pass everything through 1:1. This would be entirely invisible to guests but would make it a property of the IOMMU implementation. At that point, maybe QEMU could find a (platform dependent) way to tell the guest what's going on. FWIW, as far as I can tell, PPC and SPARC really could, in principle, set up 1:1 mappings in the guest so that the virtio devices would work regardless of whether QEMU is ignoring the IOMMU or not -- I think the only obstacle is that the PPC and SPARC 1:1 mappings are currently set up with an offset. I don't know too much about those platforms, but presumably the layout could be changed so that 1:1 really was 1:1. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote: >> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote: >> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote: >> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> >> >> wrote: >> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell >> >> > the truth, and even legacy kernels ought to cope with that. >> >> > FSVO 'ought to' where I suspect some of them will actually crash with a >> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the >> >> > tables, which puts it back into the same camp as ARM and Power. >> >> >> >> I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU >> >> implementation on x86 has always been "experimental", so it just might >> >> be okay to change it in a way that causes some older kernels to OOPS. >> >> >> >> --Andy >> > >> > Since it's experimental, it might be OK to change *guest kernels* >> > such that they oops on old QEMU. >> > But guest kernels were not experimental - so we need a QEMU mode that >> > makes them work fine. The more functionality is available in this QEMU >> > mode, the better, because it's going to be the default for a while. For >> > the same reason, it is preferable to also have new kernels not crash in >> > this mode. >> > >> >> People add QEMU features that need new guest kernels all the time. >> If you enable virtio-scsi and try to boot a guest that's too old, it >> won't work. So I don't see anything fundamentally wrong with saying >> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest >> kernel is too old. It might be annoying, since old kernels do work on >> actual Q35 hardware, but it at least seems to me that it might be >> okay.
>> >> --Andy > > Yes but we need a mode that makes both old and new kernels work, and > that should be the default for a while. This is what the > IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA > API, new kernels go "oh compatibility mode" and bypass the IOMMU > within DMA API. I thought that PLATFORM served that purpose. Wouldn't the host advertise PLATFORM support and, if the guest doesn't ack it, the host device skip translation? Or is that problematic for vfio? > > -- > MST -- Andy Lutomirski AMA Capital Management, LLC
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" wrote: > > > I guess you are right in that we should split this part out. > What I wanted is really the combination > PASSTHROUGH && !PLATFORM so that we can say "ok we don't > need to guess, this device actually bypasses the IOMMU". What happens when you use a device like this on Xen or with a similar software translation layer? I think that a "please bypass the IOMMU" feature would be better placed in the PCI, IOMMU, or platform code. For Xen, virtio would still want to use the DMA API, just without translating at the DMAR or hardware level. Doing it in virtio is awkward, because virtio is involved at the device level and the driver level, but the translation might be entirely in between. I think a nicer long-term approach would be to have a way to ask the guest to set up a full 1:1 mapping for performance, but to still handle the case where the guest refuses to do so or where there's more than one translation layer involved. But I agree that this part shouldn't delay the other part of your series. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <m...@redhat.com> wrote: > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote: >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dw...@infradead.org> >> wrote: >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell >> > the truth, and even legacy kernels ought to cope with that. >> > FSVO 'ought to' where I suspect some of them will actually crash with a >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the >> > tables, which puts it back into the same camp as ARM and Power. >> >> I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU >> implementation on x86 has always been "experimental", so it just might >> be okay to change it in a way that causes some older kernels to OOPS. >> >> --Andy > > Since it's experimental, it might be OK to change *guest kernels* > such that they oops on old QEMU. > But guest kernels were not experimental - so we need a QEMU mode that > makes them work fine. The more functionality is available in this QEMU > mode, the better, because it's going to be the default for a while. For > the same reason, it is preferable to also have new kernels not crash in > this mode. > People add QEMU features that need new guest kernels all the time. If you enable virtio-scsi and try to boot a guest that's too old, it won't work. So I don't see anything fundamentally wrong with saying that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest kernel is too old. It might be annoying, since old kernels do work on actual Q35 hardware, but it at least seems to me that it might be okay. --Andy
Re: [Qemu-devel] [PATCH RFC] fixup! virtio: convert to use DMA api
On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse wrote: > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell > the truth, and even legacy kernels ought to cope with that. > FSVO 'ought to' where I suspect some of them will actually crash with a > NULL pointer dereference if there's no "catch-all" DMAR unit in the > tables, which puts it back into the same camp as ARM and Power. I think x86 may get a bit of a free pass here. AFAIK the QEMU IOMMU implementation on x86 has always been "experimental", so it just might be okay to change it in a way that causes some older kernels to OOPS. --Andy
Re: [Qemu-devel] [PATCH v3 1/4] firmware: introduce sysfs driver for QEMU's fw_cfg device
On Sat, Oct 3, 2015 at 4:28 PM, Gabriel L. Somlo wrote: > From: Gabriel Somlo > > Make fw_cfg entries of type "file" available via sysfs. Entries > are listed under /sys/firmware/qemu_fw_cfg/by_key, in folders > named after each entry's selector key. Filename, selector value, > and size read-only attributes are included for each entry. Also, > a "raw" attribute allows retrieval of the full binary content of > each entry. > > This patch also provides a documentation file outlining the > guest-side "hardware" interface exposed by the QEMU fw_cfg device. > What's the status of "by_name"? There's a single (presumably incorrect) mention of it in a comment in this patch. I would prefer it if the kernel populated by_name itself rather than deferring that to udev, since I'd like to use this facility in virtme, and I'd like to use fw_cfg very early in boot, before I even start udev. --Andy
Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli aarca...@redhat.com wrote: On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: mremap like interface, or file+commands protocol interface. I tend to like mremap more, that's why I opted for a remap_anon_pages syscall kept orthogonal to the userfaultfd functionality (remap_anon_pages could be also used standalone as an accelerated mremap in some circumstances) but nothing prevents to just embed the same mechanism Sorry for the self followup, but something else comes to mind to elaborate this further. In term of interfaces, the most efficient I could think of to minimize the enter/exit kernel, would be to append the source address of the data received from the network transport, to the userfaultfd_write() command (by appending 8 bytes to the wakeup command). Said that, mixing the mechanism to be notified about userfaults with the mechanism to resolve an userfault to me looks a complication. I kind of liked to keep the userfaultfd protocol is very simple and doing just its thing. The userfaultfd doesn't need to know how the userfault was resolved, even mremap would work theoretically (until we run out of vmas). I thought it was simpler to keep it that way. However if we want to resolve the fault with a write() syscall this may be the most efficient way to do it, as we're already doing a write() into the pseudofd to wakeup the page fault that contains the destination address, I just need to append the source address to the wakeup command. I probably grossly overestimated the benefits of resolving the userfault with a zerocopy page move, sorry. So if we entirely drop the zerocopy behavior and the TLB flush of the old page like you suggested, the way to keep the userfaultfd mechanism decoupled from the userfault resolution mechanism would be to implement an atomic-copy syscall. That would work for SIGBUS userfaults too without requiring a pseudofd then. 
It would be enough then to call mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic wouldn't page fault or call GUP into the destination address (it can't otherwise the in-flight partial copy would be visible to the process, breaking the atomicity of the copy), but it would fill in the pte/trans_huge_pmd with the same strict behavior that remap_anon_pages currently has (in turn it would by design bypass the VM_USERFAULT check and be ideal for resolving userfaults). At the risk of asking a possibly useless question, would it make sense to splice data into a userfaultfd? --Andy mcopy_atomic could then be also extended to tmpfs and it would work without requiring the source page to be a tmpfs page too without having to convert page types on the fly. If I add mcopy_atomic, the patch in subject (10/17) can be dropped of course so it'd be even less intrusive than the current remap_anon_pages and it would require zero TLB flush during its runtime (it would just require an atomic copy). So should I try to embed a mcopy_atomic inside userfault_write or can I expose it to userland as a standalone new syscall? Or should I do something different? Comments? Thanks, Andrea -- Andy Lutomirski AMA Capital Management, LLC
Re: [Qemu-devel] [PATCH RFC 00/11] qemu: towards virtio-1 host support
On 10/07/2014 07:39 AM, Cornelia Huck wrote: This patchset aims to get us some way to implement virtio-1 compliant and transitional devices in qemu. Branch available at git://github.com/cohuck/qemu virtio-1 I've mainly focused on: - endianness handling - extended feature bits - virtio-ccw new/changed commands At the risk of some distraction, would it be worth thinking about a solution to the IOMMU bypassing mess as part of this? --Andy
[Qemu-devel] [PATCH qemu] i386, linux-headers: Add support for kvm_get_rng_seed
This updates x86's kvm_para.h for the feature bit definition and target-i386/cpu.c for the feature name and default.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 linux-headers/asm-x86/kvm_para.h | 2 ++
 target-i386/cpu.c                | 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/linux-headers/asm-x86/kvm_para.h b/linux-headers/asm-x86/kvm_para.h
index e41c5c1..a9b27ce 100644
--- a/linux-headers/asm-x86/kvm_para.h
+++ b/linux-headers/asm-x86/kvm_para.h
@@ -24,6 +24,7 @@
 #define KVM_FEATURE_STEAL_TIME 5
 #define KVM_FEATURE_PV_EOI 6
 #define KVM_FEATURE_PV_UNHALT 7
+#define KVM_FEATURE_GET_RNG_SEED 8

 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -40,6 +41,7 @@
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME 0x4b564d03
 #define MSR_KVM_PV_EOI_EN 0x4b564d04
+#define MSR_KVM_GET_RNG_SEED 0x4b564d05

 struct kvm_steal_time {
 	__u64 steal;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 8fd1497..4ea7e6c 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -236,7 +236,7 @@ static const char *ext4_feature_name[] = {
 static const char *kvm_feature_name[] = {
     "kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock",
     "kvm_asyncpf", "kvm_steal_time", "kvm_pv_eoi", "kvm_pv_unhalt",
-    NULL, NULL, NULL, NULL,
+    "kvm_get_rng_seed", NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
     NULL, NULL, NULL, NULL,
@@ -368,7 +368,8 @@ static uint32_t kvm_default_features[FEATURE_WORDS] = {
         (1 << KVM_FEATURE_ASYNC_PF) |
         (1 << KVM_FEATURE_STEAL_TIME) |
         (1 << KVM_FEATURE_PV_EOI) |
-        (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT),
+        (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+        (1 << KVM_FEATURE_GET_RNG_SEED),
     [FEAT_1_ECX] = CPUID_EXT_X2APIC,
 };
-- 
1.9.3
Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:

Hello everyone,

There's a large CC list for this RFC because it adds two new syscalls (userfaultfd and remap_anon_pages) and MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API -- or on a completely different API, if somebody has better ideas -- are welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

The combination of these features is what I would propose to implement postcopy live migration in qemu, and in general demand paging of remote memory, hosted in different cloud nodes.

The MADV_USERFAULT feature should be generic enough that it can provide the userfaults to the Android volatile range feature too, on access of reclaimed volatile pages. If the access could ever happen in kernel context through syscalls (and not just from userland context), then userfaultfd has to be used to make the userfault unnoticeable to the syscall (no error will be returned). This latter feature is more advanced than what volatile ranges alone could do with SIGBUS so far (but it's optional: if the process doesn't call userfaultfd, the regular SIGBUS will fire, and if the fd is closed SIGBUS will also fire for any blocked userfault that was waiting for a userfaultfd_write ack).

userfaultfd is also a generic enough feature that it allows KVM to implement postcopy live migration without having to modify a single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all other GUP features work just fine in combination with userfaults (userfaults trigger async page faults in the guest scheduler, so those guest processes that aren't waiting for userfaults can keep running in the guest vcpus).
remap_anon_pages is the syscall to use to resolve the userfaults (it's not mandatory, vmsplice will likely still be used in the case of local postcopy live migration just to upgrade the qemu binary, but remap_anon_pages is faster and ideal for transferring memory across the network, it's zerocopy and doesn't touch the vma: it only holds the mmap_sem for reading). The current behavior of remap_anon_pages is very strict to avoid any chance of memory corruption going unnoticed. mremap is not strict like that: if there's a synchronization bug it would drop the destination range silently resulting in subtle memory corruption for example. remap_anon_pages would return -EEXIST in that case. If there are holes in the source range remap_anon_pages will return -ENOENT. If remap_anon_pages is used always with 2M naturally aligned addresses, transparent hugepages will not be splitted. In there could be 4k (or any size) holes in the 2M (or any size) source range, remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to relax some of its strict checks (-ENOENT won't be returned if RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as a noop on any hole in the source range). This flag is generally useful when implementing userfaults with THP granularity, but it shouldn't be set if doing the userfaults with PAGE_SIZE granularity if the developer wants to benefit from the strict -ENOENT behavior. The remap_anon_pages syscall API is not vectored, as I expect it to be used mainly for demand paging (where there can be just one faulting range per userfault) or for large ranges (with the THP model as an alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k granularity before starting the guest in the destination node) where vectoring isn't going to provide much performance advantages (thanks to the THP coarser granularity). 
On the rmap side remap_anon_pages doesn't add much complexity: there's no need of nonlinear anon vmas to support it because I added the constraint that it will fail if the mapcount is more than 1. So in general the source range of remap_anon_pages should be marked MADV_DONTFORK to prevent any risk of failure if the process ever forks (like qemu can in some case). One part that hasn't been tested is the poll() syscall on the userfaultfd because the postcopy migration thread currently is more efficient waiting on blocking read()s (I'll write some code to test poll() too). I also appended below a patch to trinity to exercise remap_anon_pages and userfaultfd and it completes trinity successfully. The code can be found here: git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault The branch is rebased so you can get updates for example with: git fetch git checkout -f origin/userfault Comments welcome, thanks! Andrea From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli aarca...@redhat.com Date: Wed, 2 Jul 2014 18:32:35 +0200 Subject: [PATCH] add remap_anon_pages and userfaultfd Signed-off-by: Andrea Arcangeli aarca...@redhat.com --- include/syscalls-x86_64.h | 2 + syscalls/remap_anon_pages.c | 100
Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:

Once a userfaultfd is created, MADV_USERFAULT regions talk through the userfaultfd protocol with the thread responsible for doing the memory externalization of the process.

The protocol starts by userland writing the requested/preferred USERFAULT_PROTOCOL version into the userfault fd (64bit write); if the kernel knows it, it will ack it by allowing userland to read 64 bits from the userfault fd containing the same 64bit USERFAULT_PROTOCOL version that userland asked for. Otherwise userland will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and will have to try again by writing an older protocol version, if suitable for its usage too, and read it back again until it stops reading -1ULL. After that the userfaultfd protocol starts.

The protocol consists of 64bit reads from the userfault fd, providing userland the fault addresses. After a userfault address has been read and the fault has been resolved by userland, the application must write back 128 bits in the form of a [ start, end ] range (64 bits each) telling the kernel that such a range has been mapped. Multiple read userfaults can be resolved in a single range write.

poll() can be used to know when there are new userfaults to read (POLLIN) and when there are threads waiting on a wakeup through a range write (POLLOUT).
Signed-off-by: Andrea Arcangeli aarca...@redhat.com

+#ifdef CONFIG_PROC_FS
+static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	int ret;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;

This looks like it can't be used more than once in a process. That will be unfortunate for libraries. Would it be feasible to either have userfaultfd claim a range of addresses or for a vma to be explicitly associated with a userfaultfd? (In the latter case, giant PROT_NONE MAP_NORESERVE mappings could be used.)
Re: [Qemu-devel] Turning off default storage devices?
On Mon, Apr 14, 2014 at 1:15 AM, Markus Armbruster <arm...@redhat.com> wrote:
> Peter Crosthwaite <peter.crosthwa...@xilinx.com> writes:
>> Hi Andy,
>>
>> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>>> Currently, -M q35 boots Linux quite a bit slower than the default
>>> machine type. This seems to be because it takes a few hundred ms to
>>> determine that there's nothing attached to the AHCI controller. In
>>> virtio setups, there will probably never be anything attached to the
>>> AHCI controller.
>>>
>>> Would it be possible to add something like -machine
>>> default_storage=off to turn off default storage devices? This could
>>> include the AHCI on q35 and the cdrom and such on pc.
>>>
>>> There's precedent: -machine usb=off turns off the default USB
>>> controllers, which is great for setups that use xhci.
>>
>> Is there a more generic solution to your problem? Can you implement
>> command-line device removal in a non-specific way and avoid having to
>> invent AHCI- or even storage-specific arguments? You could consider
>> bringing the xhci use case you mentioned under the same umbrella.
>
> USB has always been off by default, at least for the boards I'm
> familiar with, due to the USB emulation's non-trivial CPU use. There's
> no such thing as a Q35 board without USB in the physical world. Can't
> stop us from making a virtual one, of course.
>
> Likewise, there's no such thing as a Q35 board without AHCI in the
> physical world, and again that can't stop us from making a virtual
> one. The difference to USB is that our q35 machines have always had
> AHCI, even with -nodefaults. You seem to propose adding a switch to
> disable AHCI, yet leave it enabled with -nodefaults. -nodefaults
> should give you a board with all the optional components suppressed.
>
> Will this break libvirt, which may expect -nodefaults to still come
> with an IDE bus? On the one hand, I'd rather not add exceptions to the
> "-nodefaults gives me the board with all its optional components
> suppressed" semantics. On the other hand, a few hundred ms are a long
> time.

That's why I proposed a new option. Yes, it's ugly :/

--Andy
Re: [Qemu-devel] Turning off default storage devices?
On Wed, Apr 9, 2014 at 8:13 PM, Peter Crosthwaite <peter.crosthwa...@xilinx.com> wrote:
> On Thu, Apr 10, 2014 at 9:57 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite
>> <peter.crosthwa...@xilinx.com> wrote:
>>> Hi Andy,
>>>
>>> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>>>> Currently, -M q35 boots Linux quite a bit slower than the default
>>>> machine type. This seems to be because it takes a few hundred ms to
>>>> determine that there's nothing attached to the AHCI controller. In
>>>> virtio setups, there will probably never be anything attached to
>>>> the AHCI controller.
>>>>
>>>> Would it be possible to add something like -machine
>>>> default_storage=off to turn off default storage devices? This could
>>>> include the AHCI on q35 and the cdrom and such on pc.
>>>>
>>>> There's precedent: -machine usb=off turns off the default USB
>>>> controllers, which is great for setups that use xhci.
>>>
>>> Is there a more generic solution to your problem? Can you implement
>>> command-line device removal in a non-specific way and avoid having
>>> to invent AHCI- or even storage-specific arguments? You could
>>> consider bringing the xhci use case you mentioned under the same
>>> umbrella.
>>
>> An option like -suppress-default-device foobar to turn off the device
>> named foobar would work, but what happens if that device is a bus?
>
> Let's call that a misuse in the first instance. But in general, when
> attaching devices QEMU should be able to gracefully fail on unresolved
> deps. So it would be reasonable to work on that assumption, given that
> every device should be able to handle a missing bus/gpio/interrupt
> etc. due to -device misusability.
>
>> Will this just cause QEMU to crash? Maybe the machine code would have
>> to opt in to allowing this kind of suppression, and there could be a
>> general error if you try to suppress a device that can't be
>> suppressed.
>
> I would argue that there is no such thing. You may end up with a
> useless machine, but it's still valid to suppress something; by
> extension, all its dependents are then non-functional.

The q35 code is:

    /* ahci and SATA device, for q35 1 ahci controller is built-in */
    ahci = pci_create_simple_multifunction(host_bus,
                                           PCI_DEVFN(ICH9_SATA1_DEV,
                                                     ICH9_SATA1_FUNC),
                                           true, "ich9-ahci");
    idebus[0] = qdev_get_child_bus(&ahci->qdev, "ide.0");
    idebus[1] = qdev_get_child_bus(&ahci->qdev, "ide.1");

It looks like making pci_create_simple_multifunction() return NULL will
crash quite quickly. Even fixing the next two lines will just cause null
pointer dereferences later on. Is there a different way to indicate that
a device wasn't actually created?

--Andy
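If device creation could return NULL for a suppressed device, every
caller would need a guard along the lines sketched below. This is a
self-contained toy in plain C, not QEMU code; the names (`create_device`,
`wire_sata`, `suppress_ahci`) are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Toy model, not QEMU code.  The point is the shape of the guard: a
 * creation helper that may return NULL when the machine suppressed the
 * device, and a caller that skips the dependent wiring instead of
 * dereferencing the result unconditionally. */

typedef struct Device { const char *name; } Device;

static int suppress_ahci;   /* would come from a -machine option */

/* Stand-in for pci_create_simple_multifunction(): returns NULL instead
 * of a device when the user asked for the controller to be suppressed. */
static Device *create_device(const char *name)
{
    static Device ahci = { "ich9-ahci" };
    if (suppress_ahci && strcmp(name, "ich9-ahci") == 0) {
        return NULL;
    }
    return &ahci;
}

/* Caller-side guard: wire the IDE buses only if the controller exists.
 * Returns 1 if the controller was created, 0 if it was suppressed. */
static int wire_sata(Device **idebus0, Device **idebus1)
{
    Device *ahci = create_device("ich9-ahci");
    if (!ahci) {
        *idebus0 = *idebus1 = NULL;  /* buses simply absent; no crash */
        return 0;
    }
    *idebus0 = ahci;  /* stands in for qdev_get_child_bus(..., "ide.0") */
    *idebus1 = ahci;  /* ...and "ide.1" */
    return 1;
}
```

Of course, every downstream consumer of idebus[] would need the same
NULL awareness, which is presumably why the naive change crashes.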
[Qemu-devel] Turning off default storage devices?
Currently, -M q35 boots Linux quite a bit slower than the default
machine type. This seems to be because it takes a few hundred ms to
determine that there's nothing attached to the AHCI controller. In
virtio setups, there will probably never be anything attached to the
AHCI controller.

Would it be possible to add something like -machine default_storage=off
to turn off default storage devices? This could include the AHCI on q35
and the cdrom and such on pc.

There's precedent: -machine usb=off turns off the default USB
controllers, which is great for setups that use xhci.

Thanks,
Andy
Re: [Qemu-devel] Turning off default storage devices?
On Wed, Apr 9, 2014 at 4:53 PM, Peter Crosthwaite <peter.crosthwa...@xilinx.com> wrote:
> Hi Andy,
>
> On Thu, Apr 10, 2014 at 5:55 AM, Andy Lutomirski <l...@amacapital.net> wrote:
>> Currently, -M q35 boots Linux quite a bit slower than the default
>> machine type. This seems to be because it takes a few hundred ms to
>> determine that there's nothing attached to the AHCI controller. In
>> virtio setups, there will probably never be anything attached to the
>> AHCI controller.
>>
>> Would it be possible to add something like -machine
>> default_storage=off to turn off default storage devices? This could
>> include the AHCI on q35 and the cdrom and such on pc.
>>
>> There's precedent: -machine usb=off turns off the default USB
>> controllers, which is great for setups that use xhci.
>
> Is there a more generic solution to your problem? Can you implement
> command-line device removal in a non-specific way and avoid having to
> invent AHCI- or even storage-specific arguments? You could consider
> bringing the xhci use case you mentioned under the same umbrella.

An option like -suppress-default-device foobar to turn off the device
named foobar would work, but what happens if that device is a bus? Will
this just cause QEMU to crash? Maybe the machine code would have to opt
in to allowing this kind of suppression, and there could be a general
error if you try to suppress a device that can't be suppressed.

I can try to code this up, but I know nothing about QEMU internals. I'm
just a user :)

--Andy
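The opt-in idea floated above could amount to little more than a
per-machine table of suppressible device names, checked before machine
init runs. A minimal sketch in plain C; the table and helper are
hypothetical, and QEMU has no such mechanism:

```c
#include <string.h>

/* Hypothetical sketch: each machine publishes which of its default
 * devices it is prepared to see suppressed, and a request to suppress
 * anything else is rejected up front instead of crashing later.  The
 * table contents and helper name are invented for illustration. */

static const char *const q35_suppressible[] = {
    "ich9-ahci",   /* the slow-to-probe SATA controller from the thread */
    NULL,          /* sentinel */
};

/* Returns 1 if the machine opted in to suppressing this device, else 0.
 * A command-line parser would turn 0 into a clean startup error. */
static int can_suppress(const char *const *table, const char *name)
{
    for (; *table; table++) {
        if (strcmp(*table, name) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Suppressing a device not in the table ("can't be suppressed") then
becomes a clean, general error rather than a crash.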
[Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver
Running:

    ./virtme-run --installed-kernel

from this virtme commit:

https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f

results in a bunch of missing lines of text once bootup finishes.
Pressing enter a few times gradually fixes it. I don't know whether this
is a qemu bug or a Linux bug.

I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent 3.14-rc
kernel. For the latter, cirrus is built-in (not a module), I'm running:

    virtme-run --kimg arch/x86/boot/bzImage

and I see more profound corruption.

--Andy
Re: [Qemu-devel] Framebuffer corruption in QEMU or Linux's cirrus driver
On Tue, Apr 1, 2014 at 3:09 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> Running:
>
>     ./virtme-run --installed-kernel
>
> from this virtme commit:
>
> https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f
>
> results in a bunch of missing lines of text once bootup finishes.
> Pressing enter a few times gradually fixes it. I don't know whether
> this is a qemu bug or a Linux bug.
>
> I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent
> 3.14-rc kernel. For the latter, cirrus is built-in (not a module), I'm
> running:
>
>     virtme-run --kimg arch/x86/boot/bzImage
>
> and I see more profound corruption.

I'm guessing this is a cirrus drm bug. bochs-drm (using virtme-run
--installed-kernel --qemu-opts -vga std) does not appear to have the
same issue. Neither does qxl. (qxl is painfully slow, though, and it
doesn't seem to be using UC memory.)

--Andy
[Qemu-devel] Re: [PATCH-V4 0/7] virtio-9p:Introducing security model for VirtFS
Venkateswararao Jujjuri (JV) wrote:
> This patch series introduces the security model for VirtFS.
>
> Brief description of this patch series:
>
> It introduces two types of security models for VirtFS: mapped and
> passthrough. The following is common to both security models:
> * Client's VFS determines/enforces the access control. Largely, the
>   server should never return EACCES.
> * Client sends gid/mode-bit information as part of creation only.
>
> Changes from V3
> ---------------
> o Return NULL instead of exit(1) on failure in virtio_9p_init().
> o Capitalized sm_passthrough, sm_mapped.
> o Added handling for EINTR for read/write.
> o Corrected default permissions for mkdir in mapped mode.
> o Added additional error handling.
>
> Changes from V2
> ---------------
> o Removed warnings resulting from chmod/chown.
> o Added code to fail normally if the security_model option is not
>   specified.
>
> Changes from V1
> ---------------
> o Added support for chmod and chown.
> o Used chmod/chown to set credentials instead of setuid/setgid.
> o Fixed a bug where uid was used instead of gid.
>
> Security model: mapped
> ----------------------
> The VirtFS server (QEMU) intercepts and maps all file-object create
> requests. Files on the fileserver will be created with QEMU's user
> credentials, and the client-user's credentials are stored in extended
> attributes. During getattr(), the server extracts the client-user's
> credentials from the extended attributes and sends them to the client.
>
> Given that only user-space extended attributes are available to
> regular files, special files are created as regular files on the
> fileserver, and the appropriate mode bits are stored in xattrs to be
> extracted during getattr. If the extended attributes are missing, the
> server sends back the filesystem stat() unaltered. This provision
> makes the files created on the fileserver usable to the client.
>
> Points to be considered:
> * Filesystem will be VirtFS'ized. Meaning, other filesystems may not
>   understand the credentials of the files created under this model.

How hard would it be to make this compatible with rsync's --fake-super?

(--fake-super already does almost what you're doing, and if you make the
formats compatible, then rsync could be used to translate. OTOH,
rsyncing a VirtFS-ified filesystem to a remote --fake-super system might
have odd side-effects.)

--Andy
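The mapped model's core operation is a round trip: serialize the
client's uid/gid/mode into a user.* xattr blob at create time, parse it
back at getattr time. A minimal sketch in C; the "uid:gid:octal-mode"
blob format here is invented for illustration and matches neither
VirtFS's nor rsync --fake-super's actual on-disk encoding:

```c
#include <stdio.h>
#include <sys/types.h>

/* Serialize the client-user's credentials into the blob the server
 * would store in a user.* extended attribute.  Format is hypothetical:
 * "uid:gid:mode" with mode in octal.  Returns the snprintf length. */
static int encode_creds(char *buf, size_t len,
                        uid_t uid, gid_t gid, mode_t mode)
{
    return snprintf(buf, len, "%u:%u:%o",
                    (unsigned)uid, (unsigned)gid, (unsigned)mode);
}

/* What getattr() would do: recover the stored credentials and report
 * them to the client in place of the server's own.  Returns -1 if the
 * blob is missing or garbled, in which case the caller falls back to
 * the unaltered stat() results, as the cover letter describes. */
static int decode_creds(const char *buf,
                        uid_t *uid, gid_t *gid, mode_t *mode)
{
    unsigned u, g, m;
    if (sscanf(buf, "%u:%u:%o", &u, &g, &m) != 3) {
        return -1;
    }
    *uid = u;
    *gid = g;
    *mode = m;
    return 0;
}
```

Making this interoperable with --fake-super would mostly mean agreeing
on the xattr name and blob layout, which is the compatibility question
raised above.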