Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
Hello Steve, On Thu, Apr 01, 2021 at 06:40:06PM -0700, Steve Rutherford wrote: > On Fri, Mar 19, 2021 at 11:00 AM Ashish Kalra wrote: > > > > On Thu, Mar 11, 2021 at 12:48:07PM -0800, Steve Rutherford wrote: > > > On Thu, Mar 11, 2021 at 10:15 AM Ashish Kalra > > > wrote: > > > > > > > > On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > > > > > [+Marc] > > > > > > > > > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > > > > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford > > > > > > > > wrote: > > > > > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > > > > > wrote: > > > > > > > > > Thanks for grabbing the data! > > > > > > > > > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire > > > > > > > > > for > > > > > > > > > hypercall exiting, so I think that would be the current > > > > > > > > > consensus. > > > > > > > > > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > > > > > > > > > If we want to do hypercall exiting, this should be in a > > > > > > > > > follow-up > > > > > > > > > series where we implement something more generic, e.g. a > > > > > > > > > hypercall > > > > > > > > > exiting bitmap or hypercall exit list. If we are taking the > > > > > > > > > hypercall > > > > > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > > > > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > > > > > interception. Or > > > > > > > rather, I think hypercall interception should be an orthogonal > > > > > > > implementation. > > > > > > > > > > > > > > The guest, including guest firmware, needs to be aware that the > > > > > > > hypercall is > > > > > > > supported, and the ABI needs to be well-defined. 
Relying on > > > > > > > userspace VMMs to > > > > > > > implement a common ABI is an unnecessary risk. > > > > > > > > > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM > > > > > > > enforce the ABI but > > > > > > > require further VMM intervention. But, I just don't see the > > > > > > > point, it would > > > > > > > save only a few lines of code. It would also limit what KVM > > > > > > > could do in the > > > > > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit > > > > > > > to userspace, > > > > > > > then mandatory interception would essentially make it impossible > > > > > > > for KVM to do > > > > > > > bookkeeping while still honoring the interception request. > > > > > > > > > > > > > > However, I do think it would make sense to have the userspace > > > > > > > exit be a generic > > > > > > > exit type. But hey, we already have the necessary ABI defined > > > > > > > for that! It's > > > > > > > just not used anywhere. > > > > > > > > > > > > > > /* KVM_EXIT_HYPERCALL */ > > > > > > > struct { > > > > > > > __u64 nr; > > > > > > > __u64 args[6]; > > > > > > > __u64 ret; > > > > > > > __u32 longmode; > > > > > > > __u32 pad; > > > > > > > } hypercall; > > > > > > > > > > > > > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would > > > > > > > > > need to > > > > > > > > > confirm that). Then userspace could also be in control of > > > > > > > > > the cpuid bit. > > > > > > > > > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 > > > > > > > bits of data. > > > > > > > The data limitation could be fudged by shoving data into > > > > > > > non-standard GPRs, but > > > > > > > that will result in truly heinous guest code, and extensibility > > > > > > > issues. 
> > > > > > > > > > > We may also need to pass-through the MSR to userspace, as it is part of > > this > > complete host (userspace/kernel), OVMF and guest kernel negotiation of > > the SEV live migration feature. > > > > Host (userspace/kernel) advertises its support for the SEV live migration > > feature via CPUID bits, which is queried by OVMF, which in turn > > adds a new UEFI runtime variable to indicate support for SEV live > > migration; this variable is later queried during guest kernel boot, and > > accordingly the guest does a wrmsrl() to the custom MSR to complete the SEV > > live migration negotiation and enable it. > > > > Now, the GET_SHARED_REGION_LIST ioctl returns an error until this MSR write > > enables SEV live migration, thus preventing userspace from starting live > > migration before the feature support has been negotiated and enabled on > > all three components - host, guest OVMF, and kernel. > > > > But now, with this ioctl no longer existing, we will need to > > pass-through the MSR to userspace too, so that it only initiates live > > migration once the feature negotiation has been completed. > > I can't tell if you were waiting for feedback on this
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Fri, Mar 19, 2021 at 11:00 AM Ashish Kalra wrote: > > On Thu, Mar 11, 2021 at 12:48:07PM -0800, Steve Rutherford wrote: > > On Thu, Mar 11, 2021 at 10:15 AM Ashish Kalra wrote: > > > > > > On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > > > > [+Marc] > > > > > > > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > > > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > > > > wrote: > > > > > > > > Thanks for grabbing the data! > > > > > > > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire > > > > > > > > for > > > > > > > > hypercall exiting, so I think that would be the current > > > > > > > > consensus. > > > > > > > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > > > > > > > If we want to do hypercall exiting, this should be in a > > > > > > > > follow-up > > > > > > > > series where we implement something more generic, e.g. a > > > > > > > > hypercall > > > > > > > > exiting bitmap or hypercall exit list. If we are taking the > > > > > > > > hypercall > > > > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > > > > interception. Or > > > > > > rather, I think hypercall interception should be an orthogonal > > > > > > implementation. > > > > > > > > > > > > The guest, including guest firmware, needs to be aware that the > > > > > > hypercall is > > > > > > supported, and the ABI needs to be well-defined. Relying on > > > > > > userspace VMMs to > > > > > > implement a common ABI is an unnecessary risk. > > > > > > > > > > > > We could make KVM's default behavior be a nop, i.e. 
have KVM > > > > > > enforce the ABI but > > > > > > require further VMM intervention. But, I just don't see the point, > > > > > > it would > > > > > > save only a few lines of code. It would also limit what KVM could > > > > > > do in the > > > > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > > > > > userspace, > > > > > > then mandatory interception would essentially make it impossible > > > > > > for KVM to do > > > > > > bookkeeping while still honoring the interception request. > > > > > > > > > > > > However, I do think it would make sense to have the userspace exit > > > > > > be a generic > > > > > > exit type. But hey, we already have the necessary ABI defined for > > > > > > that! It's > > > > > > just not used anywhere. > > > > > > > > > > > > /* KVM_EXIT_HYPERCALL */ > > > > > > struct { > > > > > > __u64 nr; > > > > > > __u64 args[6]; > > > > > > __u64 ret; > > > > > > __u32 longmode; > > > > > > __u32 pad; > > > > > > } hypercall; > > > > > > > > > > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would > > > > > > > > need to > > > > > > > > confirm that). Then userspace could also be in control of the > > > > > > > > cpuid bit. > > > > > > > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits > > > > > > of data. > > > > > > The data limitation could be fudged by shoving data into > > > > > > non-standard GPRs, but > > > > > > that will result in truly heinous guest code, and extensibility > > > > > > issues. > > > > > > > > We may also need to pass-through the MSR to userspace, as it is a part of this > complete host (userspace/kernel), OVMF and guest kernel negotiation of > the SEV live migration feature. 
> > Host (userspace/kernel) advertises its support for the SEV live migration > feature via CPUID bits, which is queried by OVMF, which in turn > adds a new UEFI runtime variable to indicate support for SEV live > migration; this variable is later queried during guest kernel boot, and > accordingly the guest does a wrmsrl() to the custom MSR to complete the SEV > live migration negotiation and enable it. > > Now, the GET_SHARED_REGION_LIST ioctl returns an error until this MSR write > enables SEV live migration, thus preventing userspace from starting live > migration before the feature support has been negotiated and enabled on > all three components - host, guest OVMF, and kernel. > > But now, with this ioctl no longer existing, we will need to > pass-through the MSR to userspace too, so that it only initiates live > migration once the feature negotiation has been completed. Hey Ashish, I can't tell if you were waiting for feedback on this before posting the follow-up patch series. Here are a few options: 1) Add the MSR explicitly to the list of custom kvm MSRs, but don't have it hooked up anywhere. The expectation would be for the VMM to use MSR intercepts to handle the reads and writes. If that seems weird, have svm_set_msr (or
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Mar 11, 2021 at 12:48:07PM -0800, Steve Rutherford wrote: > On Thu, Mar 11, 2021 at 10:15 AM Ashish Kalra wrote: > > > > On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > > > [+Marc] > > > > > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > > > wrote: > > > > > > > Thanks for grabbing the data! > > > > > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > > > series where we implement something more generic, e.g. a hypercall > > > > > > > exiting bitmap or hypercall exit list. If we are taking the > > > > > > > hypercall > > > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > > > interception. Or > > > > > rather, I think hypercall interception should be an orthogonal > > > > > implementation. > > > > > > > > > > The guest, including guest firmware, needs to be aware that the > > > > > hypercall is > > > > > supported, and the ABI needs to be well-defined. Relying on > > > > > userspace VMMs to > > > > > implement a common ABI is an unnecessary risk. > > > > > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce > > > > > the ABI but > > > > > require further VMM intervention. But, I just don't see the point, > > > > > it would > > > > > save only a few lines of code. It would also limit what KVM could do > > > > > in the > > > > > future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > > > > > userspace, > > > > > then mandatory interception would essentially make it impossible for > > > > > KVM to do > > > > > bookkeeping while still honoring the interception request. > > > > > > > > > > However, I do think it would make sense to have the userspace exit be > > > > > a generic > > > > > exit type. But hey, we already have the necessary ABI defined for > > > > > that! It's > > > > > just not used anywhere. > > > > > > > > > > /* KVM_EXIT_HYPERCALL */ > > > > > struct { > > > > > __u64 nr; > > > > > __u64 args[6]; > > > > > __u64 ret; > > > > > __u32 longmode; > > > > > __u32 pad; > > > > > } hypercall; > > > > > > > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would need > > > > > > > to > > > > > > > confirm that). Then userspace could also be in control of the > > > > > > > cpuid bit. > > > > > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits > > > > > of data. > > > > > The data limitation could be fudged by shoving data into non-standard > > > > > GPRs, but > > > > > that will result in truly heinous guest code, and extensibility > > > > > issues. > > > > > We may also need to pass-through the MSR to userspace, as it is part of this complete host (userspace/kernel), OVMF and guest kernel negotiation of the SEV live migration feature. Host (userspace/kernel) advertises its support for the SEV live migration feature via CPUID bits, which is queried by OVMF, which in turn adds a new UEFI runtime variable to indicate support for SEV live migration; this variable is later queried during guest kernel boot, and accordingly the guest does a wrmsrl() to the custom MSR to complete the SEV live migration negotiation and enable it. 
Now, the GET_SHARED_REGION_LIST ioctl returns an error until this MSR write enables SEV live migration, thus preventing userspace from starting live migration before the feature support has been negotiated and enabled on all three components - host, guest OVMF, and kernel. But now, with this ioctl no longer existing, we will need to pass-through the MSR to userspace too, so that it only initiates live migration once the feature negotiation has been completed. Thanks, Ashish > > > > > The data limitation is a moot point, because the x86-only thing is a > > > > > deal > > > > > breaker. arm64's pKVM work has a near-identical use case for a guest > > > > > to share > > > > > memory with a host. I can't think of a clever way to avoid having to > > > > > support > > > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not > > > > > have > > > > > multiple KVM variants. > > > > > > > > Looking at arm64's pKVM work, I see that it is a recently introduced RFC > > > > patch-set and probably relevant to arm64 nVHE hypervisor > > > > mode/implementation, and potentially makes sense as it adds guest > > > >
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Mar 11, 2021 at 10:15 AM Ashish Kalra wrote: > > On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > > [+Marc] > > > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > > wrote: > > > > > > Thanks for grabbing the data! > > > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > > series where we implement something more generic, e.g. a hypercall > > > > > > exiting bitmap or hypercall exit list. If we are taking the > > > > > > hypercall > > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > > interception. Or > > > > rather, I think hypercall interception should be an orthogonal > > > > implementation. > > > > > > > > The guest, including guest firmware, needs to be aware that the > > > > hypercall is > > > > supported, and the ABI needs to be well-defined. Relying on userspace > > > > VMMs to > > > > implement a common ABI is an unnecessary risk. > > > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce > > > > the ABI but > > > > require further VMM intervention. But, I just don't see the point, it > > > > would > > > > save only a few lines of code. It would also limit what KVM could do > > > > in the > > > > future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > > > > userspace, > > > > then mandatory interception would essentially make it impossible for > > > > KVM to do > > > > bookkeeping while still honoring the interception request. > > > > > > > > However, I do think it would make sense to have the userspace exit be a > > > > generic > > > > exit type. But hey, we already have the necessary ABI defined for > > > > that! It's > > > > just not used anywhere. > > > > > > > > /* KVM_EXIT_HYPERCALL */ > > > > struct { > > > > __u64 nr; > > > > __u64 args[6]; > > > > __u64 ret; > > > > __u32 longmode; > > > > __u32 pad; > > > > } hypercall; > > > > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > > > confirm that). Then userspace could also be in control of the > > > > > > cpuid bit. > > > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > > > data. > > > > The data limitation could be fudged by shoving data into non-standard > > > > GPRs, but > > > > that will result in truly heinous guest code, and extensibility issues. > > > > > > > > The data limitation is a moot point, because the x86-only thing is a > > > > deal > > > > breaker. arm64's pKVM work has a near-identical use case for a guest > > > > to share > > > > memory with a host. I can't think of a clever way to avoid having to > > > > support > > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not > > > > have > > > > multiple KVM variants. > > > > > > Looking at arm64's pKVM work, I see that it is a recently introduced RFC > > > patch-set and probably relevant to arm64 nVHE hypervisor > > > mode/implementation, and potentially makes sense as it adds guest > > > memory protection as both host and guest kernels are running on the same > > > privilege level? 
> > > > > > Though I do see that the pKVM stuff adds two hypercalls, specifically: > > > pkvm_create_mappings() (I assume this is for setting shared memory > > > regions between host and guest) & > > > pkvm_create_private_mappings(). > > > > > > And the use-cases are quite similar to memory protection architectures' > > > use cases, for example, use with virtio devices, guest DMA I/O, etc. > > > > These hypercalls are both private to the host kernel communicating with > > its hypervisor counterpart, so I don't think they're particularly > > relevant here. As far as I can see, the more useful thing is to allow > > the guest to communicate back to the host (and the VMM) that it has opened > > up a memory window, perhaps for virtio rings or some other shared memory. > > > > We hacked this up as a prototype in the past: > > > >
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > [+Marc] > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > wrote: > > > > > Thanks for grabbing the data! > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > series where we implement something more generic, e.g. a hypercall > > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > interception. Or > > > rather, I think hypercall interception should be an orthogonal > > > implementation. > > > > > > The guest, including guest firmware, needs to be aware that the hypercall > > > is > > > supported, and the ABI needs to be well-defined. Relying on userspace > > > VMMs to > > > implement a common ABI is an unnecessary risk. > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > > ABI but > > > require further VMM intervention. But, I just don't see the point, it > > > would > > > save only a few lines of code. It would also limit what KVM could do in > > > the > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > > userspace, > > > then mandatory interception would essentially make it impossible for KVM > > > to do > > > bookkeeping while still honoring the interception request. 
> > > > > > However, I do think it would make sense to have the userspace exit be a > > > generic > > > exit type. But hey, we already have the necessary ABI defined for that! > > > It's > > > just not used anywhere. > > > > > > /* KVM_EXIT_HYPERCALL */ > > > struct { > > > __u64 nr; > > > __u64 args[6]; > > > __u64 ret; > > > __u32 longmode; > > > __u32 pad; > > > } hypercall; > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > > confirm that). Then userspace could also be in control of the cpuid > > > > > bit. > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > > data. > > > The data limitation could be fudged by shoving data into non-standard > > > GPRs, but > > > that will result in truly heinous guest code, and extensibility issues. > > > > > > The data limitation is a moot point, because the x86-only thing is a deal > > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > > share > > > memory with a host. I can't think of a clever way to avoid having to > > > support > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > > multiple KVM variants. > > > > Looking at arm64's pKVM work, I see that it is a recently introduced RFC > > patch-set and probably relevant to arm64 nVHE hypervisor > > mode/implementation, and potentially makes sense as it adds guest > > memory protection as both host and guest kernels are running on the same > > privilege level? > > > > Though I do see that the pKVM stuff adds two hypercalls, specifically: > > > > pkvm_create_mappings() (I assume this is for setting shared memory > > regions between host and guest) & > > pkvm_create_private_mappings(). > > > > And the use-cases are quite similar to memory protection architectures' > > use cases, for example, use with virtio devices, guest DMA I/O, etc. 
> > These hypercalls are both private to the host kernel communicating with > its hypervisor counterpart, so I don't think they're particularly > relevant here. As far as I can see, the more useful thing is to allow > the guest to communicate back to the host (and the VMM) that it has opened > up a memory window, perhaps for virtio rings or some other shared memory. > > We hacked this up as a prototype in the past: > > https://android-kvm.googlesource.com/linux/+/d12a9e2c12a52cf7140d40cd9fa092dc8a85fac9%5E%21/ > > but that's all arm64-specific and if we're solving the same problem as > you, then let's avoid arch-specific stuff if possible. The way in which > the guest discovers the interface
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Tue, Mar 9, 2021 at 7:42 PM Kalra, Ashish wrote: > > > > > On Mar 9, 2021, at 3:22 AM, Steve Rutherford wrote: > > > > On Mon, Mar 8, 2021 at 1:11 PM Brijesh Singh wrote: > >> > >> > >>> On 3/8/21 1:51 PM, Sean Christopherson wrote: > >>> On Mon, Mar 08, 2021, Ashish Kalra wrote: > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > +Will and Quentin (arm64) > > > > Moving the non-KVM x86 folks to bcc, I don't think they care about KVM > > details at this > > point. > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > >> On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > >>> On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > >>> wrote: > >>> Thanks for grabbing the data! > >>> > >>> I am fine with both paths. Sean has stated an explicit desire for > >>> hypercall exiting, so I think that would be the current consensus. > > Yep, though it'd be good to get Paolo's input, too. > > > >>> If we want to do hypercall exiting, this should be in a follow-up > >>> series where we implement something more generic, e.g. a hypercall > >>> exiting bitmap or hypercall exit list. If we are taking the hypercall > >>> exit route, we can drop the kvm side of the hypercall. > > I don't think this is a good candidate for arbitrary hypercall > > interception. Or > > rather, I think hypercall interception should be an orthogonal > > implementation. > > > > The guest, including guest firmware, needs to be aware that the > > hypercall is > > supported, and the ABI needs to be well-defined. Relying on userspace > > VMMs to > > implement a common ABI is an unnecessary risk. > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce > > the ABI but > > require further VMM intervention. But, I just don't see the point, it > > would > > save only a few lines of code. It would also limit what KVM could do > > in the > > future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > > userspace, > > then mandatory interception would essentially make it impossible for > > KVM to do > > bookkeeping while still honoring the interception request. > > > > However, I do think it would make sense to have the userspace exit be a > > generic > > exit type. But hey, we already have the necessary ABI defined for > > that! It's > > just not used anywhere. > > > >/* KVM_EXIT_HYPERCALL */ > >struct { > >__u64 nr; > >__u64 args[6]; > >__u64 ret; > >__u32 longmode; > >__u32 pad; > >} hypercall; > > > > > >>> Userspace could also handle the MSR using MSR filters (would need to > >>> confirm that). Then userspace could also be in control of the cpuid > >>> bit. > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > data. > > The data limitation could be fudged by shoving data into non-standard > > GPRs, but > > that will result in truly heinous guest code, and extensibility issues. > > > > The data limitation is a moot point, because the x86-only thing is a > > deal > > breaker. arm64's pKVM work has a near-identical use case for a guest > > to share > > memory with a host. I can't think of a clever way to avoid having to > > support > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not > > have > > multiple KVM variants. > > > Potentially, there is another reason for in-kernel hypercall handling > considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state > of each guest page, for instance pages in hypervisor state, i.e., pages > with C=0 and pages in guest valid state with C=1. > > Now, there shouldn't be a need for page encryption status hypercalls on > SEV-SNP as KVM can track & reference guest page status directly using > the RMP table. > >>> Relying on the RMP table itself would require locking the RMP table for an > >>> extended duration, and walking the entire RMP to find shared pages would > >>> be > >>> very inefficient. 
> >>> > As KVM maintains the RMP table, therefore we will need SET/GET type of > interfaces to provide the guest page encryption status to userspace. > >>> Hrm, somehow I temporarily forgot about SNP and TDX adding their own > >>> hypercalls > >>> for converting between shared and private. And in the case of TDX, the > >>> hypercall > >>> can't be trusted, i.e. is just a hint, otherwise the guest could induce a > >>> #MC in > >>> the host. > >>> > >>> But, the different guest behavior doesn't require KVM to maintain a > >>> list/tree, > >>> e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page > >>> encryption > >>> status changes
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
> On Mar 9, 2021, at 3:22 AM, Steve Rutherford wrote: > > On Mon, Mar 8, 2021 at 1:11 PM Brijesh Singh wrote: >> >> >>> On 3/8/21 1:51 PM, Sean Christopherson wrote: >>> On Mon, Mar 08, 2021, Ashish Kalra wrote: On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > +Will and Quentin (arm64) > > Moving the non-KVM x86 folks to bcc, I don't think they care about KVM details > at this > point. > > On Fri, Feb 26, 2021, Ashish Kalra wrote: >> On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: >>> On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra >>> wrote: >>> Thanks for grabbing the data! >>> >>> I am fine with both paths. Sean has stated an explicit desire for >>> hypercall exiting, so I think that would be the current consensus. > Yep, though it'd be good to get Paolo's input, too. > >>> If we want to do hypercall exiting, this should be in a follow-up >>> series where we implement something more generic, e.g. a hypercall >>> exiting bitmap or hypercall exit list. If we are taking the hypercall >>> exit route, we can drop the kvm side of the hypercall. > I don't think this is a good candidate for arbitrary hypercall > interception. Or > rather, I think hypercall interception should be an orthogonal > implementation. > > The guest, including guest firmware, needs to be aware that the hypercall > is > supported, and the ABI needs to be well-defined. Relying on userspace > VMMs to > implement a common ABI is an unnecessary risk. > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > ABI but > require further VMM intervention. But, I just don't see the point, it > would > save only a few lines of code. It would also limit what KVM could do in > the > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > userspace, > then mandatory interception would essentially make it impossible for KVM > to do > bookkeeping while still honoring the interception request. 
> > However, I do think it would make sense to have the userspace exit be a > generic > exit type. But hey, we already have the necessary ABI defined for that! > It's > just not used anywhere. > >/* KVM_EXIT_HYPERCALL */ >struct { >__u64 nr; >__u64 args[6]; >__u64 ret; >__u32 longmode; >__u32 pad; >} hypercall; > > >>> Userspace could also handle the MSR using MSR filters (would need to >>> confirm that). Then userspace could also be in control of the cpuid >>> bit. > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > data. > The data limitation could be fudged by shoving data into non-standard > GPRs, but > that will result in truly heinous guest code, and extensibility issues. > > The data limitation is a moot point, because the x86-only thing is a deal > breaker. arm64's pKVM work has a near-identical use case for a guest to > share > memory with a host. I can't think of a clever way to avoid having to > support > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > multiple KVM variants. > Potentially, there is another reason for in-kernel hypercall handling considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state of each guest page, for instance pages in hypervisor state, i.e., pages with C=0 and pages in guest valid state with C=1. Now, there shouldn't be a need for page encryption status hypercalls on SEV-SNP as KVM can track & reference guest page status directly using the RMP table. >>> Relying on the RMP table itself would require locking the RMP table for an >>> extended duration, and walking the entire RMP to find shared pages would be >>> very inefficient. >>> As KVM maintains the RMP table, therefore we will need SET/GET type of interfaces to provide the guest page encryption status to userspace. >>> Hrm, somehow I temporarily forgot about SNP and TDX adding their own >>> hypercalls >>> for converting between shared and private. 
And in the case of TDX, the >>> hypercall >>> can't be trusted, i.e. is just a hint, otherwise the guest could induce a >>> #MC in >>> the host. >>> >>> But, the different guest behavior doesn't require KVM to maintain a >>> list/tree, >>> e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page >>> encryption >>> status changes would also suffice. >>> >>> Actually, that made me think of another argument against maintaining a list >>> in >>> KVM: there's no way to notify userspace that a page's status has changed. >>> Userspace would need to query KVM to do GET_LIST after every GET_DIRTY. >>> Obviously not a huge issue, but it does make migration slightly less efficient.
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Mon, Mar 08, 2021, Steve Rutherford wrote: > On Mon, Mar 8, 2021 at 1:11 PM Brijesh Singh wrote: > > On 3/8/21 1:51 PM, Sean Christopherson wrote: > > > If the guest does the hypercall after writing the page, then the guest is > > > hosed > > > if it gets migrated while writing the page (scenario #1): > > > > > > vCPU Userspace > > > zero_bytes[0:N] > > > > > shared> > > > > > > zero_bytes[N+1:4095] > > > set_shared (dest) > > > kaboom! > > > > > > Maybe I am missing something, this is not any different from a normal > > operation inside a guest. Making a page shared/private in the page table > > does not update the content of the page itself. In your above case, I > > assume zero_bytes[N+1:4095] are written by the destination VM. The > > memory region was private in the source VM page table, so, those writes > > will be performed encrypted. The destination VM later changed the memory > > to shared, but nobody wrote to the memory after it has been transitioned > > to the shared, so a reader of the memory should get ciphertext and > > unless there was a write after the set_shared (dest). Sorry, that wasn't clear, there's an implied page table update to make the page shared before zero_bytes. > > > If userspace does GET_DIRTY after GET_LIST, then the host would transfer > > > bad > > > data by consuming a stale list (scenario #2): > > > > > > vCPU Userspace > > > get_list (from KVM or internally) > > > set_shared (src) > > > zero_page (src) > > > get_dirty > > > > > > > > > kaboom! > > > > > > I don't remember how things are done in recent Ashish Qemu/KVM patches > > but in previous series, the get_dirty() happens before the querying the > > encrypted state. There was some logic in VMM to resync the encrypted > > bitmap during the final migration stage and perform any additional data > > transfer since last sync. 
It's likely that Ashish's patches did the correct thing, I just wanted to point out that both host and guest need to do everything in a very specific order. > > > If both guest and host order things to avoid #1 and #2, the host can still > > > migrate the wrong data (scenario #3): > > > > > > vCPU Userspace > > > set_private > > > zero_bytes[0:4096] > > > get_dirty > > > set_shared (src) > > > get_list > > > > > > > > > set_private (dest) > > > kaboom! > > > > > > Since there was no write to the memory after the set_shared (src), so > > the content of the page should not have changed. After the set_private > > (dest), the caller should be seeing the same content written by the > > zero_bytes[0:4096] > > I think Sean was going for the situation where the VM has moved to the > destination, which would have changed the VEK. That way the guest > would be decrypting the old ciphertext with the new (wrong) key. I think that's what I was saying? I was pointing out that the userspace VMM would see the page as "shared" and so read the guest memory directly instead of routing it through the PSP. Anyways, my intent wasn't to point out a known issue anywhere. I was working through how GET_LIST would interact with GET_DIRTY_LOG to convince myself that the various flows were bombproof. I wanted to record those thoughts so that I can remind myself of the requirements when I inevitably forget them in the future. > > > Scenario #3 is unlikely, but plausible, e.g. if the guest bails from its > > > conversion flow for whatever reason, after making the initial hypercall. > > > Maybe > > > it goes without saying, but to address #3, the guest must consider > > > existing data > > > as lost the instant it tells the host the page has been converted to a > > > different > > > type. 
> > > > > >> For the above reason if we do in-kernel hypercall handling for page > > >> encryption status (which we probably won't require for SEV-SNP & > > >> correspondingly there will be no hypercall exiting), > > > As above, that doesn't preclude KVM from exiting to userspace on > > > conversion. > > > > > >> then we can implement a standard GET/SET ioctl interface to get/set the > > >> guest > > >> page encryption status for userspace, which will work across SEV, SEV-ES > > >> and > > >> SEV-SNP.
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > [+Marc] > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > wrote: > > > > > Thanks for grabbing the data! > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > series where we implement something more generic, e.g. a hypercall > > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > interception. Or > > > rather, I think hypercall interception should be an orthogonal > > > implementation. > > > > > > The guest, including guest firmware, needs to be aware that the hypercall > > > is > > > supported, and the ABI needs to be well-defined. Relying on userspace > > > VMMs to > > > implement a common ABI is an unnecessary risk. > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > > ABI but > > > require further VMM intervention. But, I just don't see the point, it > > > would > > > save only a few lines of code. It would also limit what KVM could do in > > > the > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > > userspace, > > > then mandatory interception would essentially make it impossible for KVM > > > to do > > > bookkeeping while still honoring the interception request. 
> > > > > > However, I do think it would make sense to have the userspace exit be a > > > generic > > > exit type. But hey, we already have the necessary ABI defined for that! > > > It's > > > just not used anywhere. > > > > > > /* KVM_EXIT_HYPERCALL */ > > > struct { > > > __u64 nr; > > > __u64 args[6]; > > > __u64 ret; > > > __u32 longmode; > > > __u32 pad; > > > } hypercall; > > > Doc mentions to use KVM_EXIT_IO for x86 to implement "hypercall to userspace" functionality. Any reasons for not using KVM_EXIT_HYPERCALL interface any more, even though it is still there and not removed ? > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > > confirm that). Then userspace could also be in control of the cpuid > > > > > bit. > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > > data. > > > The data limitation could be fudged by shoving data into non-standard > > > GPRs, but > > > that will result in truly heinous guest code, and extensibility issues. > > > > > > The data limitation is a moot point, because the x86-only thing is a deal > > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > > share > > > memory with a host. I can't think of a clever way to avoid having to > > > support > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > > multiple KVM variants. > > > > Looking at arm64's pKVM work, i see that it is a recently introduced RFC > > patch-set and probably relevant to arm64 nVHE hypervisor > > mode/implementation, and potentially makes sense as it adds guest > > memory protection as both host and guest kernels are running on the same > > privilege level ? > > > > Though i do see that the pKVM stuff adds two hypercalls, specifically : > > > > pkvm_create_mappings() ( I assume this is for setting shared memory > > regions between host and guest) & > > pkvm_create_private_mappings(). 
> > > > And the use-cases are quite similar to memory protection architectues > > use cases, for example, use with virtio devices, guest DMA I/O, etc. > > These hypercalls are both private to the host kernel communicating with > its hypervisor counterpart, so I don't think they're particularly > relevant here. As far as I can see, the more useful thing is to allow > the guest to communicate back to the host (and the VMM) that it has opened > up a memory window, perhaps for virtio rings or some other shared memory. > > We hacked this up as a prototype in the past: > >
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Mon, Mar 8, 2021 at 1:11 PM Brijesh Singh wrote: > > > On 3/8/21 1:51 PM, Sean Christopherson wrote: > > On Mon, Mar 08, 2021, Ashish Kalra wrote: > >> On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > >>> +Will and Quentin (arm64) > >>> > >>> Moving the non-KVM x86 folks to bcc, I don't they care about KVM details > >>> at this > >>> point. > >>> > >>> On Fri, Feb 26, 2021, Ashish Kalra wrote: > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > wrote: > > Thanks for grabbing the data! > > > > I am fine with both paths. Sean has stated an explicit desire for > > hypercall exiting, so I think that would be the current consensus. > >>> Yep, though it'd be good to get Paolo's input, too. > >>> > > If we want to do hypercall exiting, this should be in a follow-up > > series where we implement something more generic, e.g. a hypercall > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > exit route, we can drop the kvm side of the hypercall. > >>> I don't think this is a good candidate for arbitrary hypercall > >>> interception. Or > >>> rather, I think hypercall interception should be an orthogonal > >>> implementation. > >>> > >>> The guest, including guest firmware, needs to be aware that the hypercall > >>> is > >>> supported, and the ABI needs to be well-defined. Relying on userspace > >>> VMMs to > >>> implement a common ABI is an unnecessary risk. > >>> > >>> We could make KVM's default behavior be a nop, i.e. have KVM enforce the > >>> ABI but > >>> require further VMM intervention. But, I just don't see the point, it > >>> would > >>> save only a few lines of code. It would also limit what KVM could do in > >>> the > >>> future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > >>> userspace, > >>> then mandatory interception would essentially make it impossible for KVM > >>> to do > >>> bookkeeping while still honoring the interception request. > >>> > >>> However, I do think it would make sense to have the userspace exit be a > >>> generic > >>> exit type. But hey, we already have the necessary ABI defined for that! > >>> It's > >>> just not used anywhere. > >>> > >>> /* KVM_EXIT_HYPERCALL */ > >>> struct { > >>> __u64 nr; > >>> __u64 args[6]; > >>> __u64 ret; > >>> __u32 longmode; > >>> __u32 pad; > >>> } hypercall; > >>> > >>> > > Userspace could also handle the MSR using MSR filters (would need to > > confirm that). Then userspace could also be in control of the cpuid > > bit. > >>> An MSR is not a great fit; it's x86 specific and limited to 64 bits of > >>> data. > >>> The data limitation could be fudged by shoving data into non-standard > >>> GPRs, but > >>> that will result in truly heinous guest code, and extensibility issues. > >>> > >>> The data limitation is a moot point, because the x86-only thing is a deal > >>> breaker. arm64's pKVM work has a near-identical use case for a guest to > >>> share > >>> memory with a host. I can't think of a clever way to avoid having to > >>> support > >>> TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > >>> multiple KVM variants. > >>> > >> Potentially, there is another reason for in-kernel hypercall handling > >> considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state > >> of each guest page, for instance pages in hypervisor state, i.e., pages > >> with C=0 and pages in guest valid state with C=1. > >> > >> Now, there shouldn't be a need for page encryption status hypercalls on > >> SEV-SNP as KVM can track & reference guest page status directly using > >> the RMP table. 
> > Relying on the RMP table itself would require locking the RMP table for an > > extended duration, and walking the entire RMP to find shared pages would be > > very inefficient. > > > >> As KVM maintains the RMP table, therefore we will need SET/GET type of > >> interfaces to provide the guest page encryption status to userspace. > > Hrm, somehow I temporarily forgot about SNP and TDX adding their own > > hypercalls > > for converting between shared and private. And in the case of TDX, the > > hypercall > > can't be trusted, i.e. is just a hint, otherwise the guest could induce a > > #MC in > > the host. > > > > But, the different guest behavior doesn't require KVM to maintain a > > list/tree, > > e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page > > encryption > > status changes would also suffice. > > > > Actually, that made me think of another argument against maintaining a list > > in > > KVM: there's no way to notify userspace that a page's status has changed. > > Userspace would need to query KVM to do GET_LIST after every GET_DIRTY. > > Obviously not a huge issue, but it does make migration slightly less > > efficient. > > > > On
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Mon, Mar 8, 2021 at 11:52 AM Sean Christopherson wrote: > > On Mon, Mar 08, 2021, Ashish Kalra wrote: > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > +Will and Quentin (arm64) > > > > > > Moving the non-KVM x86 folks to bcc, I don't they care about KVM details > > > at this > > > point. > > > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > wrote: > > > > > Thanks for grabbing the data! > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > series where we implement something more generic, e.g. a hypercall > > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > interception. Or > > > rather, I think hypercall interception should be an orthogonal > > > implementation. > > > > > > The guest, including guest firmware, needs to be aware that the hypercall > > > is > > > supported, and the ABI needs to be well-defined. Relying on userspace > > > VMMs to > > > implement a common ABI is an unnecessary risk. > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > > ABI but > > > require further VMM intervention. But, I just don't see the point, it > > > would > > > save only a few lines of code. It would also limit what KVM could do in > > > the > > > future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > > > userspace, > > > then mandatory interception would essentially make it impossible for KVM > > > to do > > > bookkeeping while still honoring the interception request. > > > > > > However, I do think it would make sense to have the userspace exit be a > > > generic > > > exit type. But hey, we already have the necessary ABI defined for that! > > > It's > > > just not used anywhere. > > > > > > /* KVM_EXIT_HYPERCALL */ > > > struct { > > > __u64 nr; > > > __u64 args[6]; > > > __u64 ret; > > > __u32 longmode; > > > __u32 pad; > > > } hypercall; > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > > confirm that). Then userspace could also be in control of the cpuid > > > > > bit. > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > > data. > > > The data limitation could be fudged by shoving data into non-standard > > > GPRs, but > > > that will result in truly heinous guest code, and extensibility issues. > > > > > > The data limitation is a moot point, because the x86-only thing is a deal > > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > > share > > > memory with a host. I can't think of a clever way to avoid having to > > > support > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > > multiple KVM variants. > > > > > > > Potentially, there is another reason for in-kernel hypercall handling > > considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state > > of each guest page, for instance pages in hypervisor state, i.e., pages > > with C=0 and pages in guest valid state with C=1. > > > > Now, there shouldn't be a need for page encryption status hypercalls on > > SEV-SNP as KVM can track & reference guest page status directly using > > the RMP table. 
> > Relying on the RMP table itself would require locking the RMP table for an > extended duration, and walking the entire RMP to find shared pages would be > very inefficient. > > > As KVM maintains the RMP table, therefore we will need SET/GET type of > > interfaces to provide the guest page encryption status to userspace. > > Hrm, somehow I temporarily forgot about SNP and TDX adding their own > hypercalls > for converting between shared and private. And in the case of TDX, the > hypercall > can't be trusted, i.e. is just a hint, otherwise the guest could induce a #MC > in > the host. > > But, the different guest behavior doesn't require KVM to maintain a list/tree, > e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page encryption > status changes would also suffice. > > Actually, that made me think of another argument against maintaining a list in > KVM: there's no way to notify userspace that a page's status has changed. > Userspace would need to query KVM to do GET_LIST after every GET_DIRTY. > Obviously not a huge issue, but it does make migration slightly less > efficient. > > On a related topic, there are fatal race conditions that will require careful > coordination between guest and host, and will effectively be wired into the ABI.
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Mon, Mar 08, 2021 at 03:11:41PM -0600, Brijesh Singh wrote: > > On 3/8/21 1:51 PM, Sean Christopherson wrote: > > On Mon, Mar 08, 2021, Ashish Kalra wrote: > >> On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > >>> +Will and Quentin (arm64) > >>> > >>> Moving the non-KVM x86 folks to bcc, I don't they care about KVM details > >>> at this > >>> point. > >>> > >>> On Fri, Feb 26, 2021, Ashish Kalra wrote: > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > wrote: > > Thanks for grabbing the data! > > > > I am fine with both paths. Sean has stated an explicit desire for > > hypercall exiting, so I think that would be the current consensus. > >>> Yep, though it'd be good to get Paolo's input, too. > >>> > > If we want to do hypercall exiting, this should be in a follow-up > > series where we implement something more generic, e.g. a hypercall > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > exit route, we can drop the kvm side of the hypercall. > >>> I don't think this is a good candidate for arbitrary hypercall > >>> interception. Or > >>> rather, I think hypercall interception should be an orthogonal > >>> implementation. > >>> > >>> The guest, including guest firmware, needs to be aware that the hypercall > >>> is > >>> supported, and the ABI needs to be well-defined. Relying on userspace > >>> VMMs to > >>> implement a common ABI is an unnecessary risk. > >>> > >>> We could make KVM's default behavior be a nop, i.e. have KVM enforce the > >>> ABI but > >>> require further VMM intervention. But, I just don't see the point, it > >>> would > >>> save only a few lines of code. It would also limit what KVM could do in > >>> the > >>> future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > >>> userspace, > >>> then mandatory interception would essentially make it impossible for KVM > >>> to do > >>> bookkeeping while still honoring the interception request. > >>> > >>> However, I do think it would make sense to have the userspace exit be a > >>> generic > >>> exit type. But hey, we already have the necessary ABI defined for that! > >>> It's > >>> just not used anywhere. > >>> > >>> /* KVM_EXIT_HYPERCALL */ > >>> struct { > >>> __u64 nr; > >>> __u64 args[6]; > >>> __u64 ret; > >>> __u32 longmode; > >>> __u32 pad; > >>> } hypercall; > >>> > >>> > > Userspace could also handle the MSR using MSR filters (would need to > > confirm that). Then userspace could also be in control of the cpuid > > bit. > >>> An MSR is not a great fit; it's x86 specific and limited to 64 bits of > >>> data. > >>> The data limitation could be fudged by shoving data into non-standard > >>> GPRs, but > >>> that will result in truly heinous guest code, and extensibility issues. > >>> > >>> The data limitation is a moot point, because the x86-only thing is a deal > >>> breaker. arm64's pKVM work has a near-identical use case for a guest to > >>> share > >>> memory with a host. I can't think of a clever way to avoid having to > >>> support > >>> TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > >>> multiple KVM variants. > >>> > >> Potentially, there is another reason for in-kernel hypercall handling > >> considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state > >> of each guest page, for instance pages in hypervisor state, i.e., pages > >> with C=0 and pages in guest valid state with C=1. > >> > >> Now, there shouldn't be a need for page encryption status hypercalls on > >> SEV-SNP as KVM can track & reference guest page status directly using > >> the RMP table. 
> > Relying on the RMP table itself would require locking the RMP table for an > > extended duration, and walking the entire RMP to find shared pages would be > > very inefficient. > > > >> As KVM maintains the RMP table, therefore we will need SET/GET type of > >> interfaces to provide the guest page encryption status to userspace. > > Hrm, somehow I temporarily forgot about SNP and TDX adding their own > > hypercalls > > for converting between shared and private. And in the case of TDX, the > > hypercall > > can't be trusted, i.e. is just a hint, otherwise the guest could induce a > > #MC in > > the host. > > > > But, the different guest behavior doesn't require KVM to maintain a > > list/tree, > > e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page > > encryption > > status changes would also suffice. > > > > Actually, that made me think of another argument against maintaining a list > > in > > KVM: there's no way to notify userspace that a page's status has changed. > > Userspace would need to query KVM to do GET_LIST after every GET_DIRTY. > > Obviously not a huge issue, but it does make migration slightly less > > efficient. > > > > On a
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On 3/8/21 1:51 PM, Sean Christopherson wrote: > On Mon, Mar 08, 2021, Ashish Kalra wrote: >> On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: >>> +Will and Quentin (arm64) >>> >>> Moving the non-KVM x86 folks to bcc, I don't they care about KVM details at >>> this >>> point. >>> >>> On Fri, Feb 26, 2021, Ashish Kalra wrote: On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > wrote: > Thanks for grabbing the data! > > I am fine with both paths. Sean has stated an explicit desire for > hypercall exiting, so I think that would be the current consensus. >>> Yep, though it'd be good to get Paolo's input, too. >>> > If we want to do hypercall exiting, this should be in a follow-up > series where we implement something more generic, e.g. a hypercall > exiting bitmap or hypercall exit list. If we are taking the hypercall > exit route, we can drop the kvm side of the hypercall. >>> I don't think this is a good candidate for arbitrary hypercall >>> interception. Or >>> rather, I think hypercall interception should be an orthogonal >>> implementation. >>> >>> The guest, including guest firmware, needs to be aware that the hypercall is >>> supported, and the ABI needs to be well-defined. Relying on userspace VMMs >>> to >>> implement a common ABI is an unnecessary risk. >>> >>> We could make KVM's default behavior be a nop, i.e. have KVM enforce the >>> ABI but >>> require further VMM intervention. But, I just don't see the point, it would >>> save only a few lines of code. It would also limit what KVM could do in the >>> future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to >>> userspace, >>> then mandatory interception would essentially make it impossible for KVM to >>> do >>> bookkeeping while still honoring the interception request. >>> >>> However, I do think it would make sense to have the userspace exit be a >>> generic >>> exit type. 
But hey, we already have the necessary ABI defined for that! >>> It's >>> just not used anywhere. >>> >>> /* KVM_EXIT_HYPERCALL */ >>> struct { >>> __u64 nr; >>> __u64 args[6]; >>> __u64 ret; >>> __u32 longmode; >>> __u32 pad; >>> } hypercall; >>> >>> > Userspace could also handle the MSR using MSR filters (would need to > confirm that). Then userspace could also be in control of the cpuid bit. >>> An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. >>> The data limitation could be fudged by shoving data into non-standard GPRs, >>> but >>> that will result in truly heinous guest code, and extensibility issues. >>> >>> The data limitation is a moot point, because the x86-only thing is a deal >>> breaker. arm64's pKVM work has a near-identical use case for a guest to >>> share >>> memory with a host. I can't think of a clever way to avoid having to >>> support >>> TDX's and SNP's hypervisor-agnostic variants, but we can at least not have >>> multiple KVM variants. >>> >> Potentially, there is another reason for in-kernel hypercall handling >> considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state >> of each guest page, for instance pages in hypervisor state, i.e., pages >> with C=0 and pages in guest valid state with C=1. >> >> Now, there shouldn't be a need for page encryption status hypercalls on >> SEV-SNP as KVM can track & reference guest page status directly using >> the RMP table. > Relying on the RMP table itself would require locking the RMP table for an > extended duration, and walking the entire RMP to find shared pages would be > very inefficient. > >> As KVM maintains the RMP table, therefore we will need SET/GET type of >> interfaces to provide the guest page encryption status to userspace. > Hrm, somehow I temporarily forgot about SNP and TDX adding their own > hypercalls > for converting between shared and private. And in the case of TDX, the > hypercall > can't be trusted, i.e. 
is just a hint, otherwise the guest could induce a #MC > in > the host. > > But, the different guest behavior doesn't require KVM to maintain a list/tree, > e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page encryption > status changes would also suffice. > > Actually, that made me think of another argument against maintaining a list in > KVM: there's no way to notify userspace that a page's status has changed. > Userspace would need to query KVM to do GET_LIST after every GET_DIRTY. > Obviously not a huge issue, but it does make migration slightly less > efficient. > > On a related topic, there are fatal race conditions that will require careful > coordination between guest and host, and will effectively be wired into the > ABI. > SNP and TDX don't suffer these issues because host awareness of status is > atomic > with respect to the guest actually writing the page with the new
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Mon, Mar 08, 2021 at 11:51:57AM -0800, Sean Christopherson wrote: > On Mon, Mar 08, 2021, Ashish Kalra wrote: > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > +Will and Quentin (arm64) > > > > > > Moving the non-KVM x86 folks to bcc, I don't they care about KVM details > > > at this > > > point. > > > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > wrote: > > > > > Thanks for grabbing the data! > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > series where we implement something more generic, e.g. a hypercall > > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > interception. Or > > > rather, I think hypercall interception should be an orthogonal > > > implementation. > > > > > > The guest, including guest firmware, needs to be aware that the hypercall > > > is > > > supported, and the ABI needs to be well-defined. Relying on userspace > > > VMMs to > > > implement a common ABI is an unnecessary risk. > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > > ABI but > > > require further VMM intervention. But, I just don't see the point, it > > > would > > > save only a few lines of code. It would also limit what KVM could do in > > > the > > > future, e.g. 
if KVM wanted to do its own bookkeeping _and_ exit to > > > userspace, > > > then mandatory interception would essentially make it impossible for KVM > > > to do > > > bookkeeping while still honoring the interception request. > > > > > > However, I do think it would make sense to have the userspace exit be a > > > generic > > > exit type. But hey, we already have the necessary ABI defined for that! > > > It's > > > just not used anywhere. > > > > > > /* KVM_EXIT_HYPERCALL */ > > > struct { > > > __u64 nr; > > > __u64 args[6]; > > > __u64 ret; > > > __u32 longmode; > > > __u32 pad; > > > } hypercall; > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > > confirm that). Then userspace could also be in control of the cpuid > > > > > bit. > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > > data. > > > The data limitation could be fudged by shoving data into non-standard > > > GPRs, but > > > that will result in truly heinous guest code, and extensibility issues. > > > > > > The data limitation is a moot point, because the x86-only thing is a deal > > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > > share > > > memory with a host. I can't think of a clever way to avoid having to > > > support > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > > multiple KVM variants. > > > > > > > Potentially, there is another reason for in-kernel hypercall handling > > considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state > > of each guest page, for instance pages in hypervisor state, i.e., pages > > with C=0 and pages in guest valid state with C=1. > > > > Now, there shouldn't be a need for page encryption status hypercalls on > > SEV-SNP as KVM can track & reference guest page status directly using > > the RMP table. 
> > Relying on the RMP table itself would require locking the RMP table for an > extended duration, and walking the entire RMP to find shared pages would be > very inefficient. > > > As KVM maintains the RMP table, therefore we will need SET/GET type of > > interfaces to provide the guest page encryption status to userspace. > > Hrm, somehow I temporarily forgot about SNP and TDX adding their own > hypercalls > for converting between shared and private. And in the case of TDX, the > hypercall > can't be trusted, i.e. is just a hint, otherwise the guest could induce a #MC > in > the host. One question here: is this because, if the hypercall could cause direct modifications to the shared EPT, it could induce a #MC in the host? Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Mon, Mar 08, 2021, Ashish Kalra wrote: > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > +Will and Quentin (arm64) > > > > Moving the non-KVM x86 folks to bcc, I don't they care about KVM details at > > this > > point. > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > wrote: > > > > Thanks for grabbing the data! > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > hypercall exiting, so I think that would be the current consensus. > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > series where we implement something more generic, e.g. a hypercall > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > exit route, we can drop the kvm side of the hypercall. > > > > I don't think this is a good candidate for arbitrary hypercall > > interception. Or > > rather, I think hypercall interception should be an orthogonal > > implementation. > > > > The guest, including guest firmware, needs to be aware that the hypercall is > > supported, and the ABI needs to be well-defined. Relying on userspace VMMs > > to > > implement a common ABI is an unnecessary risk. > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > ABI but > > require further VMM intervention. But, I just don't see the point, it would > > save only a few lines of code. It would also limit what KVM could do in the > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > userspace, > > then mandatory interception would essentially make it impossible for KVM to > > do > > bookkeeping while still honoring the interception request. > > > > However, I do think it would make sense to have the userspace exit be a > > generic > > exit type. 
But hey, we already have the necessary ABI defined for that! > > It's > > just not used anywhere. > > > > /* KVM_EXIT_HYPERCALL */ > > struct { > > __u64 nr; > > __u64 args[6]; > > __u64 ret; > > __u32 longmode; > > __u32 pad; > > } hypercall; > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > confirm that). Then userspace could also be in control of the cpuid > > > > bit. > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. > > The data limitation could be fudged by shoving data into non-standard GPRs, > > but > > that will result in truly heinous guest code, and extensibility issues. > > > > The data limitation is a moot point, because the x86-only thing is a deal > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > share > > memory with a host. I can't think of a clever way to avoid having to > > support > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > multiple KVM variants. > > > > Potentially, there is another reason for in-kernel hypercall handling > considering SEV-SNP. In case of SEV-SNP the RMP table tracks the state > of each guest page, for instance pages in hypervisor state, i.e., pages > with C=0 and pages in guest valid state with C=1. > > Now, there shouldn't be a need for page encryption status hypercalls on > SEV-SNP as KVM can track & reference guest page status directly using > the RMP table. Relying on the RMP table itself would require locking the RMP table for an extended duration, and walking the entire RMP to find shared pages would be very inefficient. > As KVM maintains the RMP table, therefore we will need SET/GET type of > interfaces to provide the guest page encryption status to userspace. Hrm, somehow I temporarily forgot about SNP and TDX adding their own hypercalls for converting between shared and private. And in the case of TDX, the hypercall can't be trusted, i.e. 
is just a hint, otherwise the guest could induce a #MC in the host. But, the different guest behavior doesn't require KVM to maintain a list/tree, e.g. adding a dedicated KVM_EXIT_* for notifying userspace of page encryption status changes would also suffice. Actually, that made me think of another argument against maintaining a list in KVM: there's no way to notify userspace that a page's status has changed. Userspace would need to query KVM to do GET_LIST after every GET_DIRTY. Obviously not a huge issue, but it does make migration slightly less efficient. On a related topic, there are fatal race conditions that will require careful coordination between guest and host, and will effectively be wired into the ABI. SNP and TDX don't suffer these issues because host awareness of status is atomic with respect to the guest actually writing the page with the new encryption status. For SEV live migration... If the
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > +Will and Quentin (arm64) > > Moving the non-KVM x86 folks to bcc, I don't they care about KVM details at > this > point. > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > wrote: > > > Thanks for grabbing the data! > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > hypercall exiting, so I think that would be the current consensus. > > Yep, though it'd be good to get Paolo's input, too. > > > > If we want to do hypercall exiting, this should be in a follow-up > > > series where we implement something more generic, e.g. a hypercall > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > exit route, we can drop the kvm side of the hypercall. > > I don't think this is a good candidate for arbitrary hypercall interception. > Or > rather, I think hypercall interception should be an orthogonal implementation. > > The guest, including guest firmware, needs to be aware that the hypercall is > supported, and the ABI needs to be well-defined. Relying on userspace VMMs to > implement a common ABI is an unnecessary risk. > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the ABI > but > require further VMM intervention. But, I just don't see the point, it would > save only a few lines of code. It would also limit what KVM could do in the > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to userspace, > then mandatory interception would essentially make it impossible for KVM to do > bookkeeping while still honoring the interception request. > > However, I do think it would make sense to have the userspace exit be a > generic > exit type. But hey, we already have the necessary ABI defined for that! It's > just not used anywhere. 
> 
> /* KVM_EXIT_HYPERCALL */
> struct {
> 	__u64 nr;
> 	__u64 args[6];
> 	__u64 ret;
> 	__u32 longmode;
> 	__u32 pad;
> } hypercall;
> 
> 
> > > Userspace could also handle the MSR using MSR filters (would need to > > > confirm that). Then userspace could also be in control of the cpuid bit. > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. > The data limitation could be fudged by shoving data into non-standard GPRs, > but > that will result in truly heinous guest code, and extensibility issues. > > The data limitation is a moot point, because the x86-only thing is a deal > breaker. arm64's pKVM work has a near-identical use case for a guest to share > memory with a host. I can't think of a clever way to avoid having to support > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > multiple KVM variants. > 

Potentially, there is another reason for in-kernel hypercall handling, considering SEV-SNP. In the case of SEV-SNP, the RMP table tracks the state of each guest page, for instance pages in hypervisor state, i.e., pages with C=0, and pages in guest-valid state with C=1. Now, there shouldn't be a need for page encryption status hypercalls on SEV-SNP, as KVM can track & reference guest page status directly using the RMP table. As KVM maintains the RMP table, we will need SET/GET type interfaces to provide the guest page encryption status to userspace. For the above reason, if we do in-kernel hypercall handling for page encryption status (which we probably won't require for SEV-SNP, and correspondingly there will be no hypercall exiting), then we can implement a standard GET/SET ioctl interface to get/set the guest page encryption status for userspace, which will work across SEV, SEV-ES and SEV-SNP. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Wed, Mar 03, 2021 at 06:54:41PM +, Will Deacon wrote: > [+Marc] > > On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > > wrote: > > > > > Thanks for grabbing the data! > > > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > > hypercall exiting, so I think that would be the current consensus. > > > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > > series where we implement something more generic, e.g. a hypercall > > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > > exit route, we can drop the kvm side of the hypercall. > > > > > > I don't think this is a good candidate for arbitrary hypercall > > > interception. Or > > > rather, I think hypercall interception should be an orthogonal > > > implementation. > > > > > > The guest, including guest firmware, needs to be aware that the hypercall > > > is > > > supported, and the ABI needs to be well-defined. Relying on userspace > > > VMMs to > > > implement a common ABI is an unnecessary risk. > > > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > > ABI but > > > require further VMM intervention. But, I just don't see the point, it > > > would > > > save only a few lines of code. It would also limit what KVM could do in > > > the > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > > userspace, > > > then mandatory interception would essentially make it impossible for KVM > > > to do > > > bookkeeping while still honoring the interception request. 
> > > > > > However, I do think it would make sense to have the userspace exit be a > > > generic > > > exit type. But hey, we already have the necessary ABI defined for that! > > > It's > > > just not used anywhere. > > > > > > /* KVM_EXIT_HYPERCALL */ > > > struct { > > > __u64 nr; > > > __u64 args[6]; > > > __u64 ret; > > > __u32 longmode; > > > __u32 pad; > > > } hypercall; > > > > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > > confirm that). Then userspace could also be in control of the cpuid > > > > > bit. > > > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of > > > data. > > > The data limitation could be fudged by shoving data into non-standard > > > GPRs, but > > > that will result in truly heinous guest code, and extensibility issues. > > > > > > The data limitation is a moot point, because the x86-only thing is a deal > > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > > share > > > memory with a host. I can't think of a clever way to avoid having to > > > support > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > > multiple KVM variants. > > > > Looking at arm64's pKVM work, i see that it is a recently introduced RFC > > patch-set and probably relevant to arm64 nVHE hypervisor > > mode/implementation, and potentially makes sense as it adds guest > > memory protection as both host and guest kernels are running on the same > > privilege level ? > > > > Though i do see that the pKVM stuff adds two hypercalls, specifically : > > > > pkvm_create_mappings() ( I assume this is for setting shared memory > > regions between host and guest) & > > pkvm_create_private_mappings(). > > > > And the use-cases are quite similar to memory protection architectues > > use cases, for example, use with virtio devices, guest DMA I/O, etc. 
> > These hypercalls are both private to the host kernel communicating with > its hypervisor counterpart, so I don't think they're particularly > relevant here. Yes, I have the same thoughts here, as this looked like a private hypercall interface between the host kernel and the hypervisor, rather than between the guest and the host/hypervisor. > As far as I can see, the more useful thing is to allow > the guest to communicate back to the host (and the VMM) that it has opened > up a memory window, perhaps for virtio rings or some other shared memory. > Yes, this is our main use case. > We hacked this up as a prototype in the past: > >
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
[+Marc] On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > wrote: > > > > Thanks for grabbing the data! > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > hypercall exiting, so I think that would be the current consensus. > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > series where we implement something more generic, e.g. a hypercall > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > exit route, we can drop the kvm side of the hypercall. > > > > I don't think this is a good candidate for arbitrary hypercall > > interception. Or > > rather, I think hypercall interception should be an orthogonal > > implementation. > > > > The guest, including guest firmware, needs to be aware that the hypercall is > > supported, and the ABI needs to be well-defined. Relying on userspace VMMs > > to > > implement a common ABI is an unnecessary risk. > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > ABI but > > require further VMM intervention. But, I just don't see the point, it would > > save only a few lines of code. It would also limit what KVM could do in the > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > userspace, > > then mandatory interception would essentially make it impossible for KVM to > > do > > bookkeeping while still honoring the interception request. > > > > However, I do think it would make sense to have the userspace exit be a > > generic > > exit type. But hey, we already have the necessary ABI defined for that! > > It's > > just not used anywhere. 
> > > > /* KVM_EXIT_HYPERCALL */ > > struct { > > __u64 nr; > > __u64 args[6]; > > __u64 ret; > > __u32 longmode; > > __u32 pad; > > } hypercall; > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > confirm that). Then userspace could also be in control of the cpuid > > > > bit. > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. > > The data limitation could be fudged by shoving data into non-standard GPRs, > > but > > that will result in truly heinous guest code, and extensibility issues. > > > > The data limitation is a moot point, because the x86-only thing is a deal > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > share > > memory with a host. I can't think of a clever way to avoid having to > > support > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > multiple KVM variants. > > Looking at arm64's pKVM work, i see that it is a recently introduced RFC > patch-set and probably relevant to arm64 nVHE hypervisor > mode/implementation, and potentially makes sense as it adds guest > memory protection as both host and guest kernels are running on the same > privilege level ? > > Though i do see that the pKVM stuff adds two hypercalls, specifically : > > pkvm_create_mappings() ( I assume this is for setting shared memory > regions between host and guest) & > pkvm_create_private_mappings(). > > And the use-cases are quite similar to memory protection architectues > use cases, for example, use with virtio devices, guest DMA I/O, etc. These hypercalls are both private to the host kernel communicating with its hypervisor counterpart, so I don't think they're particularly relevant here. As far as I can see, the more useful thing is to allow the guest to communicate back to the host (and the VMM) that it has opened up a memory window, perhaps for virtio rings or some other shared memory. 
We hacked this up as a prototype in the past: https://android-kvm.googlesource.com/linux/+/d12a9e2c12a52cf7140d40cd9fa092dc8a85fac9%5E%21/ but that's all arm64-specific and if we're solving the same problem as you, then let's avoid arch-specific stuff if possible. The way in which the guest discovers the interface will be arch-specific (we already have a mechanism for that and some hypercalls are already allocated by specs from Arm), but the interface back to the VMM and some (most?) of the host handling could be shared. > But, isn't this patch set still RFC, and though i agree that it adds > an infrastructure for standardised communication between the host and > it's guests for mutually controlled shared memory regions and > surely adds some kind of portability between hypervisor > implementations, but nothing is standardised still, right ? No, and it seems that you're further ahead than us in terms of implementation in this area. We're happy to
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > +Will and Quentin (arm64) > > Moving the non-KVM x86 folks to bcc, I don't they care about KVM details at > this > point. > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > wrote: > > > Thanks for grabbing the data! > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > hypercall exiting, so I think that would be the current consensus. > > Yep, though it'd be good to get Paolo's input, too. > > > > If we want to do hypercall exiting, this should be in a follow-up > > > series where we implement something more generic, e.g. a hypercall > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > exit route, we can drop the kvm side of the hypercall. > > I don't think this is a good candidate for arbitrary hypercall interception. > Or > rather, I think hypercall interception should be an orthogonal implementation. > > The guest, including guest firmware, needs to be aware that the hypercall is > supported, and the ABI needs to be well-defined. Relying on userspace VMMs to > implement a common ABI is an unnecessary risk. > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the ABI > but > require further VMM intervention. But, I just don't see the point, it would > save only a few lines of code. It would also limit what KVM could do in the > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to userspace, > then mandatory interception would essentially make it impossible for KVM to do > bookkeeping while still honoring the interception request. > > However, I do think it would make sense to have the userspace exit be a > generic > exit type. But hey, we already have the necessary ABI defined for that! It's > just not used anywhere. 
> 
> /* KVM_EXIT_HYPERCALL */
> struct {
> 	__u64 nr;
> 	__u64 args[6];
> 	__u64 ret;
> 	__u32 longmode;
> 	__u32 pad;
> } hypercall;
> 
> 
> > > Userspace could also handle the MSR using MSR filters (would need to > > > confirm that). Then userspace could also be in control of the cpuid bit. > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. > The data limitation could be fudged by shoving data into non-standard GPRs, > but > that will result in truly heinous guest code, and extensibility issues. > > The data limitation is a moot point, because the x86-only thing is a deal > breaker. arm64's pKVM work has a near-identical use case for a guest to share > memory with a host. I can't think of a clever way to avoid having to support > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > multiple KVM variants. 

Looking at arm64's pKVM work, I see that it is a recently introduced RFC patch-set, probably relevant to the arm64 nVHE hypervisor mode/implementation, and it potentially makes sense as it adds guest memory protection where both host and guest kernels run at the same privilege level? Though I do see that the pKVM stuff adds two hypercalls, specifically: pkvm_create_mappings() (I assume this is for setting up shared memory regions between host and guest) and pkvm_create_private_mappings(). And the use-cases are quite similar to memory protection architectures' use cases, for example, use with virtio devices, guest DMA I/O, etc. But isn't this patch set still an RFC? And though I agree that it adds infrastructure for standardised communication between the host and its guests for mutually controlled shared memory regions, and surely adds some portability between hypervisor implementations, nothing is standardised yet, right? Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Tue, Mar 02, 2021 at 02:55:43PM +, Ashish Kalra wrote: > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote: > > +Will and Quentin (arm64) > > > > Moving the non-KVM x86 folks to bcc, I don't they care about KVM details at > > this > > point. > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote: > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra > > > > wrote: > > > > Thanks for grabbing the data! > > > > > > > > I am fine with both paths. Sean has stated an explicit desire for > > > > hypercall exiting, so I think that would be the current consensus. > > > > Yep, though it'd be good to get Paolo's input, too. > > > > > > If we want to do hypercall exiting, this should be in a follow-up > > > > series where we implement something more generic, e.g. a hypercall > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > > > exit route, we can drop the kvm side of the hypercall. > > > > I don't think this is a good candidate for arbitrary hypercall > > interception. Or > > rather, I think hypercall interception should be an orthogonal > > implementation. > > > > The guest, including guest firmware, needs to be aware that the hypercall is > > supported, and the ABI needs to be well-defined. Relying on userspace VMMs > > to > > implement a common ABI is an unnecessary risk. > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the > > ABI but > > require further VMM intervention. But, I just don't see the point, it would > > save only a few lines of code. It would also limit what KVM could do in the > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to > > userspace, > > then mandatory interception would essentially make it impossible for KVM to > > do > > bookkeeping while still honoring the interception request. > > > > However, I do think it would make sense to have the userspace exit be a > > generic > > exit type. 
But hey, we already have the necessary ABI defined for that! > > It's > > just not used anywhere. > > > > /* KVM_EXIT_HYPERCALL */ > > struct { > > __u64 nr; > > __u64 args[6]; > > __u64 ret; > > __u32 longmode; > > __u32 pad; > > } hypercall; > > > > > > > > Userspace could also handle the MSR using MSR filters (would need to > > > > confirm that). Then userspace could also be in control of the cpuid > > > > bit. > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. > > The data limitation could be fudged by shoving data into non-standard GPRs, > > but > > that will result in truly heinous guest code, and extensibility issues. > > > > The data limitation is a moot point, because the x86-only thing is a deal > > breaker. arm64's pKVM work has a near-identical use case for a guest to > > share > > memory with a host. I can't think of a clever way to avoid having to > > support > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have > > multiple KVM variants. > > Looking at arm64's pKVM work, i see that it is a recently introduced RFC > patch-set and probably relevant to arm64 nVHE hypervisor > mode/implementation, and potentially makes sense as it adds guest > memory protection as both host and guest kernels are running on the same > privilege level ? > > Though i do see that the pKVM stuff adds two hypercalls, specifically : > > pkvm_create_mappings() ( I assume this is for setting shared memory > regions between host and guest) & > pkvm_create_private_mappings(). > > And the use-cases are quite similar to memory protection architectues > use cases, for example, use with virtio devices, guest DMA I/O, etc. 
> > But, isn't this patch set still RFC, and though i agree that it adds > an infrastructure for standardised communication between the host and > it's guests for mutually controlled shared memory regions and > surely adds some kind of portability between hypervisor > implementations, but nothing is standardised still, right ? > And to add here: the hypercall implementation is in-HYP mode; there is no infrastructure as part of this patch-set to do hypercall exiting and handle it in user-space. Though arguably, we may be able to add a hypercall exiting code path to the amd64 implementation for the same hypercall interfaces? Alternatively, we implement this in-kernel and then add SET/GET ioctl interfaces to export the shared pages/regions list to user-space. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
+Will and Quentin (arm64) Moving the non-KVM x86 folks to bcc, I don't think they care about KVM details at this point. On Fri, Feb 26, 2021, Ashish Kalra wrote: > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra wrote: > > Thanks for grabbing the data! > > > > I am fine with both paths. Sean has stated an explicit desire for > > hypercall exiting, so I think that would be the current consensus. Yep, though it'd be good to get Paolo's input, too. > > If we want to do hypercall exiting, this should be in a follow-up > > series where we implement something more generic, e.g. a hypercall > > exiting bitmap or hypercall exit list. If we are taking the hypercall > > exit route, we can drop the kvm side of the hypercall. I don't think this is a good candidate for arbitrary hypercall interception. Or rather, I think hypercall interception should be an orthogonal implementation. The guest, including guest firmware, needs to be aware that the hypercall is supported, and the ABI needs to be well-defined. Relying on userspace VMMs to implement a common ABI is an unnecessary risk. We could make KVM's default behavior be a nop, i.e. have KVM enforce the ABI but require further VMM intervention. But, I just don't see the point, it would save only a few lines of code. It would also limit what KVM could do in the future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to userspace, then mandatory interception would essentially make it impossible for KVM to do bookkeeping while still honoring the interception request. However, I do think it would make sense to have the userspace exit be a generic exit type. But hey, we already have the necessary ABI defined for that! It's just not used anywhere.

/* KVM_EXIT_HYPERCALL */
struct {
	__u64 nr;
	__u64 args[6];
	__u64 ret;
	__u32 longmode;
	__u32 pad;
} hypercall;

> > Userspace could also handle the MSR using MSR filters (would need to > > confirm that). 
Then userspace could also be in control of the cpuid bit. An MSR is not a great fit; it's x86 specific and limited to 64 bits of data. The data limitation could be fudged by shoving data into non-standard GPRs, but that will result in truly heinous guest code, and extensibility issues. The data limitation is a moot point, because the x86-only thing is a deal breaker. arm64's pKVM work has a near-identical use case for a guest to share memory with a host. I can't think of a clever way to avoid having to support TDX's and SNP's hypervisor-agnostic variants, but we can at least not have multiple KVM variants. > > Essentially, I think you could drop most of the host kernel work if > > there were generic support for hypercall exiting. Then userspace would > > be responsible for all of that. Thoughts on this? > So if i understand it correctly, i will submitting v11 of this patch-set > with in-kernel support for page encryption status hypercalls and shared > pages list and the userspace control of SEV live migration feature > support and fixes for MSR handling. At this point, I'd say hold off on putting more effort into an implementation until we have consensus. > In subsequent follow-up patches we will add generic support for hypercall > exiting and then drop kvm side of hypercall and also add userspace > support for MSR handling. > > Thanks, > Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
Hello Steve, On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote: > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra wrote: > > > > On Wed, Feb 24, 2021 at 10:22:33AM -0800, Sean Christopherson wrote: > > > On Wed, Feb 24, 2021, Ashish Kalra wrote: > > > > # Samples: 19K of event 'kvm:kvm_hypercall' > > > > # Event count (approx.): 19573 > > > > # > > > > # Overhead Command Shared Object Symbol > > > > # ... . > > > > # > > > >100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_emulate_hypercall > > > > > > > > Out of these 19573 hypercalls, # of page encryption status hcalls are > > > > 19479, > > > > so almost all hypercalls here are page encryption status hypercalls. > > > > > > Oof. > > > > > > > The above data indicates that there will be ~2% more Heavyweight VMEXITs > > > > during SEV guest boot if we do page encryption status hypercalls > > > > pass-through to host userspace. > > > > > > > > But, then Brijesh pointed out to me and highlighted that currently > > > > OVMF is doing lot of VMEXITs because they don't use the DMA pool to > > > > minimize the C-bit toggles, > > > > in other words, OVMF bounce buffer does page state change on every DMA > > > > allocate and free. > > > > > > > > So here is the performance analysis after kernel and initrd have been > > > > loaded into memory using grub and then starting perf just before > > > > booting the kernel. 
> > > > > > > > These are the performance #'s after kernel and initrd have been loaded > > > > into memory, > > > > then perf is attached and kernel is booted : > > > > > > > > # Samples: 1M of event 'kvm:kvm_userspace_exit' > > > > # Event count (approx.): 1081235 > > > > # > > > > # Overhead Trace output > > > > # > > > > # > > > > 99.77% reason KVM_EXIT_IO (2) > > > > 0.23% reason KVM_EXIT_MMIO (6) > > > > > > > > # Samples: 1K of event 'kvm:kvm_hypercall' > > > > # Event count (approx.): 1279 > > > > # > > > > > > > > So as the above data indicates, Linux is only making ~1K hypercalls, > > > > compared to ~18K hypercalls made by OVMF in the above use case. > > > > > > > > Does the above adds a prerequisite that OVMF needs to be optimized if > > > > and before hypercall pass-through can be done ? > > > > > > Disclaimer: my math could be totally wrong. > > > > > > I doubt it's a hard requirement. Assuming a conversative roundtrip time > > > of 50k > > > cycles, those 18K hypercalls will add well under a 1/2 a second of boot > > > time. > > > If userspace can push the roundtrip time down to 10k cycles, the overhead > > > is > > > more like 50 milliseconds. > > > > > > That being said, this does seem like a good OVMF cleanup, irrespective of > > > this > > > new hypercall. I assume it's not cheap to convert a page between > > > encrypted and > > > decrypted. > > > > > > Thanks much for getting the numbers! > > > > Considering the above data and guest boot time latencies > > (and potential issues with OVMF and optimizations required there), > > do we have any consensus on whether we want to do page encryption > > status hypercall passthrough or not ? > > > > Thanks, > > Ashish > > Thanks for grabbing the data! > > I am fine with both paths. Sean has stated an explicit desire for > hypercall exiting, so I think that would be the current consensus. 
> > If we want to do hypercall exiting, this should be in a follow-up > series where we implement something more generic, e.g. a hypercall > exiting bitmap or hypercall exit list. If we are taking the hypercall > exit route, we can drop the kvm side of the hypercall. Userspace could > also handle the MSR using MSR filters (would need to confirm that). > Then userspace could also be in control of the cpuid bit. > > Essentially, I think you could drop most of the host kernel work if > there were generic support for hypercall exiting. Then userspace would > be responsible for all of that. Thoughts on this? So if I understand it correctly, I will be submitting v11 of this patch-set with in-kernel support for page encryption status hypercalls and shared pages list, userspace control of the SEV live migration feature support, and fixes for MSR handling. In subsequent follow-up patches we will add generic support for hypercall exiting, then drop the KVM side of the hypercall, and also add userspace support for MSR handling. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 25, 2021 at 2:59 PM Steve Rutherford wrote: > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra wrote: > > > > On Wed, Feb 24, 2021 at 10:22:33AM -0800, Sean Christopherson wrote: > > > On Wed, Feb 24, 2021, Ashish Kalra wrote: > > > > # Samples: 19K of event 'kvm:kvm_hypercall' > > > > # Event count (approx.): 19573 > > > > # > > > > # Overhead Command Shared Object Symbol > > > > # ... . > > > > # > > > >100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_emulate_hypercall > > > > > > > > Out of these 19573 hypercalls, # of page encryption status hcalls are > > > > 19479, > > > > so almost all hypercalls here are page encryption status hypercalls. > > > > > > Oof. > > > > > > > The above data indicates that there will be ~2% more Heavyweight VMEXITs > > > > during SEV guest boot if we do page encryption status hypercalls > > > > pass-through to host userspace. > > > > > > > > But, then Brijesh pointed out to me and highlighted that currently > > > > OVMF is doing lot of VMEXITs because they don't use the DMA pool to > > > > minimize the C-bit toggles, > > > > in other words, OVMF bounce buffer does page state change on every DMA > > > > allocate and free. > > > > > > > > So here is the performance analysis after kernel and initrd have been > > > > loaded into memory using grub and then starting perf just before > > > > booting the kernel. 
> > > > > > > > These are the performance #'s after kernel and initrd have been loaded > > > > into memory, > > > > then perf is attached and kernel is booted : > > > > > > > > # Samples: 1M of event 'kvm:kvm_userspace_exit' > > > > # Event count (approx.): 1081235 > > > > # > > > > # Overhead Trace output > > > > # > > > > # > > > > 99.77% reason KVM_EXIT_IO (2) > > > > 0.23% reason KVM_EXIT_MMIO (6) > > > > > > > > # Samples: 1K of event 'kvm:kvm_hypercall' > > > > # Event count (approx.): 1279 > > > > # > > > > > > > > So as the above data indicates, Linux is only making ~1K hypercalls, > > > > compared to ~18K hypercalls made by OVMF in the above use case. > > > > > > > > Does the above adds a prerequisite that OVMF needs to be optimized if > > > > and before hypercall pass-through can be done ? > > > > > > Disclaimer: my math could be totally wrong. > > > > > > I doubt it's a hard requirement. Assuming a conversative roundtrip time > > > of 50k > > > cycles, those 18K hypercalls will add well under a 1/2 a second of boot > > > time. > > > If userspace can push the roundtrip time down to 10k cycles, the overhead > > > is > > > more like 50 milliseconds. > > > > > > That being said, this does seem like a good OVMF cleanup, irrespective of > > > this > > > new hypercall. I assume it's not cheap to convert a page between > > > encrypted and > > > decrypted. > > > > > > Thanks much for getting the numbers! > > > > Considering the above data and guest boot time latencies > > (and potential issues with OVMF and optimizations required there), > > do we have any consensus on whether we want to do page encryption > > status hypercall passthrough or not ? > > > > Thanks, > > Ashish > > Thanks for grabbing the data! > > I am fine with both paths. Sean has stated an explicit desire for > hypercall exiting, so I think that would be the current consensus. 
> > If we want to do hypercall exiting, this should be in a follow-up > series where we implement something more generic, e.g. a hypercall > exiting bitmap or hypercall exit list. If we are taking the hypercall > exit route, we can drop the kvm side of the hypercall. Userspace could > also handle the MSR using MSR filters (would need to confirm that). > Then userspace could also be in control of the cpuid bit. > > Essentially, I think you could drop most of the host kernel work if > there were generic support for hypercall exiting. Then userspace would > be responsible for all of that. Thoughts on this? > > --Steve This could even go a step further, and use an MSR write from within the guest instead of a hypercall, which could be patched through to userspace without host modification, if I understand the MSR filtering correctly. --Steve
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra wrote: > > On Wed, Feb 24, 2021 at 10:22:33AM -0800, Sean Christopherson wrote: > > On Wed, Feb 24, 2021, Ashish Kalra wrote: > > > # Samples: 19K of event 'kvm:kvm_hypercall' > > > # Event count (approx.): 19573 > > > # > > > # Overhead Command Shared Object Symbol > > > # ... . > > > # > > >100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_emulate_hypercall > > > > > > Out of these 19573 hypercalls, # of page encryption status hcalls are > > > 19479, > > > so almost all hypercalls here are page encryption status hypercalls. > > > > Oof. > > > > > The above data indicates that there will be ~2% more Heavyweight VMEXITs > > > during SEV guest boot if we do page encryption status hypercalls > > > pass-through to host userspace. > > > > > > But, then Brijesh pointed out to me and highlighted that currently > > > OVMF is doing lot of VMEXITs because they don't use the DMA pool to > > > minimize the C-bit toggles, > > > in other words, OVMF bounce buffer does page state change on every DMA > > > allocate and free. > > > > > > So here is the performance analysis after kernel and initrd have been > > > loaded into memory using grub and then starting perf just before booting > > > the kernel. > > > > > > These are the performance #'s after kernel and initrd have been loaded > > > into memory, > > > then perf is attached and kernel is booted : > > > > > > # Samples: 1M of event 'kvm:kvm_userspace_exit' > > > # Event count (approx.): 1081235 > > > # > > > # Overhead Trace output > > > # > > > # > > > 99.77% reason KVM_EXIT_IO (2) > > > 0.23% reason KVM_EXIT_MMIO (6) > > > > > > # Samples: 1K of event 'kvm:kvm_hypercall' > > > # Event count (approx.): 1279 > > > # > > > > > > So as the above data indicates, Linux is only making ~1K hypercalls, > > > compared to ~18K hypercalls made by OVMF in the above use case. 
> > > > > > Does the above adds a prerequisite that OVMF needs to be optimized if > > > and before hypercall pass-through can be done ? > > > > Disclaimer: my math could be totally wrong. > > > > I doubt it's a hard requirement. Assuming a conversative roundtrip time of > > 50k > > cycles, those 18K hypercalls will add well under a 1/2 a second of boot > > time. > > If userspace can push the roundtrip time down to 10k cycles, the overhead is > > more like 50 milliseconds. > > > > That being said, this does seem like a good OVMF cleanup, irrespective of > > this > > new hypercall. I assume it's not cheap to convert a page between encrypted > > and > > decrypted. > > > > Thanks much for getting the numbers! > > Considering the above data and guest boot time latencies > (and potential issues with OVMF and optimizations required there), > do we have any consensus on whether we want to do page encryption > status hypercall passthrough or not ? > > Thanks, > Ashish Thanks for grabbing the data! I am fine with both paths. Sean has stated an explicit desire for hypercall exiting, so I think that would be the current consensus. If we want to do hypercall exiting, this should be in a follow-up series where we implement something more generic, e.g. a hypercall exiting bitmap or hypercall exit list. If we are taking the hypercall exit route, we can drop the kvm side of the hypercall. Userspace could also handle the MSR using MSR filters (would need to confirm that). Then userspace could also be in control of the cpuid bit. Essentially, I think you could drop most of the host kernel work if there were generic support for hypercall exiting. Then userspace would be responsible for all of that. Thoughts on this? --Steve
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Wed, Feb 24, 2021 at 10:22:33AM -0800, Sean Christopherson wrote: > On Wed, Feb 24, 2021, Ashish Kalra wrote: > > # Samples: 19K of event 'kvm:kvm_hypercall' > > # Event count (approx.): 19573 > > # > > # Overhead Command Shared Object Symbol > > # ... . > > # > >100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_emulate_hypercall > > > > Out of these 19573 hypercalls, # of page encryption status hcalls are 19479, > > so almost all hypercalls here are page encryption status hypercalls. > > Oof. > > > The above data indicates that there will be ~2% more Heavyweight VMEXITs > > during SEV guest boot if we do page encryption status hypercalls > > pass-through to host userspace. > > > > But, then Brijesh pointed out to me and highlighted that currently > > OVMF is doing lot of VMEXITs because they don't use the DMA pool to > > minimize the C-bit toggles, > > in other words, OVMF bounce buffer does page state change on every DMA > > allocate and free. > > > > So here is the performance analysis after kernel and initrd have been > > loaded into memory using grub and then starting perf just before booting > > the kernel. > > > > These are the performance #'s after kernel and initrd have been loaded into > > memory, > > then perf is attached and kernel is booted : > > > > # Samples: 1M of event 'kvm:kvm_userspace_exit' > > # Event count (approx.): 1081235 > > # > > # Overhead Trace output > > # > > # > > 99.77% reason KVM_EXIT_IO (2) > > 0.23% reason KVM_EXIT_MMIO (6) > > > > # Samples: 1K of event 'kvm:kvm_hypercall' > > # Event count (approx.): 1279 > > # > > > > So as the above data indicates, Linux is only making ~1K hypercalls, > > compared to ~18K hypercalls made by OVMF in the above use case. > > > > Does the above adds a prerequisite that OVMF needs to be optimized if > > and before hypercall pass-through can be done ? > > Disclaimer: my math could be totally wrong. > > I doubt it's a hard requirement. 
> Assuming a conservative roundtrip time of 50k cycles, those 18K hypercalls will add well under 1/2 a second of boot time. > If userspace can push the roundtrip time down to 10k cycles, the overhead is more like 50 milliseconds. > > That being said, this does seem like a good OVMF cleanup, irrespective of this new hypercall. I assume it's not cheap to convert a page between encrypted and decrypted. > > Thanks much for getting the numbers! Considering the above data and guest boot time latencies (and potential issues with OVMF and optimizations required there), do we have any consensus on whether we want to do page encryption status hypercall passthrough or not? Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Wed, Feb 24, 2021, Ashish Kalra wrote: > # Samples: 19K of event 'kvm:kvm_hypercall' > # Event count (approx.): 19573 > # > # Overhead Command Shared Object Symbol > # > 100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_emulate_hypercall > > Out of these 19573 hypercalls, # of page encryption status hcalls are 19479, > so almost all hypercalls here are page encryption status hypercalls. Oof. > The above data indicates that there will be ~2% more Heavyweight VMEXITs > during SEV guest boot if we do page encryption status hypercalls > pass-through to host userspace. > > But, then Brijesh pointed out to me and highlighted that currently > OVMF is doing a lot of VMEXITs because they don't use the DMA pool to minimize > the C-bit toggles, > in other words, OVMF bounce buffer does page state change on every DMA > allocate and free. > > So here is the performance analysis after kernel and initrd have been > loaded into memory using grub and then starting perf just before booting the > kernel. > > These are the performance #'s after kernel and initrd have been loaded into > memory, > then perf is attached and kernel is booted: > > # Samples: 1M of event 'kvm:kvm_userspace_exit' > # Event count (approx.): 1081235 > # > # Overhead Trace output > # > 99.77% reason KVM_EXIT_IO (2) > 0.23% reason KVM_EXIT_MMIO (6) > > # Samples: 1K of event 'kvm:kvm_hypercall' > # Event count (approx.): 1279 > # > > So as the above data indicates, Linux is only making ~1K hypercalls, > compared to ~18K hypercalls made by OVMF in the above use case. > > Does the above add a prerequisite that OVMF needs to be optimized if > and before hypercall pass-through can be done? Disclaimer: my math could be totally wrong. I doubt it's a hard requirement. Assuming a conservative roundtrip time of 50k cycles, those 18K hypercalls will add well under 1/2 a second of boot time. If userspace can push the roundtrip time down to 10k cycles, the overhead is more like 50 milliseconds.
That being said, this does seem like a good OVMF cleanup, irrespective of this new hypercall. I assume it's not cheap to convert a page between encrypted and decrypted. Thanks much for getting the numbers!
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 18, 2021 at 12:32:47PM -0600, Kalra, Ashish wrote: > [AMD Public Use] > > > -Original Message- > From: Sean Christopherson > Sent: Tuesday, February 16, 2021 7:03 PM > To: Kalra, Ashish > Cc: pbonz...@redhat.com; t...@linutronix.de; mi...@redhat.com; > h...@zytor.com; rkrc...@redhat.com; j...@8bytes.org; b...@suse.de; Lendacky, > Thomas ; x...@kernel.org; k...@vger.kernel.org; > linux-kernel@vger.kernel.org; srutherf...@google.com; > venu.busire...@oracle.com; Singh, Brijesh > Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST > ioctl > > On Thu, Feb 04, 2021, Ashish Kalra wrote: > > From: Brijesh Singh > > > > The ioctl is used to retrieve a guest's shared pages list. > > >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is passed > >through to userspace? That way, userspace could manage the set of pages >in > >whatever data structure they want, and these get/set ioctls go away. > > I will be more concerned about performance hit during guest DMA I/O if the > page encryption status hypercalls are passed through to user-space, > a lot of guest DMA I/O dynamically sets up pages for encryption and then > flips them at DMA completion, so guest I/O will surely take a performance > hit with this pass-through stuff. 
> Here are some rough performance numbers comparing # of heavy-weight VMEXITs compared to # of hypercalls, during a SEV guest boot: (launch of a ubuntu 18.04 guest) # ./perf record -e kvm:kvm_userspace_exit -e kvm:kvm_hypercall -a ./qemu-system-x86_64 -enable-kvm -cpu host -machine q35 -smp 16,maxcpus=64 -m 512M -drive if=pflash,format=raw,unit=0,file=/home/ashish/sev-migration/qemu-5.1.50/OVMF_CODE.fd,readonly -drive if=pflash,format=raw,unit=1,file=OVMF_VARS.fd -drive file=../ubuntu-18.04.qcow2,if=none,id=disk0,format=qcow2 -device virtio-scsi-pci,id=scsi,disable-legacy=on,iommu_platform=true -device scsi-hd,drive=disk0 -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1,policy=0x0 -machine memory-encryption=sev0 -trace events=/tmp/events -nographic -monitor pty -monitor unix:monitor-source,server,nowait -qmp unix:/tmp/qmp-sock,server,nowait -device virtio-rng-pci,disable-legacy=on,iommu_platform=true ... ... root@diesel2540:/home/ashish/sev-migration/qemu-5.1.50# ./perf report # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 981K of event 'kvm:kvm_userspace_exit' # Event count (approx.): 981021 # # Overhead Command Shared Object Symbol # ... .. # 100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_vcpu_ioctl # Samples: 19K of event 'kvm:kvm_hypercall' # Event count (approx.): 19573 # # Overhead Command Shared Object Symbol # ... . # 100.00% qemu-system-x86 [kernel.vmlinux] [k] kvm_emulate_hypercall Out of these 19573 hypercalls, # of page encryption status hcalls are 19479, so almost all hypercalls here are page encryption status hypercalls. The above data indicates that there will be ~2% more Heavyweight VMEXITs during SEV guest boot if we do page encryption status hypercalls pass-through to host userspace. 
But then Brijesh pointed out to me and highlighted that currently OVMF is doing a lot of VMEXITs because it doesn't use the DMA pool to minimize the C-bit toggles; in other words, the OVMF bounce buffer does a page state change on every DMA allocate and free.

So here is the performance analysis after kernel and initrd have been loaded into memory using grub and then starting perf just before booting the kernel.

These are the performance #'s after kernel and initrd have been loaded into memory, then perf is attached and the kernel is booted:

# Samples: 1M of event 'kvm:kvm_userspace_exit'
# Event count (approx.): 1081235
#
# Overhead  Trace output
#
    99.77%  reason KVM_EXIT_IO (2)
     0.23%  reason KVM_EXIT_MMIO (6)

# Samples: 1K of event 'kvm:kvm_hypercall'
# Event count (approx.): 1279
#

So as the above data indicates, Linux is only making ~1K hypercalls, compared to ~18K hypercalls made by OVMF in the above use case.

Does the above add a prerequisite that OVMF needs to be optimized before hypercall pass-through can be done?

Thanks, Ashish
RE: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
[AMD Public Use] -Original Message- From: Sean Christopherson Sent: Tuesday, February 16, 2021 7:03 PM To: Kalra, Ashish Cc: pbonz...@redhat.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; rkrc...@redhat.com; j...@8bytes.org; b...@suse.de; Lendacky, Thomas ; x...@kernel.org; k...@vger.kernel.org; linux-kernel@vger.kernel.org; srutherf...@google.com; venu.busire...@oracle.com; Singh, Brijesh Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl On Thu, Feb 04, 2021, Ashish Kalra wrote: > From: Brijesh Singh > > The ioctl is used to retrieve a guest's shared pages list. >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is passed >through to userspace? That way, userspace could manage the set of pages >in >whatever data structure they want, and these get/set ioctls go away. I will be more concerned about performance hit during guest DMA I/O if the page encryption status hypercalls are passed through to user-space, a lot of guest DMA I/O dynamically sets up pages for encryption and then flips them at DMA completion, so guest I/O will surely take a performance hit with this pass-through stuff. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 18, 2021, Kalra, Ashish wrote: > From: Sean Christopherson > > On Thu, Feb 18, 2021, Kalra, Ashish wrote: > > From: Sean Christopherson > > > > On Wed, Feb 17, 2021, Kalra, Ashish wrote: > > >> From: Sean Christopherson On Thu, Feb 04, 2021, > > >> Ashish Kalra wrote: > > >> > From: Brijesh Singh > > >> > > > >> > The ioctl is used to retrieve a guest's shared pages list. > > >> > > >> >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS > > >> >is passed through to userspace? That way, userspace could manage > > >> >the set of pages >in whatever data structure they want, and these > > >> >get/set ioctls go away. > > >> > > >> What is the advantage of passing KVM_HC_PAGE_ENC_STATUS through to > > >> user-space ? > > >> > > >> As such it is just a simple interface to get the shared page list > > >> via the get/set ioctl's. simply an array is passed to these ioctl > > >> to get/set the shared pages list. > >> > >> > It eliminates any probability of the kernel choosing the wrong data > >> > structure, and it's two fewer ioctls to maintain and test. > >> > >> The set shared pages list ioctl cannot be avoided as it needs to be > >> issued to setup the shared pages list on the migrated VM, it cannot be > >> achieved by passing KVM_HC_PAGE_ENC_STATUS through to user-space. > > >Why's that? AIUI, KVM doesn't do anything with the list other than pass it > >back to userspace. Assuming that's the case, userspace can just hold onto > >the list >for the next migration. > > KVM does use it as part of the SEV DBG_DECRYPT API, within sev_dbg_decrypt() > to check if the guest page(s) are encrypted or not, and accordingly use it to > decide whether to decrypt the guest page(s) and return that back to > user-space or just return it as it is. Why is handling shared memory KVM's responsibility? Userspace shouldn't be asking KVM to decrypt memory it knows isn't encrypted.
My understanding is that bogus decryption won't harm the kernel, it will only corrupt the guest. In other words, patch 16 can be dropped if managing the set of shared pages is punted to userspace.
RE: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
[AMD Public Use] -Original Message- From: Sean Christopherson Sent: Thursday, February 18, 2021 10:39 AM To: Kalra, Ashish Cc: pbonz...@redhat.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; rkrc...@redhat.com; j...@8bytes.org; b...@suse.de; Lendacky, Thomas ; x...@kernel.org; k...@vger.kernel.org; linux-kernel@vger.kernel.org; srutherf...@google.com; venu.busire...@oracle.com; Singh, Brijesh Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl On Thu, Feb 18, 2021, Kalra, Ashish wrote: > From: Sean Christopherson > > On Wed, Feb 17, 2021, Kalra, Ashish wrote: > >> From: Sean Christopherson On Thu, Feb 04, 2021, > >> Ashish Kalra wrote: > >> > From: Brijesh Singh > >> > > >> > The ioctl is used to retrieve a guest's shared pages list. > >> > >> >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS > >> >is passed through to userspace? That way, userspace could manage > >> >the set of pages >in whatever data structure they want, and these get/set > >> >ioctls go away. > >> > >> What is the advantage of passing KVM_HC_PAGE_ENC_STATUS through to > >> user-space ? > >> > >> As such it is just a simple interface to get the shared page list > >> via the get/set ioctl's. simply an array is passed to these ioctl > >> to get/set the shared pages list. >> >> > It eliminates any probability of the kernel choosing the wrong data >> > structure, and it's two fewer ioctls to maintain and test. >> >> The set shared pages list ioctl cannot be avoided as it needs to be >> issued to setup the shared pages list on the migrated VM, it cannot be >> achieved by passing KVM_HC_PAGE_ENC_STATUS through to user-space. >Why's that? AIUI, KVM doesn't do anything with the list other than pass it >back to userspace. Assuming that's the case, userspace can just hold onto the >list >for the next migration. 
KVM does use it as part of the SEV DBG_DECRYPT API, within sev_dbg_decrypt(), to check if the guest page(s) are encrypted or not, and accordingly uses it to decide whether to decrypt the guest page(s) and return that back to user-space or just return it as it is. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 18, 2021, Kalra, Ashish wrote: > From: Sean Christopherson > > On Wed, Feb 17, 2021, Kalra, Ashish wrote: > >> From: Sean Christopherson On Thu, Feb 04, 2021, > >> Ashish Kalra wrote: > >> > From: Brijesh Singh > >> > > >> > The ioctl is used to retrieve a guest's shared pages list. > >> > >> >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is > >> >passed through to userspace? That way, userspace could manage the > >> >set of pages >in whatever data structure they want, and these get/set > >> >ioctls go away. > >> > >> What is the advantage of passing KVM_HC_PAGE_ENC_STATUS through to > >> user-space ? > >> > >> As such it is just a simple interface to get the shared page list via > >> the get/set ioctl's. simply an array is passed to these ioctl to > >> get/set the shared pages list. > > > It eliminates any probability of the kernel choosing the wrong data > > structure, and it's two fewer ioctls to maintain and test. > > The set shared pages list ioctl cannot be avoided as it needs to be issued to > setup the shared pages list on the migrated VM, it cannot be achieved by > passing KVM_HC_PAGE_ENC_STATUS through to user-space. Why's that? AIUI, KVM doesn't do anything with the list other than pass it back to userspace. Assuming that's the case, userspace can just hold onto the list for the next migration. > So it makes sense to add both get/set shared pages list ioctl, passing > through to user-space is just adding more complexity without any significant > gains. No, it reduces complexity for KVM by avoiding locking and memslot dependencies.
RE: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
[AMD Public Use] -Original Message- From: Sean Christopherson Sent: Wednesday, February 17, 2021 10:13 AM To: Kalra, Ashish Cc: pbonz...@redhat.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; rkrc...@redhat.com; j...@8bytes.org; b...@suse.de; Lendacky, Thomas ; x...@kernel.org; k...@vger.kernel.org; linux-kernel@vger.kernel.org; srutherf...@google.com; venu.busire...@oracle.com; Singh, Brijesh Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl On Wed, Feb 17, 2021, Kalra, Ashish wrote: >> From: Sean Christopherson On Thu, Feb 04, 2021, >> Ashish Kalra wrote: >> > From: Brijesh Singh >> > >> > The ioctl is used to retrieve a guest's shared pages list. >> >> >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is >> >passed through to userspace? That way, userspace could manage the >> >set of pages >in whatever data structure they want, and these get/set >> >ioctls go away. >> >> What is the advantage of passing KVM_HC_PAGE_ENC_STATUS through to >> user-space ? >> >> As such it is just a simple interface to get the shared page list via >> the get/set ioctl's. simply an array is passed to these ioctl to >> get/set the shared pages list. > It eliminates any probability of the kernel choosing the wrong data > structure, and it's two fewer ioctls to maintain and test. The set shared pages list ioctl cannot be avoided as it needs to be issued to setup the shared pages list on the migrated VM, it cannot be achieved by passing KVM_HC_PAGE_ENC_STATUS through to user-space. So it makes sense to add both get/set shared pages list ioctl, passing through to user-space is just adding more complexity without any significant gains. > >Also, aren't there plans for an in-guest migration helper? If so, do > >we have any idea what that interface will look like? E.g. if we're > >going to end up with a full >fledged driver in the guest, why not > >bite the bullet now and bypass KVM entirely? 
> > Even the in-guest migration helper will be using page encryption > status hypercalls, so some interface is surely required. >If it's a driver with a more extensive interface, then the hypercalls can be >replaced by a driver operation. That's obviously a big if, though. > Also the in-guest migration will be mainly an OVMF component, won't > really be a full fledged kernel driver in the guest. >Is there code and/or a description of what the proposed helper would look like? Not right now, there are prototype(s) under development, I assume they will be posted upstream soon. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Wed, Feb 17, 2021, Kalra, Ashish wrote: > From: Sean Christopherson > On Thu, Feb 04, 2021, Ashish Kalra wrote: > > From: Brijesh Singh > > > > The ioctl is used to retrieve a guest's shared pages list. > > >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is passed > >through to userspace? That way, userspace could manage the set of pages >in > >whatever data structure they want, and these get/set ioctls go away. > > What is the advantage of passing KVM_HC_PAGE_ENC_STATUS through to user-space > ? > > As such it is just a simple interface to get the shared page list via the > get/set ioctl's. simply an array is passed to these ioctl to get/set the > shared pages list. It eliminates any probability of the kernel choosing the wrong data structure, and it's two fewer ioctls to maintain and test. > >Also, aren't there plans for an in-guest migration helper? If so, do we > >have any idea what that interface will look like? E.g. if we're going to > >end up with a full fledged driver in the guest, why not bite the bullet now > >and bypass KVM entirely? > > Even the in-guest migration helper will be using page encryption status > hypercalls, so some interface is surely required. If it's a driver with a more extensive interface, then the hypercalls can be replaced by a driver operation. That's obviously a big if, though. > Also the in-guest migration will be mainly an OVMF component, won't really > be a full fledged kernel driver in the guest. Is there code and/or a description of what the proposed helper would look like?
RE: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
[AMD Public Use] -Original Message- From: Sean Christopherson Sent: Tuesday, February 16, 2021 7:03 PM To: Kalra, Ashish Cc: pbonz...@redhat.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; rkrc...@redhat.com; j...@8bytes.org; b...@suse.de; Lendacky, Thomas ; x...@kernel.org; k...@vger.kernel.org; linux-kernel@vger.kernel.org; srutherf...@google.com; venu.busire...@oracle.com; Singh, Brijesh Subject: Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl On Thu, Feb 04, 2021, Ashish Kalra wrote: > From: Brijesh Singh > > The ioctl is used to retrieve a guest's shared pages list. >What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is passed >through to userspace? That way, userspace could manage the set of pages >in >whatever data structure they want, and these get/set ioctls go away. What is the advantage of passing KVM_HC_PAGE_ENC_STATUS through to user-space ? As such it is just a simple interface to get the shared page list via the get/set ioctl's. simply an array is passed to these ioctl to get/set the shared pages list. >Also, aren't there plans for an in-guest migration helper? If so, do we have >any idea what that interface will look like? E.g. if we're going to end up >with a full >fledged driver in the guest, why not bite the bullet now and >bypass KVM entirely? Even the in-guest migration helper will be using page encryption status hypercalls, so some interface is surely required. Also the in-guest migration will be mainly an OVMF component, won't really be a full fledged kernel driver in the guest. Thanks, Ashish
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 04, 2021, Ashish Kalra wrote:
> From: Brijesh Singh
>
> The ioctl is used to retrieve a guest's shared pages list.
>
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: "H. Peter Anvin"
> Cc: Paolo Bonzini
> Cc: "Radim Krčmář"

AFAIK, Radim is no longer involved with KVM, and his RedHat email is bouncing. Probably best to drop his Cc in future revisions.
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On Thu, Feb 04, 2021, Ashish Kalra wrote:
> From: Brijesh Singh
>
> The ioctl is used to retrieve a guest's shared pages list.

What's the performance hit to boot time if KVM_HC_PAGE_ENC_STATUS is passed through to userspace? That way, userspace could manage the set of pages in whatever data structure they want, and these get/set ioctls go away.

Also, aren't there plans for an in-guest migration helper? If so, do we have any idea what that interface will look like? E.g. if we're going to end up with a full fledged driver in the guest, why not bite the bullet now and bypass KVM entirely?
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
Hello Tom,

On Thu, Feb 04, 2021 at 10:14:37AM -0600, Tom Lendacky wrote:
> On 2/3/21 6:39 PM, Ashish Kalra wrote:
> > From: Brijesh Singh
> >
> > The ioctl is used to retrieve a guest's shared pages list.
> >
> > ...
> >
> > +int svm_get_shared_pages_list(struct kvm *kvm,
> > +			      struct kvm_shared_pages_list *list)
> > +{
> > +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > +	struct shared_region_array_entry *array;
> > +	struct shared_region *pos;
> > +	int ret, nents = 0;
> > +	unsigned long sz;
> > +
> > +	if (!sev_guest(kvm))
> > +		return -ENOTTY;
> > +
> > +	if (!list->size)
> > +		return -EINVAL;
> > +
> > +	if (!sev->shared_pages_list_count)
> > +		return put_user(0, list->pnents);
> > +
> > +	sz = sev->shared_pages_list_count * sizeof(struct shared_region_array_entry);
> > +	if (sz > list->size)
> > +		return -E2BIG;
> > +
> > +	array = kmalloc(sz, GFP_KERNEL);
> > +	if (!array)
> > +		return -ENOMEM;
> > +
> > +	mutex_lock(&kvm->lock);
>
> I think this lock needs to be taken before the memory size is calculated. If
> the list is expanded after obtaining the size and before taking the lock,
> you will run off the end of the array.

Yes, as the page encryption status hypercalls can happen simultaneously with this ioctl, this is an issue, so I will take the lock before the memory size is calculated.

Thanks,
Ashish

> > +	list_for_each_entry(pos, &sev->shared_pages_list, list) {
> > +		array[nents].gfn_start = pos->gfn_start;
> > +		array[nents++].gfn_end = pos->gfn_end;
> > +	}
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	ret = -EFAULT;
> > +	if (copy_to_user(list->buffer, array, sz))
> > +		goto out;
> > +	if (put_user(nents, list->pnents))
> > +		goto out;
> > +	ret = 0;
> > +out:
> > +	kfree(array);
> > +	return ret;
> > +}
> > +
Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
On 2/3/21 6:39 PM, Ashish Kalra wrote:
> From: Brijesh Singh
>
> The ioctl is used to retrieve a guest's shared pages list.
>
> ...
>
> +int svm_get_shared_pages_list(struct kvm *kvm,
> +			      struct kvm_shared_pages_list *list)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> +	struct shared_region_array_entry *array;
> +	struct shared_region *pos;
> +	int ret, nents = 0;
> +	unsigned long sz;
> +
> +	if (!sev_guest(kvm))
> +		return -ENOTTY;
> +
> +	if (!list->size)
> +		return -EINVAL;
> +
> +	if (!sev->shared_pages_list_count)
> +		return put_user(0, list->pnents);
> +
> +	sz = sev->shared_pages_list_count * sizeof(struct shared_region_array_entry);
> +	if (sz > list->size)
> +		return -E2BIG;
> +
> +	array = kmalloc(sz, GFP_KERNEL);
> +	if (!array)
> +		return -ENOMEM;
> +
> +	mutex_lock(&kvm->lock);

I think this lock needs to be taken before the memory size is calculated. If the list is expanded after obtaining the size and before taking the lock, you will run off the end of the array.

Thanks,
Tom

> +	list_for_each_entry(pos, &sev->shared_pages_list, list) {
> +		array[nents].gfn_start = pos->gfn_start;
> +		array[nents++].gfn_end = pos->gfn_end;
> +	}
> +	mutex_unlock(&kvm->lock);
> +
> +	ret = -EFAULT;
> +	if (copy_to_user(list->buffer, array, sz))
> +		goto out;
> +	if (put_user(nents, list->pnents))
> +		goto out;
> +	ret = 0;
> +out:
> +	kfree(array);
> +	return ret;
> +}
> +
[PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl
From: Brijesh Singh

The ioctl is used to retrieve a guest's shared pages list.

Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Paolo Bonzini
Cc: "Radim Krčmář"
Cc: Joerg Roedel
Cc: Borislav Petkov
Cc: Tom Lendacky
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Brijesh Singh
Co-developed-by: Ashish Kalra
Signed-off-by: Ashish Kalra
---
 Documentation/virt/kvm/api.rst  | 24 ++++++++++
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/svm/sev.c          | 49 +++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/x86.c              | 12 ++++
 include/uapi/linux/kvm.h        |  9 ++
 7 files changed, 98 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 99ceb978c8b0..59ef537c0cdd 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4677,6 +4677,30 @@ This ioctl resets VCPU registers and control structures according to the
 clear cpu reset definition in the POP. However, the cpu is not put
 into ESA mode. This reset is a superset of the initial reset.

+4.125 KVM_GET_SHARED_PAGES_LIST (vm ioctl)
+------------------------------------------
+
+:Capability: basic
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_shared_pages_list (in/out)
+:Returns: 0 on success, -1 on error
+
+/* for KVM_GET_SHARED_PAGES_LIST */
+struct kvm_shared_pages_list {
+	int __user *pnents;
+	void __user *buffer;
+	__u32 size;
+};
+
+The encrypted VMs have the concept of private and shared pages. The private
+pages are encrypted with the guest-specific key, while the shared pages may
+be encrypted with the hypervisor key. The KVM_GET_SHARED_PAGES_LIST can
+be used to get guest's shared/unencrypted memory regions list.
+This list can be used during the guest migration. If the page
+is private then the userspace need to use SEV migration commands to transmit
+the page.
+
 4.125 KVM_S390_PV_COMMAND
 -------------------------

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2da5f5e2a10e..cd354d830e13 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1303,6 +1303,8 @@ struct kvm_x86_ops {
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 	int (*page_enc_status_hc)(struct kvm *kvm, unsigned long gpa,
 				  unsigned long sz, unsigned long mode);
+	int (*get_shared_pages_list)(struct kvm *kvm,
+				     struct kvm_shared_pages_list *list);
 };

 struct kvm_x86_nested_ops {

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 55c628df5155..701d74c8b15b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -50,6 +50,11 @@ struct shared_region {
 	unsigned long gfn_start, gfn_end;
 };

+struct shared_region_array_entry {
+	unsigned long gfn_start;
+	unsigned long gfn_end;
+};
+
 static int sev_flush_asids(void)
 {
 	int ret, error = 0;
@@ -1622,6 +1627,50 @@ int svm_page_enc_status_hc(struct kvm *kvm, unsigned long gpa,
 	return ret;
 }

+int svm_get_shared_pages_list(struct kvm *kvm,
+			      struct kvm_shared_pages_list *list)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct shared_region_array_entry *array;
+	struct shared_region *pos;
+	int ret, nents = 0;
+	unsigned long sz;
+
+	if (!sev_guest(kvm))
+		return -ENOTTY;
+
+	if (!list->size)
+		return -EINVAL;
+
+	if (!sev->shared_pages_list_count)
+		return put_user(0, list->pnents);
+
+	sz = sev->shared_pages_list_count * sizeof(struct shared_region_array_entry);
+	if (sz > list->size)
+		return -E2BIG;
+
+	array = kmalloc(sz, GFP_KERNEL);
+	if (!array)
+		return -ENOMEM;
+
+	mutex_lock(&kvm->lock);
+	list_for_each_entry(pos, &sev->shared_pages_list, list) {
+		array[nents].gfn_start = pos->gfn_start;
+		array[nents++].gfn_end = pos->gfn_end;
+	}
+	mutex_unlock(&kvm->lock);
+
+	ret = -EFAULT;
+	if (copy_to_user(list->buffer, array, sz))
+		goto out;
+	if (put_user(nents, list->pnents))
+		goto out;
+	ret = 0;
+out:
+	kfree(array);
+	return ret;
+}
+
 int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_sev_cmd sev_cmd;

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bb249ec625fc..533ce47ff158 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4538,6 +4538,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,

 	.page_enc_status_hc = svm_page_enc_status_hc,
+	.get_shared_pages_list =